Pipeline to infer poly(A) site clusters through processing of 3' end sequencing libraries prepared according to various protocols. The pipeline was used for the generation of the PolyASite atlas.
Pipeline schematic
Further information on the implemented processing can be found below and on PolyASite.
Requirements
The pipeline was tested in an HPC environment managed by Slurm.
Conda environment
We recommend using a Conda environment that contains the necessary software.
The environment was created with:
conda env create \
--name polyA_atlas_pipeline \
--file snakemake_run_env_requirements.yaml
Activate the environment with:
source activate polyA_atlas_pipeline
Deactivate the environment with:
source deactivate
Prerequisites
The pipeline has three central elements: the Snakefile, a config file, and a sample table:
- The Snakefile does not need to be changed unless bugs, intended updates, etc. require modifications.
- The config file (called config.yaml) needs to be adjusted according to your needs. It requires a set of samples, an organism, a genome version and an annotation version. Organism and genome/annotation versions are edited directly in the config file; samples are provided indirectly by indicating the path to the sample table. An example config file can be found here: tests/EXAMPLE_config.yaml (a rough sketch is also shown after this list). Please see the Run the pipeline section below for how to run Snakemake without modifying the config file, specifying the required information as command-line arguments instead.
- The sample table lists sample-specific information required for a successful run of the pipeline. Follow the format in tests/EXAMPLE_samples.tsv and include one line for each sample to be processed.
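For orientation, the following here-document sketches what a minimal config.yaml could contain. Only the organism and genome keys are confirmed by the command-line example in the Run section; the remaining key names and all values are illustrative assumptions, so use tests/EXAMPLE_config.yaml as the authoritative template.

cat > config.yaml <<'EOF'
# Hypothetical sketch: key names other than "organism" and "genome" are assumptions;
# copy tests/EXAMPLE_config.yaml for the real layout.
samples: "tests/EXAMPLE_samples.tsv"   # path to the sample table
organism: "HomoSapiens"
genome: "GRCh38.90"
annotation: "90"                       # annotation version
EOF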
Run the pipeline
Local
Local execution is not recommended and should only be used for testing purposes.
snakemake \
-p \
--use-singularity \
--singularity-args "--bind ${PWD}" \
--configfile config.yaml \
&>> run_update.Organism_genomeVersion_annotationVersion.log
Slurm
snakemake \
-p \
--use-singularity \
--singularity-args "--bind ${PWD}" \
--configfile config.yaml \
--cluster-config cluster_config.json \
--jobscript jobscript.sh \
--cores 500 \
--local-cores 10 \
--cluster "sbatch --cpus-per-task {cluster.threads} \
--mem {cluster.mem} --qos {cluster.queue} \
--time {cluster.time} -o {params.cluster_log} \
-p [NODE_ID] --export=JOB_NAME={rule} \
--open-mode=append" \
&>> run_update.Organism_genomeVersion_annotationVersion.log
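The command above refers to a cluster_config.json via the {cluster.threads}, {cluster.mem}, {cluster.queue} and {cluster.time} placeholders. A minimal, hypothetical sketch of such a file is shown below; the resource values are assumptions that need to be adapted to your HPC setup and, if required, overridden per rule.

cat > cluster_config.json <<'EOF'
{
    "__default__": {
        "threads": "1",
        "mem": "4G",
        "queue": "6hours",
        "time": "06:00:00"
    }
}
EOF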
General notes on running the pipeline
Instant config changes
If you do not want to change the base config file, you can also specify the appropriate values in the Snakemake command itself, e.g.:
snakemake \
-p \
--use-singularity \
--singularity-args "--bind ${PWD}" \
--configfile config.yaml \
--config organism=HomoSapiens genome=GRCh38.90 atlas.release_name:r2.0 -- \
&>> run_update.Organism_genomeVersion_annotationVersion.log
Running only parts of the pipeline
With the Snakemake option --until you can specify a target rule for running the pipeline. This is useful if you only want to run a subset of rules. For example, the pipeline includes the creation of bigWig and track-info files for displaying the data in the UCSC Genome Browser. If you don't need these files, run with --until complete_clustering. For examples see here.
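For instance, a local test run that stops once clustering is complete could look like this (same flags as the Local example above, plus --until):

snakemake \
-p \
--use-singularity \
--singularity-args "--bind ${PWD}" \
--configfile config.yaml \
--until complete_clustering \
&>> run_update.Organism_genomeVersion_annotationVersion.log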
Generate a graph for the preprocessing of an individual sample from a specific protocol
When a target file is provided in the Snakemake command, Snakemake will run the pipeline only until the point at which the desired file is created. This can be used to generate overview graphs on the processing of samples from specific protocols to check their processing steps. For example:
snakemake \
-p \
--configfile config.yaml \
--dag \
samples/counts/SRX517313_GRCh38.90.ip3pSites.out \
| dot \
-T png \
> graph.SAPAS.png
Protocol-specific notes
3' READS
Pre-processing involves:
- Filtering of reads based on the 5' configuration
- 3' adapter trimming
- Reverse complementing
- 3' trimming of potentially remaining As from the poly(A) tail

Only reads that start with a specified number of random nucleotides followed by two Ts are considered; the number of random nucleotides has to be extracted from the corresponding GEO/SRA entries or publication. See here for more information.
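As a quick sanity check of the 5' configuration, you can count how many of the first reads match the expected pattern. The sample path and the number of random nucleotides (4 in this sketch) are placeholders; take the actual number from the GEO/SRA entry or the publication.

zcat samples/SAMPLE/SAMPLE.fa.gz \
| head -n 10000 \
| grep -c -E '^[ACGTN]{4}TT'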
According to this paper, sequencing can be done in the sense or the antisense direction. The samples currently processed here were sequenced in the antisense direction. Future samples should be checked carefully to decide whether the current settings are appropriate.
SAPAS
Pre-processing involves:
- Combined 5' and 3' adapter trimming
- Trimming of remaining Cs at the 3' end (they are a result of template-switching reverse transcription)
- Reverse complementing
Sequencing libraries are prepared such that sequencing can be done either in antisense direction (Illumina) or in sense direction (454). So far, only samples from Illumina sequencing have been processed. If "sense direction samples" need to be processed, the pipeline must be adapted accordingly.
In the supplementary material of the APASdb paper, the authors state that they only consider reads that have the expected 5' linker sequence 5'-TTTTCTTTTTTCTTTTTT-3'. However, manual comparison of the first reads from an old sample (SRX026584) with one from the mentioned publication revealed that the abundance of this linker is very low in the samples from the APASdb publication. Therefore, we have decided not to be too strict with the 5' linker. Very often, only poly(T)s are found; therefore, for the newer samples only a stretch of Ts is given as 5' adapter and trimmed together with the 3' adapter.
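A quick spot check (analogous to the 3P-Seq check further below) can show how often the published linker versus a plain poly(T) stretch occurs among the first reads; the sample path is a placeholder.

zcat samples/SAMPLE/SAMPLE.fa.gz \
| head -n 10000 \
| grep -c TTTTCTTTTTTCTTTTTT

zcat samples/SAMPLE/SAMPLE.fa.gz \
| head -n 10000 \
| grep -c -E '^T{10,}'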
A-seq2
Processing is done according to this protocol, but without using the first nucleotide as barcode information.
A-seq
Pre-processing involves:
- 3' adapter trimming
- Valid reads filtering with additional consideration of a maximum read length (see below)
3'-seq (Mayr)
Pre-processing involves:
- 3' adapter trimming
- Additional trimming of As and Ns at the 3' end
- Valid reads filtering with additional consideration of a maximum read length (see below)
PAS-Seq
Pre-processing involves:
- Trimming of poly(T) at the 5' end
- 3' adapter trimming
- Trimming of additional Cs at the 3' end
- Reverse complementing
DRS
Current pre-processing involves:
- Reverse complementing
- Correction by 1 nt to obtain the true 3' end position (note that this is directly encoded in the Snakefile, not the config file)
The protocol facilitates direct sequencing of the RNA 3' end. Due to an initial T-fill step that involves the incorporation of a blocking nucleotide (anything except for a T), sequencing actually begins one nucleotide upstream of the RNA's 3'-most nucleotide. Therefore, a correction of 1 nt in the downstream direction of the read's reverse complement is necessary to obtain the 3' end (see "Extended Experimental Procedures" in Ozsolak et al. for more information).
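Conceptually, the correction simply shifts each reverse-complemented 3' end site by one nucleotide in the downstream direction. The following is only a rough illustration on a hypothetical BED file of single-nucleotide 3' end sites (file names are placeholders; the actual correction is applied inside the pipeline, not with this command):

zcat end_sites.bed.gz \
| awk 'BEGIN {OFS="\t"}
       $6 == "+" {$2 += 1; $3 += 1}   # plus strand: downstream means larger coordinates
       $6 == "-" {$2 -= 1; $3 -= 1}   # minus strand: downstream means smaller coordinates
       {print}' \
| gzip > end_sites.corrected.bed.gz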
PolyA-seq
Pre-processing involves:
- 3' adapter trimming
- Reverse complementing
3P-Seq
Pre-processing involves:
- Reverse complementing, if necessary
- Filtering reads: only proceed with reads with at least 2 As at the 3' end
- Removal of additional As at the 3' end that might remain from the poly(A) tail
Note that processing for samples of this protocol is sample-specific. In particular, only a subset of samples requires reverse complementing; hence, each sample has to be checked manually to determine whether reverse complementing is required. Once this information is provided in the design file, the pipeline will process the samples accordingly.
An easy way to check whether a file needs to be reverse complemented is to count the occurrence of the poly(A) signal and of its reverse complement in the first reads:
zcat samples/GSM1268942/GSM1268942.fa.gz \
| head -n10000 \
| tail -n1000 \
| grep TTTATT | wc -l

zcat samples/GSM1268942/GSM1268942.fa.gz \
| head -n10000 \
| tail -n1000 \
| grep AATAAA | wc -l

Comparing these numbers should give a clear preference for one of the two signals.
2P-Seq
Pre-processing involves:
- 3' adapter trimming
- Reverse complementing
Maximum read length
Samples prepared with the protocols 3'-seq (Mayr) and A-seq have a restriction on the maximum read length for processed reads to count as valid. As these protocols require sequencing in the sense direction, the length restriction ensures that the 3' end of the transcript is reached.
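To get a feel for whether a sample respects such a cutoff, the read-length distribution can be inspected directly (the sample path is a placeholder):

zcat samples/SAMPLE/SAMPLE.fa.gz \
| awk '!/^>/ {print length($0)}' \
| sort -n \
| uniq -c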
Code Snippets
shell:
    '''
    (gffread \
    -w {output.fasta} \
    -g {input.fasta} \
    {input.gtf}) \
    &> {log}
    '''

shell:
    '''
    (python {input.script} \
    --trim \
    -i {input.fasta} \
    -o {output.fasta}) \
    &> {log}
    '''

shell:
    '''
    (segemehl.x \
    -x {output.idx} \
    -d {input.fasta}) \
    &> {log}
    '''

shell:
    "(segemehl.x \
    -x {output.idx} \
    -d {input.sequence}) \
    &> {log}"

shell:
    '''
    (bash {input.script} \
    -f {input.gtf} \
    -c 3 \
    -p exon \
    -o {output.exons} ) \
    &> {log}
    '''

shell:
    '''
    (Rscript {input.script} \
    --gtf {input.exons} \
    -o {output.exons}) \
    &> {log}
    '''

shell:
    '''
    (samtools dict \
    -o {output.header} {input.genome}) \
    &> {log}
    '''
shell:
    '''
    segemehl.x \
    -i {input.idx} \
    -d {input.genome} \
    -t {threads} \
    -q {input.reads} \
    -outfile {output.gmap}
    '''

shell:
    '''
    samtools view \
    {input.gmap} \
    > {output.gmap} \
    2> {log}
    '''

shell:
    '''
    segemehl.x \
    -i {input.idx} \
    -d {input.transcriptome} \
    -t {threads} \
    -q {input.reads} \
    -outfile {output.tmap}
    '''

shell:
    '''
    samtools view \
    {input.tmap} \
    > {output.tmap} \
    2> {log}
    '''

shell:
    '''
    (perl {input.script} \
    --in {input.tmap} \
    --exons {input.exons} \
    --out {output.genout}) \
    &> {log}
    '''

shell:
    '''
    (cat {input.header} \
    {input.t2gmap} \
    {input.gmap} \
    > {output.catmaps}) \
    &> {log}
    '''
shell:
    '''
    (samtools sort \
    -n \
    -o {output.sorted} \
    {input.sam}) \
    &> {log}
    '''

shell:
    '''
    (perl {input.script} \
    --print-header \
    --keep-mm \
    --in {input.sorted} \
    --out {output.remove_inf}) \
    &> {log}
    '''

shell:
    '''
    (samtools view \
    -b {input.remove_inf} \
    > {output.bam}) \
    &> {log}
    '''

shell:
    '''
    (samtools sort \
    {input.bam} \
    > {output.bam}) \
    &> {log}
    '''

shell:
    '''
    (samtools view {input.bam} \
    | python {input.script} \
    --processors {threads} \
    | gzip > {output.reads_bed}) 2>> {log}
    '''
shell:
    '''
    mkdir -p {params.cluster_samples_log}
    mkdir -p {params.cluster_countings_log}
    '''

shell:
    '''
    mkdir -p {params.cluster_atlas_log}
    '''

shell:
    '''
    wget -O {output.temp_genome} \
    {params.url} \
    &> /dev/null &&
    gzip -cd {output.temp_genome} \
    > {output.genome} &&
    sed 's/\s.*//' {output.genome} \
    > {output.clean}
    '''

shell:
    '''
    wget -O {output.temp_anno} \
    {params.url} \
    &> /dev/null &&
    gzip -cd {output.temp_anno} \
    > {output.anno}
    '''

shell:
    '''
    perl {input.script} \
    --type_id={params.type_id} \
    {params.types} \
    {params.tr_supp_level_id} {params.tr_supp_level} \
    {input.anno} \
    > {output.filtered_anno}
    '''
shell:
    '''
    python3 {input.script} \
    --srr_id {params.srr_id} \
    --outdir {params.outdir} --paired \
    2> {log}
    '''

shell:
    '''
    python3 {input.script} \
    --srr_id {params.srr_id} \
    --outdir {params.outdir} \
    2> {log}
    '''

shell:
    '''
    cd {params.file_dir}
    IFS=',' read -ra SRR <<< "{params.sample_srr}"
    if [[ "${{#SRR[@]}}" > "1" ]];then
        first_file="${{SRR[0]}}.fastq.gz"
        for i in $(seq 1 $((${{#SRR[@]}}-1))); do curr_file="${{SRR[$i]}}.fastq.gz"; cat ${{curr_file}} >> ${{first_file}};done
    fi
    ln -fs {params.first_srr}.fastq.gz {params.sample_id}.fq.gz
    cd -
    '''

shell:
    '''
    (zcat {input.sample_fq} \
    | {input.script} \
    | fastx_renamer -n COUNT -z \
    > {output.sample_fa}) 2> {log}
    '''
run:
    import gzip
    n = 0
    with gzip.open(input.sample_fa, "rt") as infile:
        n = sum([1 for line in infile if line.startswith(">")])
    with open(output.raw_cnt, "w") as out:
        out.write("reads.raw.nr\t%i\n" % n)

shell:
    '''
    (zcat {input.sample_fa} \
    | perl {input.script} \
    --adapter={params.adapt} \
    | gzip > {output.selected_5p}) 2> {log}
    '''

shell:
    '''
    (zcat {input.sample_fa} \
    | perl {input.script} \
    --adapter={params.adapt} \
    | gzip > {output.trimmed_5p}) 2> {log}
    '''

shell:
    '''
    zcat {input.sample_fa} \
    | perl {input.script} \
    --minLen={params.minLen} \
    --nuc={params.adapt} \
    | gzip > {output.nuc_trimmed}
    '''

shell:
    '''
    cutadapt \
    -g {params.adapt} \
    --minimum-length {params.minLen} \
    -o {output.no_5p_adapter} \
    {input.in_fa} \
    &> {log}
    '''
shell:
    '''
    cutadapt \
    -a {params.adapt} \
    {params.five_p_adapt} \
    --minimum-length {params.minLen} \
    -o {output.no_3p_adapter} \
    {input.in_fa} \
    &> {log}
    '''

shell:
    '''
    zcat {input.no_3p_adapter} \
    | perl {input.script} \
    --minLen={params.minLen} \
    --nuc={params.adapt} \
    | gzip > {output.nuc_trimmed}
    '''

shell:
    '''
    zcat {input.input_seqs} \
    | fastx_reverse_complement -z \
    -o {output.rev_cmpl} \
    &> {log}
    '''

shell:
    '''
    (zcat {input.in_fa} \
    | perl {input.script} \
    --adapter={params.adapt} \
    | gzip > {output.selected_3p}) 2> {log}
    '''

shell:
    '''
    cutadapt \
    --adapter {params.adapt} \
    --minimum-length {params.minLen} \
    --overlap {params.min_overlap} \
    -e {params.error_rate} \
    -o {output.no_polyAtail} \
    {input.no_3p_adapter} \
    &> {log}
    '''
run:
    import gzip
    n = 0
    with gzip.open(input.in_fa, "rt") as infile:
        n = sum([1 for line in infile if line.startswith(">")])
    with open(output.trimmed_cnt, "w") as out:
        with open(input.prev_cnt, "r") as cnt:
            out.write("%s" % cnt.read() )
        out.write("reads.trim.out\t%i\n" % n)

shell:
    '''
    (zcat {input.valid_rds_in} \
    | perl {input.script_filter} \
    --max {params.maxN} --nuc N \
    | perl {input.script_filter} \
    --max {params.maxAcontent} --nuc A \
    | perl {input.script_last} \
    | gzip > {output.valid_reads}) 2> {log}
    '''

shell:
    '''
    (zcat {input.valid_rds_in} \
    | perl {input.script_len_filter} --max={params.maxLen} \
    | perl {input.script_filter} \
    --max={params.maxN} --nuc=N \
    | perl {input.script_filter} \
    --max={params.maxAcontent} --nuc=A \
    | perl {input.script_last} \
    | gzip > {output.valid_reads}) 2> {log}
    '''

run:
    import gzip
    n = 0
    with gzip.open(input.in_fa, "rt") as infile:
        n = sum([1 for line in infile if line.startswith(">")])
    with open(output.valid_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("reads.valid.nr\t%i\n" % n)
shell:
    '''
    (python {input.script} \
    --bed {input.reads_bed} \
    | gzip > {output.unique_bed}) 2>> {log}
    '''

run:
    import gzip
    unique = 0
    mapped = {}
    with gzip.open(input.reads_bed, "rt") as in_all:
        total_mapped = {line.split("\t")[3]:1 for line in in_all.readlines()}
    with gzip.open(input.unique_bed, "rt") as in_bed:
        unique = sum([1 for line in in_bed])
    multi = len(total_mapped) - unique
    with open(output.mapped_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("reads.mapped.uniqueMappers.nr\t%i\n" % unique)
        out.write("reads.mapped.multiMappers.nr\t%i\n" % multi)

shell:
    '''
    (perl {input.script} \
    {params.exclude_chr} \
    --correction={params.correction} \
    --strict \
    --min_align={params.min_align} \
    {input.unique_bed} \
    | gzip > {output.end_sites}) 2>> {log}
    '''

run:
    import gzip
    plus = 0
    plus_reads = 0
    minus = 0
    minus_reads = 0
    with gzip.open(input.end_sites, "rt") as in_bed:
        for line in in_bed:
            F = line.rstrip().split("\t")
            if F[5] == "+":
                plus += 1
                plus_reads += float(F[4])
            else:
                minus += 1
                minus_reads += float(F[4])
    with open(output.sites_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("sites.highconfidence.number.plus\t%i\n" % plus)
        out.write("sites.highconfidence.number.minus\t%i\n" % minus)
        out.write("sites.highconfidence.reads.plus\t%i\n" % plus_reads)
        out.write("sites.highconfidence.reads.minus\t%i\n" % minus_reads)
shell:
    '''
    (perl {input.script} \
    --genome={input.genome} \
    --upstream={params.upstream_ext} \
    --downstream={params.downstream_ext} \
    {input.ends} \
    | gzip > {output.seqs}) 2>> {log}
    '''

shell:
    '''
    (perl {input.script} \
    --upstream_len={params.upstream_ext} \
    --downstream_len={params.downstream_ext} \
    --consecutive_As={params.consec_As} \
    --total_As={params.tot_As} \
    {params.ds_patterns} \
    {input.seqs} \
    | gzip > {output.ip_assigned}) 2>> {log}
    '''

run:
    import gzip
    plus = 0
    plus_reads = 0
    minus = 0
    minus_reads = 0
    with gzip.open(input.end_sites, "rt") as in_bed:
        for line in in_bed:
            F = line.rstrip().split("\t")
            if F[3] == "IP":
                if F[5] == "+":
                    plus += 1
                    plus_reads += float(F[4])
                else:
                    minus += 1
                    minus_reads += float(F[4])
    with open(output.ip_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("sites.highconfidence.internalpriming.number.plus\t%i\n" % plus)
        out.write("sites.highconfidence.internalpriming.number.minus\t%i\n" % minus)
        out.write("sites.highconfidence.internalpriming.reads.plus\t%i\n" % plus_reads)
        out.write("sites.highconfidence.internalpriming.reads.minus\t%i\n" % minus_reads)

shell:
    '''
    echo '#########################\n \
    Pre-processing completed.\n#########################\n \
    Created "{input.counts}"' \
    > {output.prepro_cmplt}
    '''
shell:
    '''
    (perl {input.script} \
    --noip \
    {input.files} \
    | gzip > {output.pooled_sites}) 2>> {log}
    '''

run:
    import gzip
    n = 0
    with gzip.open(input.pooled_sites, "rt") as infile:
        n = sum([1 for line in infile if not line.startswith("#")])
    with open(output.pooled_sites_cnt, "w") as out:
        out.write("3pSites.pooled:\t%i\n" % n)

shell:
    '''
    (perl {input.script} \
    {params.signals} \
    --genome={params.genome} \
    {input.pooled_sites} \
    | gzip > {output.sites_with_pas}) 2>> {log}
    '''

shell:
    '''
    perl {input.script} \
    --cutoff={params.cutoff} \
    --upstream={params.upstream_reg} \
    --downstream={params.downstream_reg} \
    --sample={wildcards.sample} \
    {input.sites_with_pas} \
    > {output.sites_filtered} 2>> {log}
    '''
run:
    import gzip
    with gzip.open(output.table_filtered, "wt") as out_file, gzip.open(input.table_adjusted, "rt") as infile:
        for line in infile:
            if line.startswith("#"):
                out_file.write(line)
                continue
            line_list = line.rstrip().split("\t")
            read_sum = sum( [1 for i in line_list[3:-2] if float(i) > 0] )
            if read_sum > 0: # this site has still read support
                out_file.write(line)

run:
    import gzip
    sites = 0
    reads = 0
    pas = 0
    pas_reads = 0
    col = 0
    with gzip.open(input.noBG_sites,"rt") as all_sites:
        for line in all_sites:
            if line.startswith("#"):
                if params.sample in line:
                    F = line.rstrip().split(";")
                    col = int(F[0].lstrip("#"))
            else:
                if col == 0:
                    print("Column for sample could not be identified!")
                    print(params.sample)
                    exit()
                else:
                    line_list = line.rstrip().split("\t")
                    if line_list[col] != "0":
                        sites += 1
                        reads += int(line_list[col])
                        if line_list[-2] != "NA":
                            pas += 1

run:
    import gzip
    n = 0
    with gzip.open(input.noBG_sites, "rt") as infile:
        n = sum([1 for line in infile if not line.startswith("#")])
    with open(output.noBG_sites_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("3pSites.noBG:\t%i\n" % n)
shell:
    '''
    (perl {input.script} \
    --upstream={params.upstream_ext} \
    --downstream={params.downstream_ext} \
    {input.table_filtered} \
    | gzip > {output.primary_clusters}) 2> {log}
    '''

run:
    import gzip
    n = 0
    with gzip.open(input.clusters, "rt") as infile:
        n = sum([1 for line in infile if not line.startswith("#")])
    with open(output.clusters_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("clusters.primary:\t%i\n" % n)

shell:
    '''
    (perl {input.script} \
    --minDistToPAS={params.minDistToPAS} \
    --maxsize={params.maxsize} \
    {input.primary_clusters} \
    | gzip > {output.merged_clusters}) 2> {log}
    '''

run:
    import gzip
    n = 0
    with gzip.open(input.clusters, "rt") as infile:
        n = sum([1 for line in infile if not line.startswith("#")])
    with open(output.clusters_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("clusters.merged:\t%i\n" % n)
shell:
    '''
    python {input.script} \
    --verbose \
    --gtf {input.anno} \
    --ds-range {params.downstream_region} \
    --input {input.merged_clusters} \
    | gzip > {output.clusters_annotated} \
    2> {log}
    '''

shell:
    '''
    python {input.script} \
    --verbose \
    --design={params.design_file} \
    --in {input.clusters_annotated} \
    --out {output.clusters_temp} \
    2> {log} &&
    gzip -c {output.clusters_temp} \
    > {output.clusters_support}
    '''

shell:
    '''
    python {input.script} \
    -i {input.clusters} \
    -s {params.id} \
    -o {output.samples_temp} \
    2> {log} &&
    gzip -c {output.samples_temp} \
    > {output.samples_bed}
    '''

shell:
    '''
    sortBed \
    -i {input.bed} \
    | gzip \
    > {output.sorted_bed}
    '''
run:
    import gzip
    n = 0
    p = 0
    annos = {'TE': 0, 'EX': 0, 'IN': 0, 'DS': 0, 'AE': 0, 'AI': 0, 'AU': 0, 'IG': 0}
    with gzip.open(input.clusters_bed, "rt") as infile:
        for line in infile:
            # Count clusters
            n += 1
            # Count clusters with PAS
            if not "NA" in line:
                p += 1
            # For each cluster get annotation
            a = line.split('\t')[9]
            annos[a] += 1
    with open(output.clusters_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("{}".format(cnt.read() ))
        out.write("clusters.all:\t{:d}\n".format(n))
        out.write("clusters.PAS.nr:\t{:d}\n".format(p))
        out.write("clusters.PAS.percent:\t{:d}\n".format(int(p/n*100))) # For put in mongo we need int
        for k in annos.keys():
            out.write("clusters.annos.%s:\t%s\n" % (k, annos[k]))

shell:
    '''
    echo '#########################\n \
    Clustering completed.\n \
    #########################\n \
    Created "{input.noBG_cnt}"\n \
    "{input.cluster_stats}"\n \
    "{input.clusters_bed}"\n \
    "{input.samples_bed}"\n' \
    > {output.clst_cmplt}
    '''

shell:
    '''
    wget -O {output.chr_sizes_ucsc} \
    {params.url} \
    &> /dev/null
    '''

shell:
    '''
    python {input.script} \
    -i {input.clusters} \
    -s {params.id} \
    --chr-names {params.chr_names} \
    -p {output.plus} \
    -m {output.minus} \
    2> {log}
    '''
shell:
    '''
    sortBed \
    -i {input.ucsc_bed} \
    > {output.sorted_bed}
    '''

shell:
    '''
    bedGraphToBigWig \
    {input.ucsc_bed} \
    {input.chr_sizes} \
    {output.bigWig}
    '''

run:
    with open(output.track_info, "wt") as out:
        out.write('track type=bigWig name="%s: poly(A) clusters plus strand %s" \
        visibility="full" color="4,177,216" maxHeightPixels="128:60:8"\
        bigDataUrl="%s/%s"\n'\
        % (params.name, params.atlas_public_name, params.url, params.plus))
        out.write('track type=bigWig name="%s: poly(A) clusters minus strand %s" \
        visibility="full" color="241,78,50" maxHeightPixels="128:60:8"\
        bigDataUrl="%s/%s"\n' \
        % (params.name, params.atlas_public_name, params.url, params.minus))

shell:
    '''
    echo '#########################\n \
    Track files completed.\n \
    #########################\n \
    Created "{input.atlas_track}"\n \
    "{input.sample_tracks}"\n' \
    > {output.tracks_cmplt}
    '''