Automated Transcript Assembly Pipeline with Dependencies

yetAnotherAutoTranscriptAssemblyPipeline

Requirements

ffq v0.2.1
FastQC v0.11.8
BBDuk v35.85
Kraken2 v2.1.2
ContFree-NGS.py v1.0
Trinity v2.8.5
CD-HIT-EST v4.8.1
BUSCO v5
transrate v1.0.3
Salmon v1.3.0
Python 3.x

References

Köster, J., Rahmann, S. (2012) Snakemake - a scalable bioinformatics workflow engine. Bioinformatics, Volume 28, Issue 19, Pages 2520–2522. https://doi.org/10.1093/bioinformatics/bts480

Code Snippets

shell:
	"""
	cd MyAssembly_{params.genotype}/1_raw_reads_in_fastq_format && \
	{ffq} --ftp {wildcards.sample} | grep -Eo '\"url\": \"[^\"]*\"' | grep -o '\"[^\"]*\"$' | xargs wget && \
	gzip -dc < {wildcards.sample}_1.fastq.gz > {wildcards.sample}_1.fastq && \
	gzip -dc < {wildcards.sample}_2.fastq.gz > {wildcards.sample}_2.fastq && \
	cd -
	"""

SnakeMake From line 72 of main/Snakefile
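
The download step above pulls every `"url": "..."` pair out of the ffq output with two grep calls. A minimal sketch of the same extraction with a JSON parser is shown below. It assumes only that the ffq output is valid JSON containing `"url"` keys, as the grep pattern implies; the function name `extract_ftp_urls` and the sample record in the usage note are hypothetical and not part of the pipeline.

```python
import json


def extract_ftp_urls(ffq_json):
    """Collect every string stored under a "url" key, at any nesting depth."""
    urls = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "url" and isinstance(value, str):
                    urls.append(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(ffq_json))
    return urls


if __name__ == "__main__":
    # Hypothetical record, shaped only to contain "url" keys.
    sample = '[{"url": "ftp://host/SRR0_1.fastq.gz"}, {"url": "ftp://host/SRR0_2.fastq.gz"}]'
    for url in extract_ftp_urls(sample):
        print(url)
```

Walking the whole structure mirrors the grep approach, which also matches `"url"` wherever it appears, but it does not depend on the exact spacing of the JSON text.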

shell:
	"{fastqc} -f fastq {input.R1} -o MyAssembly_{params.genotype}/2_raw_reads_fastqc_reports 2> {log};"
	"{fastqc} -f fastq {input.R2} -o MyAssembly_{params.genotype}/2_raw_reads_fastqc_reports 2>> {log}"

SnakeMake From line 102 of main/Snakefile

shell:
	"""
	/usr/bin/time -v {salmon} index -t {input.transcriptome} -p {threads} -i {output.salmon_index} > {log} 2>&1
	"""

SnakeMake From line 123 of main/Snakefile

shell:
	"""
	/usr/bin/time -v {salmon} quant -p {threads} -i {input.salmon_index} -l A -1 {input.R1} -2 {input.R2} -o datasets_{params.genotype}/3_salmon/quant/{wildcards.sample} > {log} 2>&1
	"""

SnakeMake Quant From line 149 of main/Snakefile

shell:
	"""
	{jq} -r '.library_types[]' {input.meta_info} > MyAssembly_{params.genotype}/3_salmon/quant/lib.txt 2>> {log}
	ls MyAssembly_{wildcards.genotype}/3_salmon/quant/ | grep -E "(SRR|ERR)" > MyAssembly_{params.genotype}/3_salmon/quant/id.txt 2>> {log}
	paste MyAssembly_{params.genotype}/3_salmon/quant/id.txt MyAssembly_{params.genotype}/3_salmon/quant/lib.txt -d, > MyAssembly_{params.genotype}/3_salmon/quant/stranded_status.csv 2>> {log}
	grep .S MyAssembly_{params.genotype}/3_salmon/quant/stranded_status.csv > MyAssembly_{params.genotype}/3_salmon/quant/stranded_samples.csv 2>> {log}
	cut -f1 -d, MyAssembly_{params.genotype}/3_salmon/quant/stranded_samples.csv | paste -s -d, > MyAssembly_{params.genotype}/3_salmon/quant/{params.genotype}_srrlist.csv 2>> {log}
	"""

SnakeMake From line 172 of main/Snakefile
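
The snippet above pairs each run ID with the library type reported by Salmon, keeps the stranded ones, and joins their IDs with commas. A rough Python sketch of that logic follows. It relies on the fact that Salmon's stranded library type strings (for example `ISR`, `ISF`, `SR`, `SF`) contain an `S`, while unstranded ones (`IU`, `U`) do not. The function name `stranded_sample_list` is hypothetical, and the check is an approximation of `grep .S`, not a drop-in replacement.

```python
def stranded_sample_list(sample_ids, library_types):
    """Return a comma-separated string of the IDs whose library type is stranded.

    sample_ids and library_types are parallel sequences, like the id.txt and
    lib.txt files that the shell snippet combines with `paste`.
    """
    stranded = [
        sample_id
        for sample_id, lib_type in zip(sample_ids, library_types)
        if "S" in lib_type.upper()
    ]
    return ",".join(stranded)


if __name__ == "__main__":
    print(stranded_sample_list(["SRR1", "SRR2", "ERR3"], ["ISR", "IU", "SF"]))
```

Testing the library type column alone avoids any accidental match against the run ID itself.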

shell:
	"{bbduk} -Xmx40g threads={threads} in1={input.R1} in2={input.R2} "
	"refstats={output.refstats} stats={output.stats} "
	"out1={output.R1} out2={output.R2} "
	"rref=/Storage/progs/bbmap_35.85/resources/adapters.fa "
	"fref=/Storage/progs/sortmerna-2.1b/rRNA_databases/rfam-5.8s-database-id98.fasta,"
	"/Storage/progs/sortmerna-2.1b/rRNA_databases/silva-bac-16s-id90.fasta,"
	"/Storage/progs/sortmerna-2.1b/rRNA_databases/rfam-5s-database-id98.fasta,"
	"/Storage/progs/sortmerna-2.1b/rRNA_databases/silva-bac-23s-id98.fasta,"
	"/Storage/progs/sortmerna-2.1b/rRNA_databases/silva-arc-16s-id95.fasta,"
	"/Storage/progs/sortmerna-2.1b/rRNA_databases/silva-euk-18s-id95.fasta,"
	"/Storage/progs/sortmerna-2.1b/rRNA_databases/silva-arc-23s-id98.fasta,"
	"/Storage/progs/sortmerna-2.1b/rRNA_databases/silva-euk-28s-id98.fasta "
	"minlength=75 qtrim=w trimq=20 tpe tbo 2> {log}"

SnakeMake From line 215 of main/Snakefile

shell:
	"{fastqc} -f fastq {input.R1} -o MyAssembly_{params.genotype}/4_trimmed_reads_fastqc_reports 2> {log};"
	"{fastqc} -f fastq {input.R2} -o MyAssembly_{params.genotype}/4_trimmed_reads_fastqc_reports 2>> {log}"

SnakeMake From line 251 of main/Snakefile

shell:
	"{kraken2} --db /Storage/data1/felipe.peres/kraken2/completeDB "
	"--threads {threads} --report-zero-counts --confidence 0.05 --output {output} --paired {input.R1} {input.R2} 2> {log}"

SnakeMake From line 273 of main/Snakefile

shell:
	"split --number=l/10 -d --additional-suffix=.kraken {input} MyAssembly_{params.genotype}/5_trimmed_reads_kraken_reports/parts/{params.identificator}.trimmed_ 2> {log}"

SnakeMake From line 296 of main/Snakefile

shell:
	"{create_index} -R1 {input.R1} -R2 {input.R2} -o MyAssembly_{params.genotype}/6_contamination_removal/index/ 2> {log}"

SnakeMake From line 318 of main/Snakefile

shell:
	"python3.8 {contfree_ngs} --taxonomy {input.kraken_file} --s p --R1 {input.R1} --R2 {input.R2} --taxon Viridiplantae -o MyAssembly_{params.genotype}/6_contamination_removal/parts/ 2> {log};"

SnakeMake From line 343 of main/Snakefile

shell:
	"cat {input.filtered_parts_R1} >> {output.filtered_total_R1};"
	"cat {input.filtered_parts_R2} >> {output.filtered_total_R2};"
	"cat {input.unclassified_parts_R1} >> {output.unclassified_total_R1};"
	"cat {input.unclassified_parts_R2} >> {output.unclassified_total_R2}"

SnakeMake From line 368 of main/Snakefile

shell:
	"/usr/bin/time -v {trinity} --seqType fq --left {params.filtered_total_R1},{params.unclassified_total_R1} --right {params.filtered_total_R2},{params.unclassified_total_R2} --SS_lib_type RF --max_memory 10G --min_contig_length 200 --CPU {threads} --output 7_trinity_assembly/MyAssembly_{params.genotype}_trinity_k25 --full_cleanup --no_normalize_reads --KMER_SIZE 25 2> {log.k25};"
	"/usr/bin/time -v {trinity} --seqType fq --left {params.filtered_total_R1},{params.unclassified_total_R1} --right {params.filtered_total_R2},{params.unclassified_total_R2} --SS_lib_type RF --max_memory 10G --min_contig_length 200 --CPU {threads} --output 7_trinity_assembly/MyAssembly_{params.genotype}_trinity_k31 --full_cleanup --no_normalize_reads --KMER_SIZE 31 2> {log.k31}"

SnakeMake From line 405 of main/Snakefile
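
The two Trinity invocations above differ only in k-mer size, so both can be generated from one template that ties the output directory to `--KMER_SIZE`. The sketch below uses only flags that appear in the snippet; the helper name `trinity_command` and the file names in the usage note are hypothetical.

```python
def trinity_command(trinity, genotype, left, right, kmer, threads):
    """Build the Trinity argument list for one k-mer size.

    left and right are lists of FASTQ paths, joined with commas as in the
    shell snippet. The output directory name embeds the k-mer size so that
    runs with different k never share a directory.
    """
    out_dir = f"7_trinity_assembly/MyAssembly_{genotype}_trinity_k{kmer}"
    return [
        trinity,
        "--seqType", "fq",
        "--left", ",".join(left),
        "--right", ",".join(right),
        "--SS_lib_type", "RF",
        "--max_memory", "10G",
        "--min_contig_length", "200",
        "--CPU", str(threads),
        "--output", out_dir,
        "--full_cleanup",
        "--no_normalize_reads",
        "--KMER_SIZE", str(kmer),
    ]


if __name__ == "__main__":
    for k in (25, 31):
        cmd = trinity_command("Trinity", "GENOTYPE", ["f_R1.fq", "u_R1.fq"],
                              ["f_R2.fq", "u_R2.fq"], k, 8)
        print(" ".join(cmd))
```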

shell:
	"sed 's/>/>k25_{params.genotype}_/' {input.k25} > {output.mod_k25};"
	"sed 's/>/>k31_{params.genotype}_/' {input.k31} > {output.mod_k31};"
	"cat {output.mod_k25} {output.mod_k31} > {output.merged_mod};"
	"/usr/bin/time -v {cd_hit_est} -i {output.merged_mod} -o {output.final_cd_hit_est} -c 1 -n 11 -T {threads} -M 0 -d 0 -r 0 -g 1"

SnakeMake From line 430 of main/Snakefile
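
The `sed` calls above tag every contig header with its assembly of origin before the two assemblies are merged and passed to CD-HIT-EST. A small Python sketch of the renaming step is shown below. The function name `prefix_fasta_headers` is hypothetical, and unlike the `sed` expression it only touches lines that start with `>`.

```python
def prefix_fasta_headers(lines, prefix):
    """Yield FASTA lines with `prefix` inserted right after '>' on header lines."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + prefix + line[1:]
        else:
            yield line


if __name__ == "__main__":
    fasta = [">TRINITY_DN0_c0_g1_i1 len=305\n", "ACGTACGT\n"]
    for out_line in prefix_fasta_headers(fasta, "k25_GENOTYPE_"):
        print(out_line, end="")
```

Distinct prefixes keep contig names unique after concatenation, so the clustering output can still be traced back to the k=25 or k=31 assembly.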

shell:
	"{extract_contigs} -f {input.transcriptome} -m 301 2> {log}"

SnakeMake From line 452 of main/Snakefile

shell:
	"/usr/bin/time -v run_BUSCO.py -i {input.transcriptome} -o {output.busco} -c {threads} -m transcriptome -l /Storage/databases/BUSCO_DBs/embryophyta_odb9/ 2> {log}"

SnakeMake From line 471 of main/Snakefile

shell:
	"/usr/bin/time -v {transrate} --assembly {input.transcriptome} --reference {input.ref} --threads {threads} --output {output.transrate} 2> {log}"

SnakeMake From line 491 of main/Snakefile

shell:
	"/usr/bin/time -v {salmon} index -t {input.transcriptome} -p {threads} -i {output.salmon_index} --gencode 2> {log}"

SnakeMake From line 511 of main/Snakefile

shell:
	"/usr/bin/time -v {salmon} quant -i {input.salmon_index} -l A -1 {params.filtered_total_R1} {params.unclassified_total_R1} -2 {params.filtered_total_R2} {params.unclassified_total_R2} --validateMappings -o {output.salmon_quant} 2> {log}"