atavide: Comprehensive Snakemake Workflow for Advanced Metagenomics Analysis and Annotation
atavide is a complete workflow for metagenomics data analysis, including QC/QA, optional host removal, assembly and cross-assembly, and individual read-based annotations. We have also built in some advanced analytics, including tools to assign annotations from reads to contigs and to generate metagenome-assembled genomes in several different ways, giving you the power to explore your data!

atavide is 100% Snakemake and conda, so you only need to install the Snakemake workflow, and then everything else will be installed with conda.
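Once Snakemake and conda are available, an invocation might look like the sketch below; the Snakefile path and core count here are assumptions, not documented defaults, so adjust them for your checkout:

```
# a minimal sketch of running atavide with conda-managed environments
# (the --snakefile path and --cores value are placeholders)
snakemake --snakefile atavide/workflow/Snakefile --use-conda --cores 16
```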
Steps:
- QC/QA with prinseq++
- optional host removal using bowtie2 and samtools, as described previously. To enable this, you need to provide the path to the host database and the host database name (a minimal sketch of this step follows this list).
Metagenome assembly
- pairwise assembly of each sample using megahit
- extraction of all reads that do not assemble using samtools flags
- assembly of all unassembled reads using megahit
- compilation of all contigs into a single unified set using Flye
- comparison of reads -> contigs to generate coverage
MAG creation
Read-based annotations
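To make the host-removal step concrete, here is a hedged sketch of the bowtie2/samtools approach it describes; the index path and file names are placeholders, and atavide's actual rules may use different flags:

```
# a sketch of host removal, not atavide's exact rule:
# map reads to a host index, keep only pairs where neither the read
# nor its mate mapped (-f 12 = UNMAP + MUNMAP), drop secondary
# alignments (-F 256), then write the host-free pairs back to fastq
bowtie2 --mm -x /path/to/host_db -1 sample_R1.fastq -2 sample_R2.fastq | \
    samtools view -bh -f 12 -F 256 | \
    samtools sort -n - | \
    samtools fastq -1 clean_R1.fastq -2 clean_R2.fastq -
```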
Want something else added to the suite? File an issue on GitHub and we'll add it ASAP!
Installation
You will need to install:
- The NCBI taxonomy database somewhere
- The superfocus databases somewhere, and set the SUPERFOCUS_DB environment variable
Everything else should install automatically.
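As a sketch of that manual setup (all paths are placeholders, and the superfocus database download itself follows the superfocus documentation):

```
# download the NCBI taxonomy dump (used by taxonkit via --data-dir)
mkdir -p /data/taxonomy && cd /data/taxonomy
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xf taxdump.tar.gz

# point superfocus at its databases (path is a placeholder)
export SUPERFOCUS_DB=/data/superfocus_db
```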
Code Snippets
```
shell:
    """
    python3 {params.sct} --file {input} --output {output.h5} --header --dataset contigs --indexfile {output.idx}
    """
```
```
shell:
    """
    turbocor compute {input} {output.tmp} --dataset contigs;
    turbocor topk 1000000 {output.tmp} > {output.pearson}
    """
```
```
shell:
    """
    python3 {params.sct} --file {input} --output {output} \
        --separator , --threshold {params.pearson_threshold}
    """
```
```
shell:
    """
    python3 {params.sct} --fasta {input.fa} --clusters {input.cl} \
        --directory {output.directory} --idx {input.idx}
    """
```
```
shell:
    """
    mkdir --parents {params.basename};
    jgi_summarize_bam_contig_depths --outputDepth {output.depth} {input.bam}
    """
```
```
shell:
    """
    touch {output}
    metabat2 -i {input.contigs} -a {input.depth} -m 1500 -o {params.base} -t {threads}
    """
```
```
shell:
    """
    concoct --composition_file {input.contigs} --coverage_file {input.tsv} -b {params.od} -t {threads}
    """
```
```
shell:
    """
    touch {output}
    extract_fasta_bins.py {input.contigs} {input.odir} --output_path {params.base}
    """
```
```
shell:
    """
    python {params.sct} --bam {input.bam} \
        --singlem {input.singlem} \
        --kraken {input.kraken} \
        --output {params.out} \
        --verbose
    """
```
```
shell:
    """
    mkdir -p {output.d};
    focus -q {input.r1} -q {input.r2} -o {output.d} -t {threads}
    """
```
```
shell:
    """
    for F in {input}; do zip -j $F.zip $F; done
    """
```
```
shell:
    """
    mkdir -p {params.d};
    seqtk fqchk {input.r1} > {output}
    """
```
```
shell:
    """
    cut -f 1,8 {input} | tail -n +2 | grep -v ALL | sed -e 's/avgQ/{params.s}/' > {output}
    """
```
```
shell:
    """
    perl {params.sct} -h {input} > {output}
    """
```
```
shell:
    """
    seqtk sample {input.r1} {params.f} > {output}
    """
```
```
shell:
    """
    kraken2 --report {output.rt} \
            --threads {threads} \
            {input.r1}
    """

"""
I think the next three rules will make this work, but I need to double-check them. Essentially we just mix some bash in here too!

cd RBADIR
WD=$PWD;
for SEQ in FAME0000*; do
    echo $SEQ; cd $SEQ/kraken; \
    echo -e "Fraction\t$SEQ" > $SEQ.kraken_rarefaction.tsv; \
    for FRX in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9; \
        do awk -v F=$FRX '$4 == "S" {c++} END {print F,"\t",c}' $SEQ.report.$FRX.tsv; \
    done >> $SEQ.kraken_rarefaction.tsv; \
    awk '$4 == "S" {c++} END {print "1.0\t",c}' $SEQ.report.tsv >> $SEQ.kraken_rarefaction.tsv; \
    cd $WD;
done

joinlists.pl -h */kraken/*kraken_rarefaction.tsv > ../statistics/kraken_rarefaction.tsv
"""
```
```
shell:
    """
    touch {output}
    """
```
```
shell:
    """
    echo -e "Fraction\t{params.sample}" > {output};
    for FRX in {params.frx}; do
        awk -v F=$FRX '$4 == "S" {{c++}} END {{print F,"\t",c}}' {params.smp}.report.$FRX.tsv;
    done >> {output};
    awk '$4 == "S" {{c++}} END {{print "1.0\t",c}}' {params.smp}.report.tsv >> {output}
    """
```
```
shell:
    """
    perl {params.sct} -h {input} > {output}
    """
```
```
shell:
    """
    kraken2 --report {output.rt} \
            --output {output.ot} \
            --threads {threads} \
            {input.r1}
    """
```
```
shell:
    """
    grep -v ^U {input} | \
        taxonkit lineage -j {threads} -i 3 --data-dir {params.t} | \
        taxonkit reformat -j {threads} --data-dir {params.t} -i 6 -f \
        "Root; d__{{k}}; p__{{p}}; c__{{c}}; o__{{o}}; f__{{f}}; g__{{g}}" \
        --fill-miss-rank | cut -f 2,3,7 > {output}
    """
```
```
shell:
    """
    echo "Phylum\t{wildcards.sample}" > {output};
    cat {input} | sed -e 's/Candidatus//' | \
        awk '{{if ($4 == "P") {{printf "%s\\t%f\\n", $6, $1}}; }}' >> {output}
    """
```
```
shell:
    """
    echo "Family\t{wildcards.sample}" > {output};
    cat {input} | sed -e 's/Candidatus//' | \
        awk '{{if ($4 == "F") {{printf "%s\\t%f\\n", $6, $1}}; }}' >> {output}
    """
```
```
shell:
    """
    echo "Genus\t{wildcards.sample}" > {output};
    cat {input} | sed -e 's/Candidatus//' | \
        awk '{{if ($4 == "G") {{printf "%s\\t%f\\n", $6, $1}}; }}' >> {output}
    """
```
```
shell:
    """
    echo "Species\t{wildcards.sample}" > {output};
    cat {input} | sed -e 's/Candidatus//' | \
        awk '{{if ($4 == "S") {{for (i=6; i<=NF; ++i) {{printf "%s ", $i}} printf "\\t%f\\n", $1}}; }}' \
        >> {output}
    """
```
```
shell:
    """
    perl {params.sct} -z -h {input} > {output}
    """
```
```
shell:
    """
    perl {params.sct} -z -h {input} > {output}
    """
```
```
shell:
    """
    perl {params.sct} -z -h {input} > {output}
    """
```
```
shell:
    """
    perl {params.sct} -z -h {input} > {output}
    """
```
```
shell:
    """
    prinseq++ -min_len 60 -min_qual_mean 25 -ns_max_n 1 -derep 1 \
        -out_format 0 -trim_tail_left 5 -trim_tail_right 5 \
        -ns_max_n 5 -trim_qual_type min -trim_qual_left 30 \
        -trim_qual_right 30 -trim_qual_window 10 \
        -threads {threads} \
        -out_name {params.o} \
        -out_bad /dev/null \
        -out_bad2 /dev/null \
        -fastq {input.r1} \
        -fastq2 {input.r2};
    """
```
```
shell:
    """
    for F in {input}; do gzip -c $F > $F.gz; done
    """
```
```
shell:
    """
    rmdir {params.odir};
    megahit -1 {input.r1} -2 {input.r2} -r {input.s1} -r {input.s2} \
        -o {params.odir} -t {threads} --use-gpu --mem-flag 2
    """
```
```
shell:
    """
    rmdir {params.odir};
    megahit -1 {input.r1} -2 {input.r2} -r {input.s1} -r {input.s2} \
        -o {params.odir} -t {threads} --mem-flag 2
    """
```
```
script:
    "../scripts/renumber_merge_fasta_smk.py"
```
```
shell:
    """
    bowtie2 --mm -x {params.contigs} -1 {input.r1} -2 {input.r2} \
        -U {input.s1} -U {input.s2} --threads {threads} | \
        samtools view -bh | samtools sort -o {output} -
    """

"""
Using samtools to extract unmapped reads from the bam files.

A samtools flag of -f 77 means the read is paired, neither the read nor its mate is mapped, and it is the first read in the pair, while a flag of -f 141 means the same except it is the second read in the pair.

Then we use two flags, -f 4 (the read is unmapped) and -F 1 (the read is not paired), to find the single reads that are not mapped.

See: https://edwards.sdsu.edu/research/command-line-deconseq/
"""
```
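If you want to sanity-check what those numeric flag values decode to, samtools itself can expand them (the output below is from a recent samtools and its formatting may vary slightly between versions):

```
$ samtools flags 77
0x4d	77	PAIRED,UNMAP,MUNMAP,READ1
$ samtools flags 141
0x8d	141	PAIRED,UNMAP,MUNMAP,READ2
```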
```
shell:
    """
    samtools view -@ {threads} -h {input} | \
        awk 'BEGIN {{FS="\t"; OFS="\t"}} {{if (/^@/ && substr($2, 3, 1)==":") {{print}} else if (and($2, 0x1) && and($2, 0x40) && (and($2, 0x4) || and($2, 0x8))) {{print}}}}' \
        | samtools bam2fq -@ {threads} > {output}
    """
```
```
shell:
    """
    samtools view -@ {threads} -h {input} | \
        awk 'BEGIN {{FS="\t"; OFS="\t"}} {{if (/^@/ && substr($2, 3, 1)==":") {{print}} else if (and($2, 0x1) && and($2, 0x80) && (and($2, 0x4) || and($2, 0x8))) {{print}}}}' \
        | samtools bam2fq -@ {threads} > {output}
    """
```
```
shell:
    "samtools fastq -@ {threads} -f 4 -F 1 {input} > {output}"

"""
We concatenate the unassembled reads into separate R1/R2/singleton files so we can assemble them all together. We also do this in 3 separate threads to take advantage of parallelization and to make the commands easier.
"""
```
```
shell:
    """
    for F in {input}; do gzip $F; done
    """
```
```
shell:
    "cat {input} > {output}"
```
```
shell:
    "cat {input} > {output}"
```
```
shell:
    "cat {input} > {output}"
```
```
shell:
    """
    rmdir {params.odir};
    megahit -1 {input.r1} -2 {input.r2} -r {input.s0} -o {params.odir} -t {threads} --use-gpu --mem-flag 2
    """
```
```
shell:
    """
    rmdir {params.odir};
    megahit -1 {input.r1} -2 {input.r2} -r {input.s0} -o {params.odir} -t {threads} --mem-flag 2
    """

"""
Combine all the contigs and use metaflye to merge the subassemblies.
"""
```
```
script:
    "../scripts/renumber_merge_fasta_smk.py"
```
```
shell:
    """
    flye --meta --subassemblies {input.contigs} -o {CCMO} --threads {threads}
    """
```
```
script:
    "../scripts/renumber_merge_fasta_smk.py"
```
```
shell:
    """
    bowtie2 --mm -x {params.contigs} -1 {input.r1} -2 {input.r2} \
        -U {input.s1} -U {input.s2} --threads {threads} | \
        samtools view -bh | samtools sort -o {output} -
    """
```
```
shell:
    "samtools index {input}"
```
```
shell:
    """
    samtools idxstats -@ {threads} {input.bam} | cut -f 1,3 > {output}
    """
```
```
script:
    "../scripts/rpkm.py"
```
```
shell:
    """
    mkdir --parents {output.d};
    singlem pipe --forward {input.r1} --reverse {input.r2} --otu_table {output.otu} --output_extras --threads {threads}
    """
```
```
script:
    "../scripts/countfastq.py"
```
```
script:
    "../scripts/countfastq.py"
```
```
shell:
    """
    python3 {params.sct} -f {input} > {output}
    """
```
```
shell:
    """
    perl {params.sct} -t {input} > {output}
    """
```
```
shell:
    """
    perl {params.sct} -t {input} > {output}
    """
```
```
shell:
    """
    python3 {params.sct} -f {input} -ln > {output}
    """
```
```
shell:
    """
    perl {params.sct} -t {input} > {output}
    """
```
```
shell:
    """
    perl {params.sct} {input} > {params.out} && gzip {params.out}
    """
```
```
shell:
    """
    superfocus -q {input.r1} -q {input.r2} -dir {params.d} -a mmseqs2 -t {threads} -n 0 -tmp $(mktemp -d -p {params.TMPDIR})
    """

"""
In the rules below we use the m8 files from superfocus to create taxonomy tables.
"""
```
```
shell:
    """
    perl -F"\\t" -lane 'if ($F[1] =~ /fig\|(\d+)\.\d+/ && $1 != 6666666) \
        {{print "$F[0]\\t$1"}}' {input.m8} | \
        taxonkit lineage -j {threads} -i 2 --data-dir {params.t} | \
        taxonkit reformat -j {threads} --data-dir {params.t} -i 3 \
        -f "Root; d__{{k}}; p__{{p}}; c__{{c}}; o__{{o}}; f__{{f}}; g__{{g}}" \
        --fill-miss-rank | \
        cut -f 1,2,4 > {output}
    """
```
```
script:
    "../scripts/count_superfocus_taxonomy.py"
```
```
shell:
    """
    for F in {input}; do zip -j $F.zip $F; done
    """
```
```
shell:
    """
    for F in {input}; do zip -j $F.zip $F; done
    """
```
```
import os
import sys
from atavide_lib import stream_fastq

__author__ = 'Rob Edwards'

# snakemake.input.fqdir is the directory of fastq files to count
# snakemake.output.stats is the output file to write

with open(snakemake.output.stats, 'w') as out:
    print("File\tNumber of Sequences\tTotal bp\tShortest length\tLongest length\tN50\tN75\tAuN", file=out)
    for f in os.listdir(snakemake.input.fqdir):
        if 'fastq' in f or 'fq' in f:
            print(f"Counting {f}", file=sys.stderr)
            lens = []
            for (sid, label, seq, qual) in stream_fastq(os.path.join(snakemake.input.fqdir, f)):
                lens.append(len(seq))
            if not lens:
                print(f"Skipped {f}: no sequences found", file=sys.stderr)
                continue
            # sort longest first so the cumulative sums give the standard N50/N75:
            # the length at which 50% (75%) of the bases are in reads this long or longer
            lens.sort(reverse=True)
            length = sum(lens)
            len_so_far = 0
            n50 = 0
            n75 = 0
            auN = 0
            for i in lens:
                len_so_far += i
                if not n50 and len_so_far >= length * 0.5:
                    n50 = i
                if not n75 and len_so_far >= length * 0.75:
                    n75 = i
                auN += i**2
            # auN is the area under the Nx curve: sum of squared lengths over total length
            if length > 0:
                auN /= length
            else:
                auN = 0
            print(f"{f}\t{len(lens):,}\t{length:,}\t{lens[-1]:,}\t" \
                  + f"{lens[0]:,}\t{n50:,}\t{n75:,}\t{int(auN):,}", file=out)
        else:
            print(f"Skipped {os.path.join(snakemake.input.fqdir, f)}. Does not appear to be fastq", file=sys.stderr)
```
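As a worked example of the statistics this script reports: for read lengths 10, 5, 3, and 2 (20 bp total), the longest-first cumulative sums are 10, 15, 18, 20. Half the total (10 bp) is reached at the first read, so N50 = 10; 75% (15 bp) is reached at the second, so N75 = 5. The auN is the length-weighted mean length, (10² + 5² + 3² + 2²) / 20 = 6.9.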
```
from atavide_lib import read_fasta

__author__ = 'Rob Edwards'

# snakemake.input is an array of fasta files to merge
# snakemake.output.contigs is the name of the final contigs file
# snakemake.output.ids is the name of the ids file to write
# snakemake.params.sample_id is the sample_id that will be prepended to the contig names

idmap = open(snakemake.output.ids, 'w')
out = open(snakemake.output.contigs, 'w')

counter = 0
for f in snakemake.input:
    fa = read_fasta(f)
    for seqid in fa:
        counter += 1
        if hasattr(snakemake, 'params') and hasattr(snakemake.params, 'sample_id'):
            # prepend the sample id so contig names stay unique across samples
            out.write(f">{snakemake.params.sample_id}_{counter}\n{fa[seqid]}\n")
            idmap.write(f"{f}\t{seqid}\t{snakemake.params.sample_id}_{counter}\n")
        else:
            out.write(f">{counter}\n{fa[seqid]}\n")
            idmap.write(f"{f}\t{seqid}\t{counter}\n")

idmap.close()
out.close()
```
```
import os
import sys
from atavide_lib import stream_fastq

__author__ = 'Rob Edwards'

# check that all the inputs exist before we start counting
errors = False
if not os.path.exists(snakemake.input.r1):
    sys.stderr.write(f"ERROR: snakemake.input.r1 {snakemake.input.r1} not found\n")
    errors = True
if not os.path.exists(snakemake.input.contig_len):
    sys.stderr.write(f"ERROR: snakemake.input.contig_len {snakemake.input.contig_len} not found\n")
    errors = True
if not os.path.exists(snakemake.input.hits):
    sys.stderr.write(f"ERROR: snakemake.input.hits {snakemake.input.hits} not found\n")
    errors = True
if errors:
    sys.exit(1)

# count the reads in R1 and express the total in millions
numseqs = 0
for seqid, header, seq, qualscores in stream_fastq(snakemake.input.r1):
    numseqs += 1
numseqs /= 1e6

# read the contig lengths (contig id -> length in bp)
contigs = {}
with open(snakemake.input.contig_len, 'r') as f:
    for l in f:
        r = l.strip()
        if not r:
            continue
        p = r.split("\t")
        if len(p) < 2:
            sys.stderr.write(f"Skipped {r} in {snakemake.input.contig_len}\n")
            continue
        contigs[p[0]] = int(p[1])

# RPKM = mapped reads / (contig length in kb) / (total reads in millions)
with open(snakemake.input.hits, 'r') as f, open(snakemake.output.rpkm, 'w') as out:
    for l in f:
        p = l.strip().split("\t")
        if p[0] == '*':
            continue
        if p[0] not in contigs:
            sys.stderr.write(f"ERROR: contig {p[0]} does not have a length.\n")
            contigs[p[0]] = 1
        rpkm = int(p[1]) / numseqs / (contigs[p[0]] / 1000)
        print(f"{p[0]}\t{rpkm}", file=out)
```
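For reference, the RPKM computed here normalizes mapped read counts by both contig length (per kilobase) and sequencing depth (per million reads). For example, 200 reads mapped to a 4 kb contig in a library of 10 million reads gives 200 / 4 / 10 = 5 RPKM, which lets coverage be compared across contigs of different lengths and across samples of different depths.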