Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
MAG Snakemake Workflow
Contents
Overview
This pipeline has the scripts and modules used to generate, quality assess, and taxonomically classify skin MAGs which the pipeline will generate using different per sample and co-assembly approaches. The prok.Snakefile and euk.Snakefile detail the prokaryotic and eukaryotic analyses respectively. Note that for the eukaryotic analyses, EukRep and EukCC need to already have been run. To use the pipeline a file called coassembly_runs.txt and runs.txt must be produced which detail the co-assembly approaches and the per sample approaches used in our pipeline. The metagenomes must be put in the directory data/singlerun and data/coassembly for the single run and co-assembly approaches respectively. A sample coassembly_runs.txt and runs.txt file has been provided. This code has been adapted from https://github.com/Finn-Lab/MAG_Snakemake_wf/ accompanying the paper (https://www.nature.com/articles/s41596-021-00508-2). The viral analysis was done via the VIRify pipeline (https://github.com/EBI-Metagenomics/emg-viral-pipeline).
System Requirements
Hardware Requirements
HPC with at least 500 gigabytes of memory
Software Requirements
-
parallel-fastq-dump v0.6.6
-
KneadData v0.7.4 (Bowtie2 v2.3.5.1, Trimmommatic v0.39)
-
SPAdes v.3.13.0
-
metaWRAP v1.2.1
-
INFERNAL v1.1.2
-
tRNAScan-SE v2.0
-
BBMap v.37.62
-
GUNC v1.0.1
-
CAT v5.2.1
-
CheckM v1.1.2
-
Mash v.2.0
-
MUMmer v3.23
-
QUASTv5.0.2
-
dRep v2.3.2
-
GTDB-Tkv1.0.2
-
ncbi-genome-downloadv0.2.12
-
EukRep v.0.6.7
-
EukCC v0.2
Running pipeline
Submitting jobs
To run pipeline on the runs specified in runs.txt and coassembly_runs.txt, submit jobs with SLURM scheduler:
snakemake -s prok.Snakefile --use-singularity --restart-times 3 -k -j 50 --cluster-config clusterconfig.yaml --cluster "sbatch -n {cluster.nCPU} --mem {cluster.mem} -e {cluster.error} -o {cluster.output} -t {cluster.time}"
Submit jobs with LSF scheduler:
snakemake -s prok.Snakefile --use-singularity --restart-times 3 -k --jobs 50 --cluster-config clusterconfig.yaml --cluster "bsub -n {cluster.nCPU} -M {cluster.mem} -e {cluster.error} -o {cluster.output}"
Code Snippets
61 62 63 | shell:""" scp {input.tsv} {output.tsv} """ |
74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 | run: os.system("mkdir -p " + str(params.odir)) infile=str(input) outfile=str(output) qc=str(params.qc) with open(infile) as inp: LoL=[x.strip().split('\t') for x in inp] row1=LoL[1] completeness=LoL[1][0] contamination=LoL[1][1] run=str(params.run) g = open(outfile, "w") if float(completeness)>50 and float(contamination)<5: g.write(run+','+completeness+','+contamination+'\n') g.close() h = open(params.qc, "a") h.write(run+','+completeness+','+contamination+'\n') h.close() f = str(params.f) odir=str(params.odir) os.system("scp "+f+ " "+str(odir)) else: g.write("") g.close() |
104 | shell: "echo -e 'genome,completeness,contamination' | cat - {input} > {output}" |
117 118 119 120 121 122 123 124 | shell:""" dRep dereplicate -p {threads} \ {params.outfolder} -g {params.infolder}/*.fa \ -pa 0.9 -sa 0.95 -nc 0.3 \ -cm larger \ --genomeInfo {input.genome_info} \ -comp 50 -con 5 """ |
135 136 | shell: "mash dist -p {threads} {input.db} {input.bins} > {output}" |
146 147 148 149 | shell: """ sort -gk3 {input.mashdist}|sed -n 1p >{output} """ |
160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 | shell: """ mkdir -p {params.genomesdir} while read col1 col2 rem do f="$(basename -- ${{col1}})" echo {params.genomesdir}/${{f%%.gz}} if [ ! -f {params.genomesdir}/${{f%%.gz}} ]; then echo "copying file over" scp ${{col1}} {params.genomesdir} gunzip {params.genomesdir}/${{f}} fi echo 'dnadiff ${{col1}} ${{col2}} -p ${{col1%%.fasta}}_${{col2%%.fa}}_' dnadiff {params.genomesdir}/${{f%%.gz}} ${{col2}} -p {params.outdir}/{params.bins} done < {input.bestmash} """ |
185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 | run: outfile = str(output) f = open(input.dnadiff) data = f.read() first_line = data.split("\n", 1)[0] a = first_line.split(" ") ref = a[0] quer = a[1] with open(outfile, "w") as outf: path_dna = input.dnadiff base = os.path.basename(path_dna) base = base.split(".report")[0] with open(path_dna) as f: for line in f: if "TotalBases" in line: cols = line.split() lenref = int(cols[1]) lenquer = int(cols[2]) if "AlignedBases" in line: cols = line.split() aliref = cols[1].split("(")[-1].split("%")[0] alique = cols[2].split("(")[-1].split("%")[0] if "AvgIdentity" in line: cols = line.split() ident = float(cols[1]) line = "%s\t%s\t%i\t%.2f\t%i\t%.2f\t%.2f" % (ref, quer, lenref, float(aliref), lenquer, float(alique), float(ident)) outf.writelines(line + "\n") |
218 219 220 221 | shell: """ cat {input}>{output} """ |
32 33 34 35 36 37 38 | shell: """ rm -rf {params.outdir} dmesg -T spades.py --meta -1 {input.fwd} -2 {input.rev} \ -o {params.outdir} -t {threads} -m {resources.mem} """ |
54 55 56 57 58 59 60 | shell: """ rm -rf {params.outdir} dmesg -T spades.py --meta -1 {input.fwd} -2 {input.rev} \ -o {params.outdir} -t {threads} -m {resources.mem} """ |
67 68 69 70 71 72 73 74 | shell: """ rm -rf {params.outdir} metawrap binning -t {threads} -m {resources.mem}\ -a {input.fasta} --maxbin2 --metabat2 --concoct \ -l {params.mincontiglength} -o {params.outdir} {input.fwd} {input.rev} touch {output.outfile} """ |
93 94 95 96 97 98 99 100 | shell: """ rm -rf {params.outdir} metawrap binning -t {threads} -m {resources.mem} \ -a {params.assembly} --metabat2 --maxbin2 --concoct \ -l {params.mincontiglength} -o {params.outdir} {params.reads} touch {output} """ |
35 36 37 38 39 | shell: """ cat {input.r1} > {output.r1} cat {input.r2} > {output.r2} """ |
51 52 53 54 55 56 | shell: """ repair.sh in={input.fwd} in2={input.rev} out={output.fwd} out2={output.rev} repair rm -f {input.fwd} rm -f {input.rev} """ |
69 70 71 72 73 74 75 76 77 78 79 | shell: """ rm -f {params.fwd} rm -f {params.rev} rm -f {output.fwd} rm -f {output.rev} gzip {input.fwd} gzip {input.rev} mv {params.fwd} {output.fwd} mv {params.rev} {output.rev} """ |
89 90 91 92 93 94 95 96 97 98 | run: outfile = str(output) if os.path.exists(outfile): os.remove(outfile) with open(outfile, "w") as outf: outf.writelines("Run\tReadcount\n") for run in input: readcount = int(linecount(run)) line = run + "\t" + str(readcount) outf.writelines(line + "\n") |
29 30 31 32 33 34 | shell: """ rm -rf {params.outdir} dRep dereplicate -p {threads} {params.outdir} -g {params.indir}/*.fa \ -pa 0.9 -sa 0.95 -nc 0.30 -cm larger --genomeInfo {input.checkm} -comp 50 -con 5 """ |
45 46 47 48 49 50 | shell: """ awk 'FNR!=1' {input.checkm} {input.checkm_coas}>{params.checkm} echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm} > {output} rm {params.checkm} """ |
66 67 68 69 70 71 72 73 74 | shell: """ rm -rf {params.outdir} mkdir -p {params.indir} rsync {params.singlerun_dRep}/* {params.indir} rsync {params.all_coas}/* {params.indir} dRep dereplicate -p {threads} {params.outdir} -g {params.indir}/*.fa \ -pa 0.9 -sa 0.95 -nc 0.30 -cm larger --genomeInfo {input.checkm} -comp 50 -con 5 """ |
88 89 90 91 92 93 94 | shell: """ real=$(realpath {input.gtdbrelease}) rm -rf {params.outdir} export GTDBTK_DATA_PATH=${{real}} gtdbtk classify_wf --cpus {threads} --genome_dir {params.indir} --out_dir {params.outdir} -x {params.ext} """ |
102 103 104 105 | shell: """ Rscript scripts/plotting/plot_gtdb.R {input} """ |
30 31 32 33 34 35 36 37 | shell: """ checkm data setRoot ~/checkm_database/ rm -rf {params.outdir} rm -rf {output} metawrap bin_refinement -o {params.outdir} -t {threads} -A {params.metabat} -B {params.maxbin} -C {params.concoct} -c 50 -x 10 touch {output} """ |
48 49 50 51 52 53 | shell: """ rm -rf {params.folder} mkdir -p {params.folder} cp {params.refined_folder}/*.fa {params.folder} """ |
72 73 74 75 76 77 78 | shell: """ mkdir -p {params.dir1} mkdir -p {params.dir2} cp {input} {params.out1} cp {input} {params.out2} """ |
86 87 88 89 | shell: """ touch {output} """ |
101 102 103 104 105 106 | shell: """ checkm data setRoot ~/checkm_database/ rm -rf {params.outdir} checkm lineage_wf -t {threads} -x {params.ext} --tab_table -f {output} {params.indir} {params.outdir} """ |
117 118 119 120 121 122 123 124 125 | shell: """ sed -i '1d' {input} cut -f1,12,13,14 {input} | tr '\t' ','>{params.checkm} sed 's/,/.fa,/' {params.checkm}>{params.checkm2} echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm2} > {output} rm {params.checkm} rm {params.checkm2} """ |
30 31 32 33 34 35 36 | shell: """ checkm data setRoot ~/checkm_database/ rm -rf {params.outdir} metawrap bin_refinement -o {params.outdir} -t {threads} -A {params.metabat} -B {params.maxbin} -C {params.concoct} -c 50 -x 10 touch {output} """ |
47 48 49 50 51 52 | shell: """ rm -rf {params.folder} mkdir -p {params.folder} cp {params.refined_folder}/*.fa {params.folder} """ |
72 73 74 75 76 77 78 | shell: """ mkdir -p {params.dir1} mkdir -p {params.dir2} cp {input} {params.out1} cp {input} {params.out2} """ |
86 87 88 89 | shell: """ touch {output} """ |
101 102 103 104 105 106 | shell: """ checkm data setRoot ~/checkm_database/ rm -rf {params.outdir} checkm lineage_wf -t {threads} -x {params.ext} --tab_table -f {output} {params.indir} {params.outdir} """ |
117 118 119 120 121 122 123 124 125 | shell: """ sed -i '1d' {input} cut -f1,12,13,14 {input} | tr '\t' ','>{params.checkm} sed 's/,/.fa,/' {params.checkm}>{params.checkm2} echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm2} > {output} rm {params.checkm} rm {params.checkm2} """ |
135 136 137 138 | shell: """ Rscript scripts/plotting/plot_checkm_mags.R {input.sr} {input.coas} """ |
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | library(ggplot2) args <- commandArgs(trailingOnly = TRUE) checkm_sr=read.csv(args[1],header=TRUE,stringsAsFactor=FALSE) checkm_coas=read.csv(args[2],header=TRUE,stringsAsFactor=FALSE) checkm_sr$Status2="Single run" checkm_coas$Status2="Co-assembly" checkm=rbind.data.frame(checkm_sr,checkm_coas) checkm$Status=factor(checkm$Status, levels=c("Single run", "Co-assembly")) setwd("data/figures/") ggplot(checkm, aes(x=Status, y=completeness, fill=Status)) + geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Completeness") + xlab("Assembly approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+guides(color=guide_legend(title="Approach"))+theme(legend.position="none") ggsave("checkm_completeness.png",plot = last_plot()) ggplot(checkm, aes(x=Status, y=contamination, fill=Status))+geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Contamination") + xlab("Assembly approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+theme(legend.position="none")+ylim(0,10) ggsave("checkm_contam.png",width=5,height=5) |
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 | library(RColorBrewer) library(tidyr) library(ggplot2) args <- commandArgs(trailingOnly = TRUE) gtdb.taxa=read.delim(args[1],header=TRUE,stringsAsFactor=FALSE) ranks = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species") gtdb.taxa = separate(data = gtdb.taxa, col = classification, sep = ";", into = ranks) gtdb.bac = gtdb.taxa[which(gtdb.taxa$Domain == "d__Bacteria"),] # calculate bacteria proportions if(exists("gtdb.fi.bac")){ rm(gtdb.fi.bac) } for (r in ranks[2:(length(ranks)-1)]){ total = nrow(gtdb.bac) #total number of genomes rank.present = table(gtdb.bac[!grepl("__$", gtdb.bac[,r]),r]) #summary of genus total.rank = sum(rank.present) #sum of genus counts gtdb.ranks = data.frame(sort(rank.present)[(length(rank.present)-6):length(rank.present)]) gtdb.ranks$Prop = gtdb.ranks$Freq/total*100 other.name = paste(tolower(substr(r, 1, 1)), "__Other", sep="") novel.name = paste(tolower(substr(r, 1, 1)), "__Novel", sep="") other.freq = total.rank-sum(gtdb.ranks$Freq) novel.freq=total-total.rank #novel genomes other.prop = other.freq/total*100 novel.prop = novel.freq/total*100 other.ranks = data.frame(Var1=other.name, Freq=other.freq, Prop=other.prop) novel.ranks = data.frame(Var1=novel.name, Freq=novel.freq, Prop=novel.prop) ranks.fi = rbind(other.ranks, gtdb.ranks) ranks.fi = rbind(novel.ranks, ranks.fi) ranks.fi$Rank = r ranks.fi$Colour=c("#999999","#CAB2D6","#A6CEE3","#FF7F00","#FB9A99","#E31A1C","#B2DF8A","#33A02C","#1F78B4") if(exists("gtdb.fi.bac")){ gtdb.fi.bac = rbind(gtdb.fi.bac, ranks.fi) } else { gtdb.fi.bac = ranks.fi } } gtdb.fi.bac$Rank=factor(gtdb.fi.bac$Rank,levels=c("Phylum","Class","Order","Family","Genus")) # plot stacked plot taxa counts p<-print(ggplot(gtdb.fi.bac, aes(x=Rank, y=Prop, fill=Var1)) + geom_bar(stat="identity", colour="darkgrey", alpha=0.5, size=0.2, width=0.7) + theme_bw() + ylab("Proportion of species (%)") + coord_flip() + scale_fill_manual(values=as.vector(gtdb.fi.bac$Colour), name="Taxa") #+ guides(fill=FALSE) + scale_x_discrete(limits=ranks[(length(ranks)-1):2]) + theme(legend.position="right") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + theme(axis.title.x = element_text(size=14)) + theme(axis.text.y = element_text(size=12)) + theme(axis.title.y = element_blank()) + theme(axis.text.x = element_text(size=12))) ggsave("data/figures/gtdb_bacteria.png",p,width=15) |
Support
- Future updates
Related Workflows





