Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
MAG Snakemake Workflow
Contents
Overview
This pipeline can be used for recovery and quality assessment of prokaryotic MAGs from short-read, host-associated metagenomic datasets. The data analyzed is specificed using two files, which detail the co-assembly samples and single runs that are to be analyzed. The file runs.txt has the SRA accession for the single runs with a different accession in each line. The file coassembly_runs.txt specifies the co-assembly samples. This file is in tabular format with three columns, the first column specifies the name of the resulting co-assembly sample. The r1 and r2 columns specify the path of the forward and reverse reads constituting each co-assembly sample, with each read path separated by a comma. In this pipeline, we used a small subset of gut dataset previously analyzed by Almeida et al. There are 40 single runs and 2 co-assembly samples. The co-assembly samples are named based on the metadata that was chosen to perform co-assembly. If the runs are not publicly available, they should be placed in the subdirectory data/raw with respect to the Snakefile. The forward and reverse reads must have the extensions _1.fastq and _2.fastq respectively. Note that the version of kneaddata used in this pipeline still requires the read suffixes /1 and /2, so the headers must be formatted accordingly. If the runs are publicly available, specify them in the runs.txt file and the sra_download module will download the runs from the SRA using their SRA accession to the directory data/raw.
System Requirements
Hardware Requirements
HPC with at least 500 gigabytes of memory
The CPU times below are generated using the cluster config file included in the repo
Software Requirements
MAG Snakemake pipeline (https://github.com/Finn-Lab/MAG_Snakemake_wf)
Singularity 3.5.0 (https://github.com/hpcng/singularity)
Snakemake (version 5.18) (https://github.com/snakemake/snakemake)
Running the MAG Snakemake pipeline will automatically download the sequencing data from the SRA. It will also download the relevant singularity containers so the relevant software needed for our pipeline can be used. Alternatively, the tools can be manually downloaded from:
ncbi-genome-download (version 0.3.0) (https://github.com/kblin/ncbi-genome-download)
mash (version 2.2.1) (https://github.com/marbl/Mash)
parallel-fastq-dump (version 0.6.6) & fastq-dump (version 2.8.0) (https://github.com/rvalieris/parallel-fastq-dump)
fastqc (version 0.11.7) (https://github.com/s-andrews/FastQC)
multiqc (version 1.3) (https://github.com/ewels/MultiQC)
kneaddata (version 0.7.4) with Trimmomatic (version 0.39) & Bowtie (version 2.4.2) (https://github.com/biobakery/kneaddata)
metaSPAdes (version 3.14.0) (https://github.com/ablab/spades)
metaWRAP (version 1.2.2) (https://github.com/bxlab/metaWRAP)
CheckM (version 1.0.12) (https://github.com/Ecogenomics/CheckM)
Bowtie (version 2.4.1) (https://github.com/BenLangmead/bowtie2)
Prokka (version 1.14.5) (https://github.com/tseemann/prokka)
CMSeq (version 1.0) (https://bitbucket.org/CibioCM/cmseq)
mummer (version 3.23) (https://github.com/mummer4/mummer)
dRep (version 2.3.2) (https://github.com/MrOlm/drep)
GTDB_Tk (version 1.2.0) (https://github.com/Ecogenomics/GTDBTk)
bwa (version 0.7.17) (https://github.com/lh3/bwa)
samtools (version 1.9) (https://github.com/samtools/samtools)
Other
RefSeq complete bacterial genomes (downloaded May 2020) (https://www.ncbi.nlm.nih.gov/refseq/)
GTDB database (release 95) (https://data.ace.uq.edu.au/public/gtdb/data/releases/)
Workflow setup
Download the GTDB database using:
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/auxillary_files/gtdbtk_r95_data.tar.gz
Download all RefSeq bacterial genomes using:
ncbi-genome-download bacteria --formats fasta --section refseq --assembly-levels complete
Next generate a Mash sketch of the database with default k-mer and sketch size from the main directory using:
mash sketch -o refseq.msh /path/to/RefSeq/*fasta
Download the code for the pipeline from (https://github.com/Finn-Lab/MAG_Snakemake_wf) in a location that has at least 1.5 TB of disk space. Change directory to this folder. Move the GTDB database and the RefSeq Mash sketch to the subfolder /data/databases using:
cd /path/to/MAG_Snakemake_wf/
mkdir -p data/databases
mv /path/to/refseq.msh data/databases
mv /path/to/gtdbtk_r95_data.tar.gz data/databases
tar -xvzf data/databases/gtdbtk_r95_data.tar.gz
Install snakemake into an environment using:
conda create -c conda-forge -c bioconda -n snakemake snakemake=5.18
Then activate the environment before using snakemake:
conda activate snakemake
Running pipeline
Submitting jobs
To run pipeline on the small gut dataset specified in runs.txt and coassembly_runs.txt, submit jobs with SLURM scheduler:
snakemake --use-singularity --restart-times 3 -k -j 50 --cluster-config clusterconfig.yaml --cluster "sbatch -n {cluster.nCPU} --mem {cluster.mem} -e {cluster.error} -o {cluster.output} -t {cluster.time}"
Submit jobs with LSF scheduler:
snakemake --use-singularity --restart-times 3 -k --jobs 50 --cluster-config clusterconfig.yaml --cluster "bsub -n {cluster.nCPU} -M {cluster.mem} -e {cluster.error} -o {cluster.output}"
CPU-time
The CPU time for the demo dataset and the cluster configuration file provided is as follows:
Data Download: 16 hours
Preprocessing: 48 hours
Assembly/Co-assembly: 2600 hours
Binning: 100 hours
Quality Assessment & Bin Refinement; Estimate completeness and contamination of MAGs: 25 hours
Quality Assessment & Bin Refinement; Estimate strain heterogeneity of MAGs: 700 hours
Quality Assessment & Bin Refinement; Compare MAGs to RefSeq genomes: 1 hour
Quality Assessment & Bin Refinement; Bin refinement: 260 hours
Dereplicate MAGs: 4 hours
Taxonomic Classification: 3 hours
Evaluate Bottlenecks: 120 hours
Citation
For a walk-through of this pipeline, please visit: Saheb Kashaf, S., Almeida, A., Segre, J.A. et al. Recovering prokaryotic genomes from host-associated, short-read shotgun metagenomic sequencing data. Nat Protoc (2021). https://doi.org/10.1038/s41596-021-00508-2
To generate results from the paper associated with this pipeline, please select a figure to reproduce in the Snakefile.
Code Snippets
34 35 36 37 38 39 40 | shell: """ rm -rf {params.outdir} dmesg -T spades.py --meta -1 {input.fwd} -2 {input.rev} \ -o {params.outdir} -t {threads} -m {resources.mem} """ |
58 59 60 61 62 63 64 | shell: """ rm -rf {params.outdir} dmesg -T spades.py --meta -1 {input.fwd} -2 {input.rev} \ -o {params.outdir} -t {threads} -m {resources.mem} """ |
69 70 71 72 73 74 75 76 | shell: """ rm -rf {params.outdir} metawrap binning -t {threads} -m {resources.mem}\ -a {input.fasta} --maxbin2 --metabat2 --concoct \ -l {params.mincontiglength} -o {params.outdir} {input.fwd} {input.rev} touch {output.outfile} """ |
97 98 99 100 101 102 103 104 | shell: """ rm -rf {params.outdir} metawrap binning -t {threads} -m {resources.mem} \ -a {params.assembly} --metabat2 --maxbin2 --concoct \ -l {params.mincontiglength} -o {params.outdir} {params.reads} touch {output} """ |
40 41 42 43 | shell: """ scripts/rename_multifasta_prefix.py -f {input.MAG} -p {params.name} > {output} """ |
64 65 66 67 68 | shell: """ prokka {input} --kingdom Bacteria --outdir {params.out_prokka} \ --prefix {params.prefix} --force --locustag {params.prefix} --cpus {threads} """ |
82 83 84 85 | shell: """ scripts/rename_multifasta_prefix.py -f {input.MAG} -p {params.name} > {output} """ |
106 107 108 109 110 | shell: """ prokka {input} --kingdom Bacteria --outdir {params.out_prokka} \ --prefix {params.prefix} --force --locustag {params.prefix} --cpus {threads} """ |
135 136 137 138 | shell: """ scripts/cmseq.sh -t {threads} -i {input.r1} -n {input.r2} -r {input.MAG} -g {input.prokka} -o {params.name} """ |
172 173 174 175 176 177 178 179 180 181 | shell: """ rm -f {output.cmseq} rm -f {output.done} for i in {params.r1}; do run=$(basename ${{i}} _1.fastq); scripts/cmseq.sh\ -t {threads} -i ${{i}} -n ${{i%%_1.fastq}}_2.fastq -r {input.MAG} -g {input.prokka}\ -o {params.name}_${{run}}; awk 'NR==2' {params.name}_${{run}}.cmseq.csv >> {output.cmseq};done touch {output.done} """ |
191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 | run: if os.path.exists(str(output)): os.remove(str(output)) else: print("Aggregating single run strain heterogeneity results") path = str(params) file1 = open(str(output), "w") for filename in glob.glob(os.path.join(path, "*.csv")): with open(filename, "r") as f: lines = f.readlines() if len(lines) > 1: lines_sub = lines[1].strip() L = filename + "\t" + lines_sub + "\n" file1.writelines(L) file1.close() |
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 | run: if os.path.exists(str(output)): os.remove(str(output)) else: print("Aggregating coassembly strain heterogeneity results") path = str(params) file1 = open(str(output), "w") for filename in glob.glob(os.path.join(path, "*SRR*.csv")): with open(filename, "r") as f: lines = f.readlines() if len(lines) > 1: lines_sub = lines[1].strip() L = filename + "\t" + lines_sub + "\n" file1.writelines(L) file1.close() |
239 240 241 242 | shell: """ cat {input}>{output} """ |
250 251 252 253 | shell: """ cat {input}>{output} """ |
264 265 266 267 | shell: """ Rscript scripts/plotting/plot_cmseq.R {input.sr} {input.coas} """ |
35 36 37 38 39 | shell: """ cat {input.r1} > {output.r1} cat {input.r2} > {output.r2} """ |
51 52 53 54 55 56 | shell: """ repair.sh in={input.fwd} in2={input.rev} out={output.fwd} out2={output.rev} repair rm -f {input.fwd} rm -f {input.rev} """ |
69 70 71 72 73 74 75 76 77 78 79 | shell: """ rm -f {params.fwd} rm -f {params.rev} rm -f {output.fwd} rm -f {output.rev} gzip {input.fwd} gzip {input.rev} mv {params.fwd} {output.fwd} mv {params.rev} {output.rev} """ |
89 90 91 92 93 94 95 96 97 98 | run: outfile = str(output) if os.path.exists(outfile): os.remove(outfile) with open(outfile, "w") as outf: outf.writelines("Run\tReadcount\n") for run in input: readcount = int(linecount(run)) line = run + "\t" + str(readcount) outf.writelines(line + "\n") |
29 30 | shell: "mash dist -p {threads} {input.db} {input.bins} > {output}" |
43 44 45 46 | shell: """ sort -gk3 {input.mashdist}|sed -n 1p >{output} """ |
62 63 64 65 66 67 68 69 | shell: """ while read col1 col2 rem do echo 'dnadiff ${{col1}} ${{col2}} -p ${{col1%%.fasta}}_${{col2%%.fa}}_' dnadiff ${{col1}} ${{col2}} -p {params.outdir}/{params.bins} done < {input.bestmash} """ |
83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 | run: outfile = str(output) f = open(input.dnadiff) data = f.read() first_line = data.split("\n", 1)[0] a = first_line.split(" ") ref = a[0] quer = a[1] with open(outfile, "w") as outf: path_dna = input.dnadiff base = os.path.basename(path_dna) base = base.split(".report")[0] with open(path_dna) as f: for line in f: if "TotalBases" in line: cols = line.split() lenref = int(cols[1]) lenquer = int(cols[2]) if "AlignedBases" in line: cols = line.split() aliref = cols[1].split("(")[-1].split("%")[0] alique = cols[2].split("(")[-1].split("%")[0] if "AvgIdentity" in line: cols = line.split() ident = float(cols[1]) line = "%s\t%s\t%i\t%.2f\t%i\t%.2f\t%.2f" % (ref, quer, lenref, float(aliref), lenquer, float(alique), float(ident)) outf.writelines(line + "\n") |
118 119 120 121 | shell: """ cat {input}>{output} """ |
129 130 131 132 | shell: """ cat {input}>{output} """ |
147 148 149 150 | shell: """ cat {input}>{output} """ |
162 163 164 165 | shell: """ Rscript scripts/plotting/dnadiff_plot.R {input.checkm_sr} {input.checkm_coas} {input.summ} """ |
31 32 33 34 35 36 | shell: """ rm -rf {params.outdir} dRep dereplicate -p {threads} {params.outdir} -g {params.indir}/*.fa \ -pa 0.9 -sa 0.95 -nc 0.30 -cm larger --genomeInfo {input.checkm} -comp 50 -con 5 """ |
47 48 49 50 51 52 | shell: """ awk 'FNR!=1' {input.checkm} {input.checkm_coas}>{params.checkm} echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm} > {output} rm {params.checkm} """ |
70 71 72 73 74 75 76 77 78 | shell: """ rm -rf {params.outdir} mkdir -p {params.indir} rsync {params.singlerun_dRep}/* {params.indir} rsync {params.all_coas}/* {params.indir} dRep dereplicate -p {threads} {params.outdir} -g {params.indir}/*.fa \ -pa 0.9 -sa 0.95 -nc 0.30 -cm larger --genomeInfo {input.checkm} -comp 50 -con 5 """ |
94 95 96 97 98 99 100 | shell: """ real=$(realpath {input.gtdbrelease}) rm -rf {params.outdir} export GTDBTK_DATA_PATH=${{real}} gtdbtk classify_wf --cpus {threads} --genome_dir {params.indir} --out_dir {params.outdir} -x {params.ext} """ |
110 111 112 113 | shell: """ Rscript scripts/plotting/plot_gtdb.R {input} """ |
33 34 35 36 37 38 39 40 41 42 43 44 | shell: """ rm -rf {params.dir} mkdir -p {params.dir} scp {input.scaffold} {params.scaffold} bwa index {params.scaffold} bwa mem -t {threads} {params.scaffold} {input.fwd} {input.rev} | samtools view -bS - | \ samtools sort -@ {threads} -o {params.alignedsorted} - samtools index {params.alignedsorted} samtools flagstat {params.alignedsorted} > {output.flagstat} rm {params.alignedsorted} """ |
54 55 56 57 | shell: """ crimson flagstat {input}>{output} """ |
65 66 67 68 69 70 71 72 73 74 75 76 77 | run: with open(str(output), "w") as outf: for i in input: run = str(i).split("singlerun/")[1] run = run.split("/mapreads/")[0] dict = eval(open(str(i), "r").read()) supplementary = dict["pass_qc"]["supplementary"] secondary = dict["pass_qc"]["secondary"] mapped = dict["pass_qc"]["mapped"] total = dict["pass_qc"]["total"] perc = (mapped - supplementary - secondary) outf.write(str(run) + "\t" + str(perc) + "\n") outf.close() |
94 95 96 97 98 | shell: """ cat {params.indir}/*fa>{output} bwa index {output} """ |
110 111 112 113 114 | shell: """ cat {params.indir}/*.fa>{output} bwa index {output} """ |
128 129 130 131 132 133 134 135 | shell: """ bwa mem -t {threads} {input.catalogue} {input.fwd} {input.rev} \ | samtools view -bS - | samtools sort -@ {threads} -o {params.alignedsorted} - samtools index {params.alignedsorted} samtools flagstat {params.alignedsorted} > {output.flagstat} rm {params.alignedsorted} """ |
149 150 151 152 153 154 155 156 | shell: """ bwa mem -t {threads} {input.catalogue} {input.fwd} {input.rev}\ | samtools view -bS - | samtools sort -@ {threads} -o {params.alignedsorted} - samtools index {params.alignedsorted} samtools flagstat {params.alignedsorted} > {output.flagstat} rm {params.alignedsorted} """ |
166 167 168 169 | shell: """ crimson flagstat {input}>{output} """ |
179 180 181 182 | shell: """ crimson flagstat {input}>{output} """ |
190 191 192 193 194 195 196 197 198 199 200 201 202 | run: with open(str(output), "w") as outf: for i in input: base = os.path.basename(str(i)) run = os.path.splitext(base)[0] dict = eval(open(str(i), "r").read()) supplementary = dict["pass_qc"]["supplementary"] secondary = dict["pass_qc"]["secondary"] mapped = dict["pass_qc"]["mapped"] total = dict["pass_qc"]["total"] perc = (mapped - supplementary - secondary) outf.write(str(run) + "\t" + str(perc) + "\n") outf.close() |
211 212 213 214 215 216 217 218 219 220 221 222 223 | run: with open(str(output), "w") as outf: for i in input: base = os.path.basename(str(i)) run = os.path.splitext(base)[0] dict = eval(open(str(i), "r").read()) supplementary = dict["pass_qc"]["supplementary"] secondary = dict["pass_qc"]["secondary"] mapped = dict["pass_qc"]["mapped"] total = dict["pass_qc"]["total"] perc = (mapped - supplementary - secondary) outf.write(str(run) + "\t" + str(perc) + "\n") outf.close() |
241 242 243 244 | shell: """ Rscript scripts/plotting/plot_framework.R {input.readcounts} {input.flagstat} {input.mapreads_sr} {input.mapreads_coas} {params.summary} """ |
30 31 32 33 | shell: """ fastqc {input.fwd} --outdir {params.outdir} """ |
47 48 49 50 | shell: """ fastqc {input.rev} --outdir {params.outdir} """ |
65 66 67 68 69 70 | shell: """ rm -f {params.outfile} multiqc {params.indir} -o {params.outdir} mv {params.outfile} {output} """ |
85 86 87 88 | shell: """ kneaddata_database --download human_genome bowtie2 {params.outdir} """ |
115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 | shell: """ #this version of kneaddata requires read identifiers (/1, /2). Given we did not download the file from the sra using the option --readids, we need to remove spaces in the headers # and then add /1 and /2 using reformat.sh. Kneaddata by default does not sort the reads so Next we sort the reads. # Alternatively, one could sort using kneaddata with the option --reorder rm -f {params.tmp_fwd} {params.tmp_rev} {params.tmp_fwd2} {params.tmp_rev2} {params.fwd} {params.rev} {output.fwd} {output.rev} mkdir -p {params.outdir} seqtk seq -C {input.fwd} > {params.tmp_fwd} seqtk seq -C {input.rev} > {params.tmp_rev} reformat.sh in={params.tmp_fwd} in2={params.tmp_rev} out1={params.tmp_fwd2} out2={params.tmp_rev2} addslash spaceslash=f kneaddata --remove-intermediate-output --threads {threads} \ --input {params.tmp_fwd2} --input {params.tmp_rev2}\ --output {params.outdir} \ --reference-db {params.indx} \ --trimmomatic-options "ILLUMINACLIP:/data/adapters/TruSeq3-PE.fa:2:30:10: SLIDINGWINDOW:4:20 MINLEN:50" --trimmomatic /data/\ --bowtie2-options "--very-sensitive --dovetail" repair.sh in={params.fwd} in2={params.rev} out={output.fwd} out2={output.rev} repair rm {params.tmp_fwd} {params.tmp_rev} {params.tmp_fwd2} {params.tmp_rev2} """ |
147 148 149 150 | shell: """ fastqc {input.fwd} --outdir {params.outdir} """ |
163 164 165 166 | shell: """ fastqc {input.rev} --outdir {params.outdir} """ |
181 182 183 184 185 | shell: """ multiqc --force {params.indir} -o {params.outdir} mv {params.outfile} {output} """ |
198 199 200 201 202 203 204 205 206 207 | run: outfile = str(output) if os.path.exists(outfile): os.remove(outfile) with open(outfile, "w") as outf: outf.writelines("Run\tReadcount\n") for run in input: readcount = int(linecount(run)) line = run + "\t" + str(readcount) outf.writelines(line + "\n") |
32 33 34 35 36 37 38 39 | shell: """ checkm data setRoot ~/checkm_database/ rm -rf {params.outdir} rm -rf {output} metawrap bin_refinement -o {params.outdir} -t {threads} -A {params.metabat} -B {params.maxbin} -C {params.concoct} -c 50 -x 10 touch {output} """ |
50 51 52 53 54 55 | shell: """ rm -rf {params.folder} mkdir -p {params.folder} cp {params.refined_folder}/*.fa {params.folder} """ |
74 75 76 77 78 79 80 | shell: """ mkdir -p {params.dir1} mkdir -p {params.dir2} cp {input} {params.out1} cp {input} {params.out2} """ |
88 89 90 91 | shell: """ touch {output} """ |
105 106 107 108 109 110 | shell: """ checkm data setRoot ~/checkm_database/ rm -rf {params.outdir} checkm lineage_wf -t {threads} -x {params.ext} --tab_table -f {output} {params.indir} {params.outdir} """ |
121 122 123 124 125 126 127 128 129 | shell: """ sed -i '1d' {input} cut -f1,12,13,14 {input} | tr '\t' ','>{params.checkm} sed 's/,/.fa,/' {params.checkm}>{params.checkm2} echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm2} > {output} rm {params.checkm} rm {params.checkm2} """ |
32 33 34 35 36 37 38 | shell: """ checkm data setRoot ~/checkm_database/ rm -rf {params.outdir} metawrap bin_refinement -o {params.outdir} -t {threads} -A {params.metabat} -B {params.maxbin} -C {params.concoct} -c 50 -x 10 touch {output} """ |
49 50 51 52 53 54 | shell: """ rm -rf {params.folder} mkdir -p {params.folder} cp {params.refined_folder}/*.fa {params.folder} """ |
74 75 76 77 78 79 80 | shell: """ mkdir -p {params.dir1} mkdir -p {params.dir2} cp {input} {params.out1} cp {input} {params.out2} """ |
88 89 90 91 | shell: """ touch {output} """ |
105 106 107 108 109 110 | shell: """ checkm data setRoot ~/checkm_database/ rm -rf {params.outdir} checkm lineage_wf -t {threads} -x {params.ext} --tab_table -f {output} {params.indir} {params.outdir} """ |
121 122 123 124 125 126 127 128 129 | shell: """ sed -i '1d' {input} cut -f1,12,13,14 {input} | tr '\t' ','>{params.checkm} sed 's/,/.fa,/' {params.checkm}>{params.checkm2} echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm2} > {output} rm {params.checkm} rm {params.checkm2} """ |
141 142 143 144 | shell: """ Rscript scripts/plotting/plot_checkm_mags.R {input.sr} {input.coas} """ |
32 33 34 35 36 37 | shell: """ echo {threads} prefetch {params.run} && vdb-validate {params.run} && parallel-fastq-dump --threads {threads} \ --outdir {params.outdir} --skip-technical --split-3 --sra-id {params.run} --gzip """ |
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 | usage() { cat << EOF usage: $0 options CMseq workflow to infer strain heterogeneity of MAG. 'bsub' with '-M 5000' OPTIONS: -t Number of threads [REQUIRED] -i Input first or only read (.fastq or .fastq.gz format) [REQUIRED] -n Input reverse read (.fastq or fastq.gz [OPTIONAL] -r MAG genome file (.fa or .fasta) [REQUIRED] -g MAG prokka output (.gff) -o Output files prefix (include path) [REQUIRED] EOF } # variables threads= ref= reads= reads2= outprefix= gff= while getopts "t:m:i:n:r:o:g:" OPTION do case ${OPTION} in t) threads=${OPTARG} ;; i) reads=${OPTARG} ;; n) reads2=${OPTARG} ;; r) ref=${OPTARG} ;; o) outprefix=${OPTARG} ;; g) gff=${OPTARG} ;; ?) usage exit ;; esac done # check arguments if [[ -z ${threads} ]] || [[ -z ${reads} ]] || [[ -z ${ref} ]] || [[ -z ${outprefix} ]] || [[ -z ${gff} ]] then echo "ERROR : Please supply correct arguments" usage exit 1 fi timestamp() { date +"%H:%M:%S" } readname=$(basename ${outprefix}) refix=$(basename ${ref%%.fa*}) if [ ${threads} -eq 1 ] then threads_sam=1 else threads_sam=$((${threads}-1)) fi # index reference file if [[ ! -s ${outprefix}.1.bt2 ]] then echo "$(timestamp) [ CMseq ] Indexing MAG FASTA file" bowtie2-build ${ref} ${outprefix} else echo "$(timestamp) [ CMseq ] Index found, skipping bowtie2-build" fi # initial mapping and sorting echo "$(timestamp) [ CMseq ] Mapping reads against MAG ..." if [[ -z ${reads2} ]] then bowtie2 --very-sensitive-local -x ${outprefix} -p ${threads} -U ${reads} | samtools view -@ ${threads_sam} -uS - | samtools sort -@ ${threads_sam} - -o ${outprefix}.bam samtools index -@ ${threads_sam} ${outprefix}.bam else bowtie2 --very-sensitive-local -x ${outprefix} -p ${threads} -1 ${reads} -2 ${reads2} | samtools view -@ ${threads_sam} -uS - | samtools sort -@ ${threads_sam} - -o ${outprefix}.bam samtools index -@ ${threads_sam} ${outprefix}.bam fi # calculate polymorphic sites echo "$(timestamp) [ CMseq ] Checking polymorphic sites ..." scripts/polymut.py -c ${ref} ${outprefix}.bam --gff_file ${gff} --mincov 10 --minqual 30 --dominant_frq_thrsh 0.8 > ${outprefix}.cmseq.csv rm -f ${outprefix}.bam ${outprefix}.bai |
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 | library(ggplot2) library(gridExtra) args <- commandArgs(trailingOnly = TRUE) checkm_sr=read.csv(args[1],header=TRUE,stringsAsFactor=FALSE) checkm_coas=read.csv(args[2],header=TRUE,stringsAsFactor=FALSE) dnadiff=read.table(args[3],header=TRUE,stringsAsFactor=FALSE) colnames(dnadiff)=c("Ref","genome","Ref_length", "Refcovered","Query_length","Queryaligned", "ANI") dnadiff$genome= gsub(".*/","",dnadiff$genome) dnadiff$genome=gsub(".fa","",dnadiff$genome) checkm_sr$`Assembly approach`="Single run" checkm_coas$`Assembly approach`="Co-assembly" checkm=rbind.data.frame(checkm_sr,checkm_coas) checkm$Quality="Medium quality" checkm$Quality[checkm$completeness>=90&checkm$contamination<=5]="High quality" checkm$genome=gsub(".fa","",checkm$genome) comb=merge(dnadiff,checkm,by="genome") left<-ggplot(comb, aes(x = Queryaligned, y = Refcovered, color=`Assembly approach`, shape=Quality)) + geom_point(stroke = 1,size=1)+xlab("MAG Aligned (%)")+ylab("Reference Aligned (%)") + theme_classic() +scale_color_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+xlim(0,100)+ ylim(0,100)+scale_shape_manual(values = c(1,2))+guides(color=guide_legend(title="Approach")) right<- ggplot(comb, aes(x=ANI,color=`Assembly approach`)) + geom_density()+theme_classic()+xlab("ANI") +xlim(0,100)+ylab("Density")+scale_color_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+guides(color=guide_legend(title="Approach")) p<-grid.arrange(left, right, nrow = 1) ggsave("data/figures/dnadiff.png",p, width=10,height=5) |
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | library(ggplot2) args <- commandArgs(trailingOnly = TRUE) checkm_sr=read.csv(args[1],header=TRUE,stringsAsFactor=FALSE) checkm_coas=read.csv(args[2],header=TRUE,stringsAsFactor=FALSE) checkm_sr$Status2="Single run" checkm_coas$Status2="Co-assembly" checkm=rbind.data.frame(checkm_sr,checkm_coas) checkm$Status=factor(checkm$Status, levels=c("Single run", "Co-assembly")) setwd("data/figures/") ggplot(checkm, aes(x=Status, y=completeness, fill=Status)) + geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Completeness") + xlab("Assembly approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+guides(color=guide_legend(title="Approach"))+theme(legend.position="none") ggsave("checkm_completeness.png",plot = last_plot()) ggplot(checkm, aes(x=Status, y=contamination, fill=Status))+geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Contamination") + xlab("Assembly approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+theme(legend.position="none")+ylim(0,10) ggsave("checkm_contam.png",width=5,height=5) |
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | library(ggplot2) args <- commandArgs(trailingOnly = TRUE) singlerun=read.table(args[1],header=FALSE,stringsAsFactor=FALSE) coassembly=read.table(args[2],header=FALSE,fill=TRUE,stringsAsFactor=FALSE) colnames(singlerun)=c("path","strain") colnames(coassembly)=c("path","strain") coassembly$status="Co-assembly" singlerun$status="Single run" coassembly$strain=as.numeric(as.vector(coassembly$strain)) singlerun$strain=as.numeric(as.vector(singlerun$strain)) coassembly=coassembly[complete.cases(coassembly$strain),] singlerun=singlerun[complete.cases(singlerun$strain),] all=rbind(coassembly,singlerun) all$status=factor(all$status, levels=c("Single run", "Co-assembly")) ggplot(all, aes(x=status, y=strain, fill=status)) +geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Strain Heterogeneity") + xlab("Approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+guides(color=guide_legend(title="Approach"))+ylim(0,12)+theme(legend.position="none") ggsave("data/figures/cmseq_plot.png",width=5,height=5) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | library(ggplot2) library(gridExtra) args <- commandArgs(trailingOnly = TRUE) readcounts=read.table(args[1],header=TRUE,stringsAsFactor=FALSE) df_flag=read.table(args[2],header=FALSE,stringsAsFactor=FALSE) bwa.counts_sr=read.table(args[3],header=FALSE,stringsAsFactor=FALSE) bwa.counts_coas=read.table(args[4],header=FALSE,stringsAsFactor=FALSE) summary_out=args[5] readcounts$Run=gsub("data/00_preprocessing/processed/singlerun/","",readcounts$Run) readcounts=readcounts[!grepl("_2.fastq", readcounts$Run),] readcounts$Run=gsub("_1.fastq","",readcounts$Run) readcounts$Readcount=2*readcounts$Readcount colnames(df_flag)=c("Run","Assembly") df_comb=merge(df_flag,readcounts,by="Run") df_comb$percassemb=df_comb$Assembly/df_comb$Readcount*100 colnames(bwa.counts_sr)=c("Run","Catalogue") bwa.counts_sr$`Assembly Approach`="Single run" colnames(bwa.counts_coas)=c("Run","Catalogue") bwa.counts_coas$`Assembly Approach`="Coassembly" bwa.counts=rbind.data.frame(bwa.counts_sr,bwa.counts_coas) merged=merge(df_comb,bwa.counts, by="Run") merged$percmags=merged$Catalogue/merged$Readcount*100 merged$percassemb=merged$Assembly/merged$Readcount*100 write.csv(merged,summary_out, quote=FALSE,row.names=FALSE) ggplot(merged, aes(x=percmags, y=percassemb, color=`Assembly Approach`)) + geom_point() +xlab("Reads mapping to MAGs (%)") + ylab("Reads mapping to assembly (%)")+theme_classic()+scale_color_manual(breaks=c("Coassembly","Single run"), values=c("#a8ddb5","#c994c7"))+xlim(0,100)+ylim(0,100) ggsave("data/figures/perassemb_perref.png",width=5,height=5) |
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 | library(RColorBrewer) library(tidyr) library(ggplot2) args <- commandArgs(trailingOnly = TRUE) gtdb.taxa=read.delim(args[1],header=TRUE,stringsAsFactor=FALSE) ranks = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species") gtdb.taxa = separate(data = gtdb.taxa, col = classification, sep = ";", into = ranks) gtdb.bac = gtdb.taxa[which(gtdb.taxa$Domain == "d__Bacteria"),] # calculate bacteria proportions if(exists("gtdb.fi.bac")){ rm(gtdb.fi.bac) } for (r in ranks[2:(length(ranks)-1)]){ total = nrow(gtdb.bac) #total number of genomes rank.present = table(gtdb.bac[!grepl("__$", gtdb.bac[,r]),r]) #summary of genus total.rank = sum(rank.present) #sum of genus counts gtdb.ranks = data.frame(sort(rank.present)[(length(rank.present)-6):length(rank.present)]) gtdb.ranks$Prop = gtdb.ranks$Freq/total*100 other.name = paste(tolower(substr(r, 1, 1)), "__Other", sep="") novel.name = paste(tolower(substr(r, 1, 1)), "__Novel", sep="") other.freq = total.rank-sum(gtdb.ranks$Freq) novel.freq=total-total.rank #novel genomes other.prop = other.freq/total*100 novel.prop = novel.freq/total*100 other.ranks = data.frame(Var1=other.name, Freq=other.freq, Prop=other.prop) novel.ranks = data.frame(Var1=novel.name, Freq=novel.freq, Prop=novel.prop) ranks.fi = rbind(other.ranks, gtdb.ranks) ranks.fi = rbind(novel.ranks, ranks.fi) ranks.fi$Rank = r ranks.fi$Colour=c("#999999","#CAB2D6","#A6CEE3","#FF7F00","#FB9A99","#E31A1C","#B2DF8A","#33A02C","#1F78B4") if(exists("gtdb.fi.bac")){ gtdb.fi.bac = rbind(gtdb.fi.bac, ranks.fi) } else { gtdb.fi.bac = ranks.fi } } gtdb.fi.bac$Rank=factor(gtdb.fi.bac$Rank,levels=c("Phylum","Class","Order","Family","Genus")) # plot stacked plot taxa counts p<-print(ggplot(gtdb.fi.bac, aes(x=Rank, y=Prop, fill=Var1)) + geom_bar(stat="identity", colour="darkgrey", alpha=0.5, size=0.2, width=0.7) + theme_bw() + ylab("Proportion of species (%)") + coord_flip() + scale_fill_manual(values=as.vector(gtdb.fi.bac$Colour), name="Taxa") #+ guides(fill=FALSE) + scale_x_discrete(limits=ranks[(length(ranks)-1):2]) + theme(legend.position="right") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + theme(axis.title.x = element_text(size=14)) + theme(axis.text.y = element_text(size=12)) + theme(axis.title.y = element_blank()) + theme(axis.text.x = element_text(size=12))) ggsave("data/figures/gtdb_bacteria.png",p,width=15) |
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | import argparse import sys def ren_fasta(args): n = 0 for line in open(args.fasta_file, "rU"): if line[0] == ">": name = line.strip("\n").replace(">","") n += 1 print ">%s_%i\t%s" % (args.prefix, n, name) else: print line.strip("\n") if __name__ == '__main__': parser = argparse.ArgumentParser(description='Rename multifasta file') parser.add_argument('-f', dest='fasta_file', help='Input FASTA file') parser.add_argument('-p', dest='prefix', help='Header prefix') if len(sys.argv) == 1: parser.print_help() sys.exit() else: args = parser.parse_args() ren_fasta(args) |
Support
- Future updates
Related Workflows





