Recovery of prokaryotic genomes from shotgun metagenomic sequencing data

public 1yr ago 0 bookmarks

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

MAG Snakemake Workflow

Overview
System Requirements
Workflow setup
Running pipeline
CPU time
License
Issues
Citation

Overview

This pipeline can be used for recovery and quality assessment of prokaryotic MAGs from short-read, host-associated metagenomic datasets. The data analyzed is specificed using two files, which detail the co-assembly samples and single runs that are to be analyzed. The file runs.txt has the SRA accession for the single runs with a different accession in each line. The file coassembly_runs.txt specifies the co-assembly samples. This file is in tabular format with three columns, the first column specifies the name of the resulting co-assembly sample. The r1 and r2 columns specify the path of the forward and reverse reads constituting each co-assembly sample, with each read path separated by a comma. In this pipeline, we used a small subset of gut dataset previously analyzed by Almeida et al. There are 40 single runs and 2 co-assembly samples. The co-assembly samples are named based on the metadata that was chosen to perform co-assembly. If the runs are not publicly available, they should be placed in the subdirectory data/raw with respect to the Snakefile. The forward and reverse reads must have the extensions _1.fastq and _2.fastq respectively. Note that the version of kneaddata used in this pipeline still requires the read suffixes /1 and /2, so the headers must be formatted accordingly. If the runs are publicly available, specify them in the runs.txt file and the sra_download module will download the runs from the SRA using their SRA accession to the directory data/raw.

System Requirements

Hardware Requirements

HPC with at least 500 gigabytes of memory

The CPU times below are generated using the cluster config file included in the repo

Software Requirements

MAG Snakemake pipeline (https://github.com/Finn-Lab/MAG_Snakemake_wf)

Singularity 3.5.0 (https://github.com/hpcng/singularity)

Snakemake (version 5.18) (https://github.com/snakemake/snakemake)

Running the MAG Snakemake pipeline will automatically download the sequencing data from the SRA. It will also download the relevant singularity containers so the relevant software needed for our pipeline can be used. Alternatively, the tools can be manually downloaded from:

ncbi-genome-download (version 0.3.0) (https://github.com/kblin/ncbi-genome-download)

mash (version 2.2.1) (https://github.com/marbl/Mash)

parallel-fastq-dump (version 0.6.6) & fastq-dump (version 2.8.0) (https://github.com/rvalieris/parallel-fastq-dump)

fastqc (version 0.11.7) (https://github.com/s-andrews/FastQC)

multiqc (version 1.3) (https://github.com/ewels/MultiQC)

kneaddata (version 0.7.4) with Trimmomatic (version 0.39) & Bowtie (version 2.4.2) (https://github.com/biobakery/kneaddata)

metaSPAdes (version 3.14.0) (https://github.com/ablab/spades)

metaWRAP (version 1.2.2) (https://github.com/bxlab/metaWRAP)

CheckM (version 1.0.12) (https://github.com/Ecogenomics/CheckM)

Bowtie (version 2.4.1) (https://github.com/BenLangmead/bowtie2)

Prokka (version 1.14.5) (https://github.com/tseemann/prokka)

CMSeq (version 1.0) (https://bitbucket.org/CibioCM/cmseq)

mummer (version 3.23) (https://github.com/mummer4/mummer)

dRep (version 2.3.2) (https://github.com/MrOlm/drep)

GTDB_Tk (version 1.2.0) (https://github.com/Ecogenomics/GTDBTk)

bwa (version 0.7.17) (https://github.com/lh3/bwa)

samtools (version 1.9) (https://github.com/samtools/samtools)

Other

RefSeq complete bacterial genomes (downloaded May 2020) (https://www.ncbi.nlm.nih.gov/refseq/)

GTDB database (release 95) (https://data.ace.uq.edu.au/public/gtdb/data/releases/)

Workflow setup

Download the GTDB database using:

wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/auxillary_files/gtdbtk_r95_data.tar.gz

Download all RefSeq bacterial genomes using:

ncbi-genome-download bacteria --formats fasta --section refseq --assembly-levels complete

Next generate a Mash sketch of the database with default k-mer and sketch size from the main directory using:

mash sketch -o refseq.msh /path/to/RefSeq/*fasta

Download the code for the pipeline from (https://github.com/Finn-Lab/MAG_Snakemake_wf) in a location that has at least 1.5 TB of disk space. Change directory to this folder. Move the GTDB database and the RefSeq Mash sketch to the subfolder /data/databases using:

cd /path/to/MAG_Snakemake_wf/
mkdir -p data/databases
mv /path/to/refseq.msh data/databases
mv /path/to/gtdbtk_r95_data.tar.gz data/databases
tar -xvzf data/databases/gtdbtk_r95_data.tar.gz

Install snakemake into an environment using:

conda create -c conda-forge -c bioconda -n snakemake snakemake=5.18

Then activate the environment before using snakemake:

conda activate snakemake

Running pipeline

Submitting jobs

To run pipeline on the small gut dataset specified in runs.txt and coassembly_runs.txt, submit jobs with SLURM scheduler:

snakemake --use-singularity --restart-times 3 -k -j 50 --cluster-config clusterconfig.yaml --cluster "sbatch -n {cluster.nCPU} --mem {cluster.mem} -e {cluster.error} -o {cluster.output} -t {cluster.time}"

Submit jobs with LSF scheduler:

snakemake --use-singularity --restart-times 3 -k --jobs 50 --cluster-config clusterconfig.yaml --cluster "bsub -n {cluster.nCPU} -M {cluster.mem} -e {cluster.error} -o {cluster.output}"

CPU-time

The CPU time for the demo dataset and the cluster configuration file provided is as follows:

Data Download: 16 hours

Preprocessing: 48 hours

Assembly/Co-assembly: 2600 hours

Binning: 100 hours

Quality Assessment & Bin Refinement; Estimate completeness and contamination of MAGs: 25 hours

Quality Assessment & Bin Refinement; Estimate strain heterogeneity of MAGs: 700 hours

Quality Assessment & Bin Refinement; Compare MAGs to RefSeq genomes: 1 hour

Quality Assessment & Bin Refinement; Bin refinement: 260 hours

Dereplicate MAGs: 4 hours

Taxonomic Classification: 3 hours

Evaluate Bottlenecks: 120 hours

Citation

For a walk-through of this pipeline, please visit: Saheb Kashaf, S., Almeida, A., Segre, J.A. et al. Recovering prokaryotic genomes from host-associated, short-read shotgun metagenomic sequencing data. Nat Protoc (2021). https://doi.org/10.1038/s41596-021-00508-2

To generate results from the paper associated with this pipeline, please select a figure to reproduce in the Snakefile.

Code Snippets

shell:
    """
    rm -rf {params.outdir}
    dmesg -T
    spades.py --meta -1 {input.fwd} -2 {input.rev} \
    -o {params.outdir} -t {threads} -m {resources.mem}
    """

SnakeMake SPAdes From line 34 of modules/assembly.Snakefile

shell:
    """
    rm -rf {params.outdir}
    dmesg -T
    spades.py --meta -1 {input.fwd} -2 {input.rev} \
    -o {params.outdir} -t {threads} -m {resources.mem}
    """

SnakeMake SPAdes From line 58 of modules/assembly.Snakefile

shell:
    """
    rm -rf {params.outdir}
    metawrap binning -t {threads} -m {resources.mem}\
    -a {input.fasta} --maxbin2 --metabat2 --concoct \
    -l {params.mincontiglength} -o {params.outdir} {input.fwd} {input.rev}
    touch {output.outfile}
    """

SnakeMake MetaWRAP From line 69 of modules/binning.Snakefile

shell:
    """
    rm -rf {params.outdir}
    metawrap binning -t {threads} -m {resources.mem} \
    -a {params.assembly} --metabat2 --maxbin2 --concoct \
    -l {params.mincontiglength} -o {params.outdir} {params.reads}
    touch {output}
    """

SnakeMake MetaWRAP From line 97 of modules/binning.Snakefile

shell:
    """
    scripts/rename_multifasta_prefix.py -f {input.MAG} -p {params.name} > {output}
    """

SnakeMake From line 40 of modules/cmseq.Snakefile

shell:
    """
    prokka {input} --kingdom Bacteria --outdir {params.out_prokka} \
    --prefix {params.prefix} --force --locustag {params.prefix} --cpus {threads}
    """

SnakeMake metaprokka From line 64 of modules/cmseq.Snakefile

shell:
    """
    scripts/rename_multifasta_prefix.py -f {input.MAG} -p {params.name} > {output}
    """

SnakeMake From line 82 of modules/cmseq.Snakefile

shell:
    """
    prokka {input} --kingdom Bacteria --outdir {params.out_prokka} \
    --prefix {params.prefix} --force --locustag {params.prefix} --cpus {threads}
    """

SnakeMake metaprokka From line 106 of modules/cmseq.Snakefile

shell:
    """
    scripts/cmseq.sh -t {threads} -i {input.r1} -n {input.r2} -r {input.MAG} -g {input.prokka} -o {params.name}
    """

SnakeMake From line 135 of modules/cmseq.Snakefile

shell:
    """
    rm -f {output.cmseq}
    rm -f {output.done}
    for i in {params.r1}; do run=$(basename ${{i}} _1.fastq); scripts/cmseq.sh\
    -t {threads} -i ${{i}} -n ${{i%%_1.fastq}}_2.fastq -r {input.MAG} -g {input.prokka}\
    -o {params.name}_${{run}}; awk 'NR==2' {params.name}_${{run}}.cmseq.csv >> {output.cmseq};done

    touch {output.done}
    """

SnakeMake From line 172 of modules/cmseq.Snakefile

run:
    if os.path.exists(str(output)):
        os.remove(str(output))
    else:
        print("Aggregating single run strain heterogeneity results")
    path = str(params)
    file1 = open(str(output), "w")
    for filename in glob.glob(os.path.join(path, "*.csv")):
        with open(filename, "r") as f:
            lines = f.readlines()
            if len(lines) > 1:
                lines_sub = lines[1].strip()
                L = filename + "\t" + lines_sub + "\n"
                file1.writelines(L)
    file1.close()

SnakeMake From line 191 of modules/cmseq.Snakefile

run:
    if os.path.exists(str(output)):
        os.remove(str(output))
    else:
        print("Aggregating coassembly strain heterogeneity results")
    path = str(params)
    file1 = open(str(output), "w")
    for filename in glob.glob(os.path.join(path, "*SRR*.csv")):
        with open(filename, "r") as f:
            lines = f.readlines()
            if len(lines) > 1:
                lines_sub = lines[1].strip()
                L = filename + "\t" + lines_sub + "\n"
                file1.writelines(L)
    file1.close()

SnakeMake From line 216 of modules/cmseq.Snakefile

shell:
    """
    cat {input}>{output}         
    """

SnakeMake From line 239 of modules/cmseq.Snakefile

shell:
    """
    cat {input}>{output}
    """

SnakeMake From line 250 of modules/cmseq.Snakefile

shell:
    """
    Rscript scripts/plotting/plot_cmseq.R {input.sr} {input.coas}
    """

SnakeMake From line 264 of modules/cmseq.Snakefile

shell:
    """
    cat {input.r1} > {output.r1}
    cat {input.r2} > {output.r2}
    """

SnakeMake From line 35 of modules/coas.Snakefile

shell:
    """
    repair.sh in={input.fwd} in2={input.rev} out={output.fwd} out2={output.rev} repair
    rm -f {input.fwd}
    rm -f  {input.rev}
    """

SnakeMake From line 51 of modules/coas.Snakefile

shell:
    """
    rm -f {params.fwd}
    rm -f {params.rev}
    rm -f {output.fwd}
    rm -f {output.rev}
    gzip {input.fwd}
    gzip {input.rev}
    mv {params.fwd} {output.fwd}
    mv {params.rev} {output.rev} 
    """

SnakeMake From line 69 of modules/coas.Snakefile

run:
    outfile = str(output)
    if os.path.exists(outfile):
        os.remove(outfile)
    with open(outfile, "w") as outf:
        outf.writelines("Run\tReadcount\n")
        for run in input:
            readcount = int(linecount(run))
            line = run + "\t" + str(readcount)
            outf.writelines(line + "\n")

SnakeMake From line 89 of modules/coas.Snakefile

shell:
    "mash dist -p {threads} {input.db} {input.bins} > {output}"

SnakeMake Mash From line 29 of modules/dnadiff.Snakefile

shell:
    """
    sort -gk3 {input.mashdist}|sed -n 1p >{output}
    """

SnakeMake From line 43 of modules/dnadiff.Snakefile

shell:
    """
    while read col1 col2 rem
      do
        echo 'dnadiff ${{col1}} ${{col2}} -p ${{col1%%.fasta}}_${{col2%%.fa}}_'
        dnadiff ${{col1}} ${{col2}} -p {params.outdir}/{params.bins}
      done < {input.bestmash}
    """

SnakeMake From line 62 of modules/dnadiff.Snakefile

run:
    outfile = str(output)
    f = open(input.dnadiff)
    data = f.read()
    first_line = data.split("\n", 1)[0]
    a = first_line.split(" ")
    ref = a[0]
    quer = a[1]
    with open(outfile, "w") as outf:
        path_dna = input.dnadiff
        base = os.path.basename(path_dna)
        base = base.split(".report")[0]
        with open(path_dna) as f:
            for line in f:
                if "TotalBases" in line:
                    cols = line.split()
                    lenref = int(cols[1])
                    lenquer = int(cols[2])
                if "AlignedBases" in line:
                    cols = line.split()
                    aliref = cols[1].split("(")[-1].split("%")[0]
                    alique = cols[2].split("(")[-1].split("%")[0]
                if "AvgIdentity" in line:
                    cols = line.split()
                    ident = float(cols[1])
        line = "%s\t%s\t%i\t%.2f\t%i\t%.2f\t%.2f" % (ref, quer, lenref, float(aliref), lenquer, float(alique), float(ident))
        outf.writelines(line + "\n")

SnakeMake From line 83 of modules/dnadiff.Snakefile

shell:
    """
    cat {input}>{output}
    """

SnakeMake From line 118 of modules/dnadiff.Snakefile

shell:
    """
    cat {input}>{output}
    """

SnakeMake From line 129 of modules/dnadiff.Snakefile

shell:
    """
    cat {input}>{output}
    """

SnakeMake From line 147 of modules/dnadiff.Snakefile

shell:
    """
    Rscript scripts/plotting/dnadiff_plot.R {input.checkm_sr} {input.checkm_coas} {input.summ}
    """

SnakeMake From line 162 of modules/dnadiff.Snakefile

shell:
    """
     rm -rf {params.outdir}
     dRep dereplicate -p {threads} {params.outdir} -g {params.indir}/*.fa \
    -pa 0.9 -sa 0.95 -nc 0.30 -cm larger --genomeInfo {input.checkm} -comp 50 -con 5
    """

SnakeMake dRep From line 31 of modules/dRep_GTDB.Snakefile

shell:
    """
    awk 'FNR!=1' {input.checkm} {input.checkm_coas}>{params.checkm}
    echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm} > {output}
    rm {params.checkm}
    """

SnakeMake From line 47 of modules/dRep_GTDB.Snakefile

shell:
    """
    rm -rf {params.outdir}
    mkdir -p {params.indir}
    rsync {params.singlerun_dRep}/* {params.indir}
    rsync {params.all_coas}/* {params.indir}
    dRep dereplicate -p {threads} {params.outdir} -g {params.indir}/*.fa \
    -pa 0.9 -sa 0.95 -nc 0.30 -cm larger --genomeInfo {input.checkm} -comp 50 -con 5
    """

SnakeMake dRep From line 70 of modules/dRep_GTDB.Snakefile

shell:
    """
    real=$(realpath {input.gtdbrelease})
    rm -rf {params.outdir}
    export GTDBTK_DATA_PATH=${{real}}
    gtdbtk classify_wf --cpus {threads} --genome_dir {params.indir} --out_dir {params.outdir} -x {params.ext}
    """

SnakeMake gtdbtk From line 94 of modules/dRep_GTDB.Snakefile

shell:
    """
    Rscript scripts/plotting/plot_gtdb.R {input}
    """

SnakeMake From line 110 of modules/dRep_GTDB.Snakefile

shell:
    """
    rm -rf {params.dir}
    mkdir -p {params.dir}
    scp {input.scaffold} {params.scaffold}
    bwa index {params.scaffold}
    bwa mem -t {threads} {params.scaffold} {input.fwd} {input.rev} | samtools view -bS - | \
    samtools sort -@ {threads} -o {params.alignedsorted} -
    samtools index {params.alignedsorted}
    samtools flagstat {params.alignedsorted} > {output.flagstat}
    rm {params.alignedsorted}
    """

SnakeMake SAMtools BWA scp From line 33 of modules/framework.Snakefile

shell:
    """
    crimson flagstat {input}>{output}
    """

SnakeMake CRIMSON From line 54 of modules/framework.Snakefile

run:
    with open(str(output), "w") as outf:
        for i in input:
            run = str(i).split("singlerun/")[1]
            run = run.split("/mapreads/")[0]
            dict = eval(open(str(i), "r").read())
            supplementary = dict["pass_qc"]["supplementary"]
            secondary = dict["pass_qc"]["secondary"]
            mapped = dict["pass_qc"]["mapped"]
            total = dict["pass_qc"]["total"]
            perc = (mapped - supplementary - secondary) 
            outf.write(str(run) + "\t" + str(perc) + "\n")
    outf.close()

SnakeMake From line 65 of modules/framework.Snakefile

shell:
    """
    cat {params.indir}/*fa>{output}
    bwa index {output}
    """

SnakeMake BWA From line 94 of modules/framework.Snakefile

shell:
    """
    cat {params.indir}/*.fa>{output}
    bwa index {output}
    """

SnakeMake BWA From line 110 of modules/framework.Snakefile

shell:
    """
    bwa mem -t {threads} {input.catalogue} {input.fwd} {input.rev} \
    | samtools view -bS - | samtools sort -@ {threads} -o {params.alignedsorted} -
    samtools index {params.alignedsorted}
    samtools flagstat {params.alignedsorted} > {output.flagstat}
    rm {params.alignedsorted}
    """

SnakeMake SAMtools BWA From line 128 of modules/framework.Snakefile

shell:
    """
    bwa mem -t {threads} {input.catalogue} {input.fwd} {input.rev}\
    | samtools view -bS - | samtools sort -@ {threads} -o {params.alignedsorted} -
    samtools index {params.alignedsorted}
    samtools flagstat {params.alignedsorted} > {output.flagstat}
    rm {params.alignedsorted}
    """

SnakeMake SAMtools BWA From line 149 of modules/framework.Snakefile

shell:
    """
    crimson flagstat {input}>{output}
    """

SnakeMake CRIMSON From line 166 of modules/framework.Snakefile

shell:
    """
    crimson flagstat {input}>{output}
    """

SnakeMake CRIMSON From line 179 of modules/framework.Snakefile

run:
    with open(str(output), "w") as outf:
        for i in input:
            base = os.path.basename(str(i))
            run = os.path.splitext(base)[0]
            dict = eval(open(str(i), "r").read())
            supplementary = dict["pass_qc"]["supplementary"]
            secondary = dict["pass_qc"]["secondary"]
            mapped = dict["pass_qc"]["mapped"]
            total = dict["pass_qc"]["total"]
            perc = (mapped - supplementary - secondary) 
            outf.write(str(run) + "\t" + str(perc) + "\n")
    outf.close()

SnakeMake From line 190 of modules/framework.Snakefile

run:
    with open(str(output), "w") as outf:
        for i in input:
            base = os.path.basename(str(i))
            run = os.path.splitext(base)[0]
            dict = eval(open(str(i), "r").read())
            supplementary = dict["pass_qc"]["supplementary"]
            secondary = dict["pass_qc"]["secondary"]
            mapped = dict["pass_qc"]["mapped"]
            total = dict["pass_qc"]["total"]
            perc = (mapped - supplementary - secondary) 
            outf.write(str(run) + "\t" + str(perc) + "\n")
    outf.close()

SnakeMake From line 211 of modules/framework.Snakefile

shell:
    """
    Rscript scripts/plotting/plot_framework.R {input.readcounts} {input.flagstat} {input.mapreads_sr} {input.mapreads_coas} {params.summary}
    """

SnakeMake From line 241 of modules/framework.Snakefile

shell:
    """
    fastqc {input.fwd} --outdir {params.outdir}
    """

SnakeMake FastQC From line 30 of modules/preprocessing.Snakefile

shell:
    """
    fastqc {input.rev} --outdir {params.outdir}
    """

SnakeMake FastQC From line 47 of modules/preprocessing.Snakefile

shell:
    """
    rm -f {params.outfile}
    multiqc {params.indir} -o {params.outdir}
    mv {params.outfile} {output}
    """

SnakeMake MultiQC From line 65 of modules/preprocessing.Snakefile

shell:
    """
    kneaddata_database --download human_genome bowtie2 {params.outdir}
    """

SnakeMake Bowtie 2 From line 85 of modules/preprocessing.Snakefile

shell:
    """
    #this version of kneaddata requires read identifiers (/1, /2). Given we did not download the file from the sra using the option --readids, we need to remove spaces in the headers
    # and then add /1 and /2 using reformat.sh. Kneaddata by default does not sort the reads so Next we sort the reads. 
    # Alternatively, one could sort using kneaddata with the option --reorder
    rm -f {params.tmp_fwd} {params.tmp_rev} {params.tmp_fwd2} {params.tmp_rev2} {params.fwd} {params.rev} {output.fwd} {output.rev}
    mkdir -p {params.outdir}
    seqtk seq -C {input.fwd} > {params.tmp_fwd}
    seqtk seq -C {input.rev} > {params.tmp_rev}
    reformat.sh in={params.tmp_fwd} in2={params.tmp_rev} out1={params.tmp_fwd2} out2={params.tmp_rev2} addslash spaceslash=f
    kneaddata --remove-intermediate-output --threads {threads} \
    --input {params.tmp_fwd2} --input {params.tmp_rev2}\
    --output {params.outdir} \
    --reference-db {params.indx} \
    --trimmomatic-options "ILLUMINACLIP:/data/adapters/TruSeq3-PE.fa:2:30:10: SLIDINGWINDOW:4:20 MINLEN:50" --trimmomatic /data/\
    --bowtie2-options "--very-sensitive --dovetail" 
    repair.sh in={params.fwd} in2={params.rev} out={output.fwd} out2={output.rev} repair
    rm {params.tmp_fwd} {params.tmp_rev} {params.tmp_fwd2} {params.tmp_rev2}
    """

SnakeMake seqtk SRA From line 115 of modules/preprocessing.Snakefile

shell:
    """
    fastqc {input.fwd} --outdir {params.outdir}
    """

SnakeMake FastQC From line 147 of modules/preprocessing.Snakefile

shell:
    """
    fastqc {input.rev} --outdir {params.outdir}
    """

SnakeMake FastQC From line 163 of modules/preprocessing.Snakefile

shell:
    """
    multiqc --force {params.indir} -o {params.outdir}
    mv {params.outfile} {output}
    """

SnakeMake MultiQC From line 181 of modules/preprocessing.Snakefile

run:
    outfile = str(output)
    if os.path.exists(outfile):
        os.remove(outfile)
    with open(outfile, "w") as outf:
        outf.writelines("Run\tReadcount\n")
        for run in input:
            readcount = int(linecount(run))
            line = run + "\t" + str(readcount)
            outf.writelines(line + "\n")

SnakeMake From line 198 of modules/preprocessing.Snakefile

shell:
    """
    checkm data setRoot ~/checkm_database/
    rm -rf {params.outdir}
    rm -rf {output}
    metawrap bin_refinement -o {params.outdir} -t {threads} -A {params.metabat} -B {params.maxbin} -C {params.concoct} -c 50 -x 10
     touch {output}
    """

SnakeMake CheckM MetaWRAP From line 32 of modules/refine_coas.Snakefile

shell:
    """
    rm -rf {params.folder}
    mkdir -p {params.folder}
    cp {params.refined_folder}/*.fa {params.folder}
    """

SnakeMake From line 50 of modules/refine_coas.Snakefile

shell:
    """
    mkdir -p {params.dir1}
    mkdir -p {params.dir2}
    cp {input} {params.out1}
    cp {input} {params.out2}
    """

SnakeMake From line 74 of modules/refine_coas.Snakefile

shell:
    """
    touch {output}
    """

SnakeMake From line 88 of modules/refine_coas.Snakefile

shell:
    """
    checkm data setRoot ~/checkm_database/
    rm -rf {params.outdir}
    checkm lineage_wf -t {threads} -x {params.ext} --tab_table -f {output} {params.indir} {params.outdir}
    """

SnakeMake CheckM From line 105 of modules/refine_coas.Snakefile

shell:
    """
    sed -i '1d' {input}
    cut -f1,12,13,14 {input} | tr '\t' ','>{params.checkm}
    sed 's/,/.fa,/' {params.checkm}>{params.checkm2}
    echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm2} > {output}
    rm {params.checkm}
    rm {params.checkm2}
    """

SnakeMake From line 121 of modules/refine_coas.Snakefile

shell:
    """
    checkm data setRoot ~/checkm_database/ 
    rm -rf {params.outdir}
    metawrap bin_refinement -o {params.outdir} -t {threads} -A {params.metabat} -B {params.maxbin} -C {params.concoct} -c 50 -x 10
    touch {output}
    """

SnakeMake CheckM MetaWRAP From line 32 of modules/refine.Snakefile

shell:
    """
    rm -rf {params.folder}
    mkdir -p {params.folder}
    cp {params.refined_folder}/*.fa {params.folder}
    """

SnakeMake From line 49 of modules/refine.Snakefile

shell:
    """
    mkdir -p {params.dir1}
    mkdir -p {params.dir2}
    cp {input} {params.out1}
    cp {input} {params.out2}
    """

SnakeMake From line 74 of modules/refine.Snakefile

shell:
    """
    touch {output}
    """

SnakeMake From line 88 of modules/refine.Snakefile

shell:
    """
    checkm data setRoot ~/checkm_database/
    rm -rf {params.outdir}
    checkm lineage_wf -t {threads} -x {params.ext} --tab_table -f {output} {params.indir} {params.outdir}
    """

SnakeMake CheckM From line 105 of modules/refine.Snakefile

shell:
    """
    sed -i '1d' {input}
    cut -f1,12,13,14 {input} | tr '\t' ','>{params.checkm}
    sed 's/,/.fa,/' {params.checkm}>{params.checkm2}
    echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm2} > {output}
    rm {params.checkm}
    rm {params.checkm2}
    """

SnakeMake From line 121 of modules/refine.Snakefile

shell:
    """
    Rscript scripts/plotting/plot_checkm_mags.R {input.sr} {input.coas}
    """

SnakeMake From line 141 of modules/refine.Snakefile

shell:
    """
    echo {threads}
    prefetch {params.run} && vdb-validate {params.run} && parallel-fastq-dump --threads {threads} \
    --outdir {params.outdir} --skip-technical --split-3 --sra-id {params.run} --gzip
    """

SnakeMake From line 32 of modules/sra_download.Snakefile

usage()
{
cat << EOF
usage: $0 options

CMseq workflow to infer strain heterogeneity of MAG.

'bsub' with '-M 5000'

OPTIONS:
   -t      Number of threads [REQUIRED]
   -i      Input first or only read (.fastq or .fastq.gz format) [REQUIRED]
   -n      Input reverse read (.fastq or fastq.gz [OPTIONAL]
   -r      MAG genome file (.fa or .fasta) [REQUIRED]
   -g      MAG prokka output (.gff)
   -o      Output files prefix (include path) [REQUIRED]
EOF
}

# variables
threads=
ref=
reads=
reads2=
outprefix=
gff=

while getopts "t:m:i:n:r:o:g:" OPTION
do
     case ${OPTION} in
         t)
             threads=${OPTARG}
             ;;
         i)
             reads=${OPTARG}
             ;;
         n)
             reads2=${OPTARG}
             ;;
         r)
             ref=${OPTARG}
             ;;
         o)
             outprefix=${OPTARG}
             ;;
         g)
             gff=${OPTARG}
             ;;
         ?)
             usage
             exit
             ;;
     esac
done

# check arguments
if [[ -z ${threads} ]] || [[ -z ${reads} ]] || [[ -z ${ref} ]] || [[ -z ${outprefix} ]] || [[ -z ${gff} ]]
then
     echo "ERROR : Please supply correct arguments"
     usage
     exit 1
fi


timestamp() {
  date +"%H:%M:%S"
}

readname=$(basename ${outprefix})
refix=$(basename ${ref%%.fa*})
if [ ${threads} -eq 1 ]
then
    threads_sam=1
else
    threads_sam=$((${threads}-1))
fi

# index reference file
if [[ ! -s ${outprefix}.1.bt2 ]]
then
    echo "$(timestamp) [ CMseq ] Indexing MAG FASTA file"
    bowtie2-build ${ref} ${outprefix}
else
    echo "$(timestamp) [ CMseq ] Index found, skipping bowtie2-build"
fi


# initial mapping and sorting
echo "$(timestamp) [ CMseq ] Mapping reads against MAG ..."
if [[ -z ${reads2} ]]
then
    bowtie2 --very-sensitive-local -x ${outprefix} -p ${threads} -U ${reads} | samtools view -@ ${threads_sam} -uS - | samtools sort -@ ${threads_sam} - -o ${outprefix}.bam
    samtools index -@ ${threads_sam} ${outprefix}.bam
else
    bowtie2 --very-sensitive-local -x ${outprefix} -p ${threads} -1 ${reads} -2 ${reads2} | samtools view -@ ${threads_sam} -uS - | samtools sort -@ ${threads_sam} - -o ${outprefix}.bam
    samtools index -@ ${threads_sam} ${outprefix}.bam
fi

# calculate polymorphic sites
echo "$(timestamp) [ CMseq ] Checking polymorphic sites ..."
scripts/polymut.py -c ${ref} ${outprefix}.bam --gff_file ${gff} --mincov 10 --minqual 30 --dominant_frq_thrsh 0.8 > ${outprefix}.cmseq.csv
rm -f ${outprefix}.bam ${outprefix}.bai

Shell SAMtools Bowtie 2 metaprokka From line 18 of scripts/cmseq.sh

library(ggplot2)
library(gridExtra)

args <- commandArgs(trailingOnly = TRUE)

checkm_sr=read.csv(args[1],header=TRUE,stringsAsFactor=FALSE)
checkm_coas=read.csv(args[2],header=TRUE,stringsAsFactor=FALSE)
dnadiff=read.table(args[3],header=TRUE,stringsAsFactor=FALSE)


colnames(dnadiff)=c("Ref","genome","Ref_length", "Refcovered","Query_length","Queryaligned", "ANI")
dnadiff$genome= gsub(".*/","",dnadiff$genome)
dnadiff$genome=gsub(".fa","",dnadiff$genome)

checkm_sr$`Assembly approach`="Single run"
checkm_coas$`Assembly approach`="Co-assembly"
checkm=rbind.data.frame(checkm_sr,checkm_coas)

checkm$Quality="Medium quality"
checkm$Quality[checkm$completeness>=90&checkm$contamination<=5]="High quality"
checkm$genome=gsub(".fa","",checkm$genome)

comb=merge(dnadiff,checkm,by="genome")
left<-ggplot(comb, aes(x = Queryaligned, y = Refcovered, color=`Assembly approach`, shape=Quality)) +
         geom_point(stroke = 1,size=1)+xlab("MAG Aligned (%)")+ylab("Reference Aligned (%)") + theme_classic() +scale_color_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+xlim(0,100)+ ylim(0,100)+scale_shape_manual(values = c(1,2))+guides(color=guide_legend(title="Approach"))
right<-
ggplot(comb, aes(x=ANI,color=`Assembly approach`)) + geom_density()+theme_classic()+xlab("ANI") +xlim(0,100)+ylab("Density")+scale_color_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+guides(color=guide_legend(title="Approach"))

p<-grid.arrange(left, right, nrow = 1)
ggsave("data/figures/dnadiff.png",p, width=10,height=5)

R ggplot2 gridExtra From line 17 of plotting/dnadiff_plot.R

library(ggplot2)


args <- commandArgs(trailingOnly = TRUE)

checkm_sr=read.csv(args[1],header=TRUE,stringsAsFactor=FALSE)
checkm_coas=read.csv(args[2],header=TRUE,stringsAsFactor=FALSE)

checkm_sr$Status2="Single run"
checkm_coas$Status2="Co-assembly"

checkm=rbind.data.frame(checkm_sr,checkm_coas)
checkm$Status=factor(checkm$Status, levels=c("Single run", "Co-assembly"))

setwd("data/figures/")
ggplot(checkm, aes(x=Status, y=completeness, fill=Status)) +
  geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Completeness") + xlab("Assembly approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+guides(color=guide_legend(title="Approach"))+theme(legend.position="none")
ggsave("checkm_completeness.png",plot = last_plot())

ggplot(checkm, aes(x=Status, y=contamination, fill=Status))+geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Contamination") + xlab("Assembly approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+theme(legend.position="none")+ylim(0,10)
ggsave("checkm_contam.png",width=5,height=5)

R ggplot2 From line 16 of plotting/plot_checkm_mags.R

library(ggplot2)

args <- commandArgs(trailingOnly = TRUE)

singlerun=read.table(args[1],header=FALSE,stringsAsFactor=FALSE)
coassembly=read.table(args[2],header=FALSE,fill=TRUE,stringsAsFactor=FALSE)
colnames(singlerun)=c("path","strain")
colnames(coassembly)=c("path","strain")

coassembly$status="Co-assembly"
singlerun$status="Single run"
coassembly$strain=as.numeric(as.vector(coassembly$strain))
singlerun$strain=as.numeric(as.vector(singlerun$strain))
coassembly=coassembly[complete.cases(coassembly$strain),]
singlerun=singlerun[complete.cases(singlerun$strain),]

all=rbind(coassembly,singlerun)
all$status=factor(all$status, levels=c("Single run", "Co-assembly"))
ggplot(all, aes(x=status, y=strain, fill=status)) +geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Strain Heterogeneity") + xlab("Approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+guides(color=guide_legend(title="Approach"))+ylim(0,12)+theme(legend.position="none")
ggsave("data/figures/cmseq_plot.png",width=5,height=5)

R ggplot2 From line 16 of plotting/plot_cmseq.R

library(ggplot2)
library(gridExtra)

args <- commandArgs(trailingOnly = TRUE)

readcounts=read.table(args[1],header=TRUE,stringsAsFactor=FALSE)
df_flag=read.table(args[2],header=FALSE,stringsAsFactor=FALSE)
bwa.counts_sr=read.table(args[3],header=FALSE,stringsAsFactor=FALSE)
bwa.counts_coas=read.table(args[4],header=FALSE,stringsAsFactor=FALSE)
summary_out=args[5]

readcounts$Run=gsub("data/00_preprocessing/processed/singlerun/","",readcounts$Run)
readcounts=readcounts[!grepl("_2.fastq", readcounts$Run),]
readcounts$Run=gsub("_1.fastq","",readcounts$Run)
readcounts$Readcount=2*readcounts$Readcount

colnames(df_flag)=c("Run","Assembly")
df_comb=merge(df_flag,readcounts,by="Run")
df_comb$percassemb=df_comb$Assembly/df_comb$Readcount*100

colnames(bwa.counts_sr)=c("Run","Catalogue")
bwa.counts_sr$`Assembly Approach`="Single run"

colnames(bwa.counts_coas)=c("Run","Catalogue")
bwa.counts_coas$`Assembly Approach`="Coassembly"

bwa.counts=rbind.data.frame(bwa.counts_sr,bwa.counts_coas)
merged=merge(df_comb,bwa.counts, by="Run")
merged$percmags=merged$Catalogue/merged$Readcount*100
merged$percassemb=merged$Assembly/merged$Readcount*100

write.csv(merged,summary_out, quote=FALSE,row.names=FALSE)

ggplot(merged, aes(x=percmags, y=percassemb, color=`Assembly Approach`)) + geom_point() +xlab("Reads mapping to MAGs (%)") + ylab("Reads mapping to assembly (%)")+theme_classic()+scale_color_manual(breaks=c("Coassembly","Single run"), values=c("#a8ddb5","#c994c7"))+xlim(0,100)+ylim(0,100)
ggsave("data/figures/perassemb_perref.png",width=5,height=5)

R ggplot2 gridExtra From line 1 of plotting/plot_framework.R

library(RColorBrewer)
library(tidyr)
library(ggplot2)

args <- commandArgs(trailingOnly = TRUE)
gtdb.taxa=read.delim(args[1],header=TRUE,stringsAsFactor=FALSE)

ranks = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
gtdb.taxa = separate(data = gtdb.taxa, col = classification, sep = ";", into = ranks)
gtdb.bac = gtdb.taxa[which(gtdb.taxa$Domain == "d__Bacteria"),]


# calculate bacteria proportions
if(exists("gtdb.fi.bac")){
  rm(gtdb.fi.bac)
}
for (r in ranks[2:(length(ranks)-1)]){
  total = nrow(gtdb.bac)   #total number of genomes
  rank.present = table(gtdb.bac[!grepl("__$", gtdb.bac[,r]),r])   #summary of genus
  total.rank = sum(rank.present) #sum of genus counts 
  gtdb.ranks = data.frame(sort(rank.present)[(length(rank.present)-6):length(rank.present)])
  gtdb.ranks$Prop = gtdb.ranks$Freq/total*100
  other.name = paste(tolower(substr(r, 1, 1)), "__Other", sep="")
  novel.name = paste(tolower(substr(r, 1, 1)), "__Novel", sep="")

  other.freq = total.rank-sum(gtdb.ranks$Freq)
  novel.freq=total-total.rank  #novel genomes

  other.prop = other.freq/total*100
  novel.prop = novel.freq/total*100

  other.ranks = data.frame(Var1=other.name, Freq=other.freq, Prop=other.prop)
  novel.ranks = data.frame(Var1=novel.name, Freq=novel.freq, Prop=novel.prop)

  ranks.fi = rbind(other.ranks, gtdb.ranks)
  ranks.fi = rbind(novel.ranks, ranks.fi)

  ranks.fi$Rank = r
  ranks.fi$Colour=c("#999999","#CAB2D6","#A6CEE3","#FF7F00","#FB9A99","#E31A1C","#B2DF8A","#33A02C","#1F78B4")

  if(exists("gtdb.fi.bac")){
    gtdb.fi.bac = rbind(gtdb.fi.bac, ranks.fi)
  } else {
    gtdb.fi.bac = ranks.fi
  }
}

gtdb.fi.bac$Rank=factor(gtdb.fi.bac$Rank,levels=c("Phylum","Class","Order","Family","Genus"))
# plot stacked plot taxa counts
p<-print(ggplot(gtdb.fi.bac, aes(x=Rank, y=Prop, fill=Var1)) 
      + geom_bar(stat="identity", colour="darkgrey", alpha=0.5, size=0.2, width=0.7)
      + theme_bw()
      + ylab("Proportion of species (%)")
      + coord_flip()
      + scale_fill_manual(values=as.vector(gtdb.fi.bac$Colour), name="Taxa")
      #+ guides(fill=FALSE)
      + scale_x_discrete(limits=ranks[(length(ranks)-1):2])
      + theme(legend.position="right")
      + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
      + theme(axis.title.x = element_text(size=14))
      + theme(axis.text.y = element_text(size=12))
      + theme(axis.title.y = element_blank())
      + theme(axis.text.x = element_text(size=12)))



ggsave("data/figures/gtdb_bacteria.png",p,width=15)

R ggplot2 tidyr RColorBrewer From line 16 of plotting/plot_gtdb.R

import argparse
import sys

def ren_fasta(args):
	n = 0
	for line in open(args.fasta_file, "rU"):
		if line[0] == ">":
			name = line.strip("\n").replace(">","")
			n += 1
			print ">%s_%i\t%s" % (args.prefix, n, name)
		else:
			print line.strip("\n")


if __name__ == '__main__':
        parser = argparse.ArgumentParser(description='Rename multifasta file')
        parser.add_argument('-f', dest='fasta_file', help='Input FASTA file')
	parser.add_argument('-p', dest='prefix', help='Header prefix')
        if len(sys.argv) == 1:
                parser.print_help()
                sys.exit()
        else:
                args = parser.parse_args()
                ren_fasta(args)