EukRecover: Pipeline to recover eukaryotic MAGs using CONCOCT, metaBAT2 and EukCC's merging algorithm.
It needs paired-end shotgun metagenomic reads.
Environment
EukRecover requires an environment with Snakemake and metaWRAP.
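If you set this environment up with conda, a minimal sketch could look like the lines below. This installs Snakemake only; metaWRAP has to be made available separately (for example following its own installation instructions), so treat the exact package set as an assumption rather than a tested recipe:

conda create -n eukrecover -c conda-forge -c bioconda snakemake
conda activate eukrecover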
Quickstart
Define your samples in the file samples.csv. This file needs to have the columns project and run to identify each metagenome.
This pipeline does not support co-binning, but feel free to change that.
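A minimal samples.csv might look like this; the project and run values are placeholders and should be replaced with your own identifiers:

project,run
PRJEB00001,ERR0000001
PRJEB00001,ERR0000002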
Clone this repository wherever you want to run the pipeline:
git clone https://github.com/openpaul/eukrecover/
You can then run Snakemake like so:
snakemake --use-singularity
The pipeline uses Docker Hub to fetch all tools, so make sure you have Singularity installed.
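For a concrete run you can add standard Snakemake options, for example to set the number of cores and to bind a database directory into the containers; the core count and path below are placeholders:

snakemake --use-singularity --cores 16 --singularity-args "--bind /path/to/databases"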
Prepare databases
The pipeline will set up databases for you, but if you already have an EukCC or a BUSCO 5 database, you can use it by specifying its location in the file config/config.yaml.
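As a sketch, the database entries in config/config.yaml could look like the lines below; all values are placeholders. The key names eukcc_db and busco_db are assumptions for illustration (check the shipped config file for the actual names), while mincontiglength is referenced by the binning rule:

eukcc_db: /path/to/eukcc2_db_ver_1.1
busco_db: /path/to/busco_downloads
mincontiglength: 1500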
Output:
In the folder results you will find a folder MAGs, which contains a folder fa with the actual MAG fastas. In addition you will find stats for each MAG in the table QC.csv.
This table contains the following columns:
name,eukcc_compl,eukcc_cont,BUSCO_C,BUSCO_M,BUSCO_D,BUSCO_F,BUSCO_tax,N50,bp
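To pull out high-quality MAGs from this table you can filter on the EukCC columns, for example with awk; the completeness and contamination thresholds below are only an illustration, not part of the pipeline:

awk -F, 'NR==1 || ($2 >= 90 && $3 <= 5)' QC.csv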
Code Snippets
shell:
    """
    gunzip -c {input.fq1} > {output.fq1}
    gunzip -c {input.fq2} > {output.fq2}
    metawrap binning \
        -o {params.out} \
        -t {threads} \
        -l {config[mincontiglength]} \
        -m {resources.mem_GB} \
        -a {input.scaffolds} \
        --concoct --metabat2 --maxbin2 {output.fq1} {output.fq2}

    # rename bins with name
    cd {params.out}concoct_bins
    for f in *.fa ; do mv -- "$f" "{wildcards.project}_{wildcards.sample}_concoct_$f" || true ; done
    cd {params.out}metabat2_bins
    for f in *.fa ; do mv -- "$f" "{wildcards.project}_{wildcards.sample}_metabat2_$f" || true ; done
    cd {params.out}maxbin2_bins
    for f in *.fa ; do mv -- "$f" "{wildcards.project}_{wildcards.sample}_maxbin2_$f" || true ; done

    # remove unbinned bins
    cd {params.out}
    mkdir unbinned
    mv *_bins/*unbinned.fa unbinned/ || true

    # remove temp files
    cd {params.out}
    rm -r work_files
    """
shell:
    """
    #rm -r {params.out}
    #mkdir -p {params.out}
    metawrap bin_refinement -o {params.out} \
        -A {input.concoct} -B {input.metabat2} \
        -m {resources.mem_GB} \
        -t {threads} \
        -c 50 -x 10 \
        --quick || true
    mv {params.d} {output.out}
    cd {params.out}metawrap_bins
    for f in *.fa ; do mv -- "$f" "{wildcards.project}_{wildcards.sample}_metawrap_$f" || true ; done
    """
shell:
    """
    mkdir -p {params.d}
    cd {params.d}
    wget -r -np -R "index.html*" https://busco-data.ezlab.org/v5/data/
    mv busco-data.ezlab.org/v5/data/* .
    rm -r busco-data.ezlab.org
    #extract all gzipped files
    cd lineages
    find . -type f -iname "*tar.gz" -exec rm "{{}}" \;
    cd ../placement_files
    find . -type f -iname "*tar.gz" -exec rm "{{}}" \;
    cd ../information
    find . -type f -iname "*tar.gz" -exec rm "{{}}" \;
    """
shell:
    """
    FNA=$(realpath {input.fa})
    DB=$(realpath {input.db})
    mkdir -p {params.d}
    if [ ! -z "$(ls -A {params.d})" ]; then
        echo "Folder not empty. Deleting"
        rm -rf {params.d}
    fi
    mkdir -p {params.d}
    cd {params.d}
    busco --offline \
        -i $FNA \
        -m genome \
        -o out \
        --auto-lineage-euk \
        --download_path $DB \
        -c {threads} || touch out/short_summary.specific_failed.out.txt
    cp out/short_summary.specific*.out.txt short_summary.specific.txt
    rm -r out
    """
shell:
    """
    minimap2 -ax sr -t {threads} {input.fasta} {input.fq1} {input.fq2} | samtools view -q 20 -Sb - | \
        samtools sort -@ {threads} -O bam - -o {output.bam}
    samtools index {output.bam}

    # cmseq
    breadth_depth.py \
        --combine \
        --mincov {params.mincov} \
        {output.bam} > {output.cov}
    """
shell:
    """
    mkdir -p {params.out}
    rm -r {params.out}
    mkdir -p {params.out}
    BD=$(pwd)
    # prepare drep quality file
    # keep only bins where completeness - 5*contamination >= 50
    echo "genome,completeness,contamination" > {output.quality}
    cat {input.concoct_eukcc} {input.metabat2_eukcc} |\
        awk '{{if($2 - 5*$3 >=50){{print $0}}}}' |\
        sort -k 2,3 -n |\
        tr '\\t' ',' >> {output.quality}
    if [[ $(wc -l <{output.quality}) -lt 2 ]]; then
        touch {output.quality}
        touch {output.genomes}
        mkdir -p {output.input_bins} || true
        mkdir -p {output.bins} || true
        exit 0
    fi
    cat {output.quality}

    mkdir {output.input_bins}
    cd {output.input_bins}
    ln -s {input.concoct}/*.fa . || true
    ln -s {input.concoct_merged}/*.fa . || true
    ln -s {input.metabat2}/*.fa . || true
    ln -s {input.metabat2_merged}/*.fa . || true
    ls
    ls > ../genomes.txt
    cd ..
    cat quality.csv | grep -v "genome,completeness,contaminatio" |\
        cut -f 1 -d , > .f
    grep -f .f genomes.txt | sed -e 's/^/input_genomes\//' > .g
    mv .g genomes.txt
    rm .f
    cd $BD

    mkdir -p {params.out}
    cd {params.out}
    cat genomes.txt
    if [[ $(wc -l <genomes.txt) -lt 2 ]]; then
        echo "No genomes"
        cp -r input_genomes dereplicated_genomes
    else
        dRep dereplicate -p {threads} \
            . -g genomes.txt \
            -pa 0.80 -sa 0.99 -nc 0.40 \
            -cm larger \
            --genomeInfo quality.csv \
            -comp 49 -con 21
    fi
    """
shell:
    """
    mkdir -p {params.out}
    rm -r {params.out}
    mkdir -p {params.out}
    BD=$(pwd)
    cat {input.genomes} | sed -e 's/^/..\/raw\//' > {output.genomes}
    mkdir -p {params.out}
    cd {params.out}
    N=$(wc -l genomes.txt)
    if [[ $(wc -l <genomes.txt) -lt 2 ]]; then
        cp -r {input.fa} dereplicated_genomes
    else
        dRep dereplicate -p {threads} \
            . -g genomes.txt \
            -pa 0.80 -sa 0.95 -nc 0.40 \
            -cm larger \
            --genomeInfo {input.quality} \
            -comp 49 -con 21
    fi
    ls dereplicated_genomes > MAGs.txt
    """
shell:
    """
    minimap2 -ax sr -t {threads} {input.scaffolds} {input.fq1} {input.fq2} | samtools view -q 20 -Sb - | \
        samtools sort -@ {threads} -O bam - -o {output.bam}
    samtools index {output.bam}
    """
shell:
    """
    # if no bins
    if [[ $( ls {input.bindir}/*.fa | wc -l ) -lt 1 ]]; then
        touch {output.links}
    else
        python3 scripts/binlinks.py --ANI 99 --within 1500 \
            --out {output.links} {input.bindir} {input.bam}
    fi
    """
shell:
    """
    wget -O {params.tar} http://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc2_db_ver_1.1.tar.gz
    cd {params.d}
    tar -xzvf eukcc2_db_ver_1.1.tar.gz
    """
shell:
    """
    rm -r {params.out}workdir || true
    # if no bins
    if [[ $( ls {input.bindir} | wc -l ) -lt 1 ]]; then
        touch {output.csv}
        mkdir -p {output.bins}
        exit 0
    fi
    eukcc --debug folder \
        --improve_percent 10 \
        --n_combine 1 \
        --threads {threads} \
        --improve_ratio 5 \
        --links {input.links} \
        --min_links 100 \
        --suffix .fa \
        --db {input.db} \
        --out {params.out} \
        --prefix {wildcards.project}_{wildcards.sample}_{wildcards.binner}_merged. \
        {input.bindir}
    rm -r {params.out}/refine_workdir || true
    """
shell:
    """
    rm -r {params.out}workdir || true
    eukcc --debug single \
        --threads {threads} \
        --db {input.db} \
        --out {params.out} \
        {input.fa}
    cp {output.tsv} {output.csv}
    """
import pysam
from Bio import SeqIO
from collections import defaultdict
import os
import argparse
import logging
import csv


def is_in(read, contig_map, within=1000):
    if read.reference_name not in contig_map.keys():
        return False
    if read.reference_start <= within or read.reference_end <= within:
        return True
    elif read.reference_start > (
        contig_map[read.reference_name] - within
    ) or read.reference_end > (contig_map[read.reference_name] - within):
        return True
    else:
        return False


def keep_read(read, contig_map, within=1000, min_ANI=98, min_cov=0):
    ani = (
        (read.query_alignment_length - read.get_tag("NM"))
        / float(read.query_alignment_length)
        * 100
    )
    cov = read.query_alignment_length / float(read.query_length) * 100

    if ani >= min_ANI and cov >= min_cov and is_in(read, contig_map, within) is True:
        return True
    else:
        return False


def contig_map(bindir, suffix=".fa"):
    m = {}
    for f in os.listdir(bindir):
        if f.endswith(suffix) is False:
            continue
        path = os.path.join(bindir, f)
        with open(path, "r") as handle:
            for record in SeqIO.parse(handle, "fasta"):
                m[record.name] = len(record.seq)
    return m


def bin_map(bindir, suffix=".fa"):
    contigs = defaultdict(str)
    contigs_per_bin = defaultdict(int)
    for f in os.listdir(bindir):
        if f.endswith(suffix) is False:
            continue
        path = os.path.join(bindir, f)
        binname = os.path.basename(f)
        with open(path, "r") as handle:
            for record in SeqIO.parse(handle, "fasta"):
                contigs[record.name] = binname
                contigs_per_bin[binname] += 1
    return contigs, contigs_per_bin


def read_pair_generator(bam):
    """
    Generate read pairs in a BAM file or within a region string.
    Reads are added to read_dict until a pair is found.
    From: https://www.biostars.org/p/306041/
    """
    read_dict = defaultdict(lambda: [None, None])
    for read in bam.fetch():
        if not read.is_paired or read.is_secondary or read.is_supplementary:
            continue
        qname = read.query_name
        if qname not in read_dict:
            if read.is_read1:
                read_dict[qname][0] = read
            else:
                read_dict[qname][1] = read
        else:
            if read.is_read1:
                yield read, read_dict[qname][1]
            else:
                yield read_dict[qname][0], read
            del read_dict[qname]


def main():
    # set arguments
    # arguments are passed to classes
    parser = argparse.ArgumentParser(
        description="Evaluate completeness and contamination of a MAG."
    )
    parser.add_argument("bindir", type=str, help="Run script on these bins")
    parser.add_argument(
        "bam",
        type=str,
        help="Bam with all reads aligned against all contigs making up the bins",
    )
    parser.add_argument(
        "--out",
        "-o",
        type=str,
        required=False,
        help="Path to output table (Default: links.csv)",
        default="links.csv",
    )
    parser.add_argument(
        "--ANI", type=float, required=False, help="ANI of matching read", default=99
    )
    parser.add_argument(
        "--within",
        type=int,
        required=False,
        help="Within this many bp we need the read to map",
        default=1000,
    )
    parser.add_argument(
        "--contigs",
        "-c",
        action="store_true",
        default=False,
        help="Instead of bins print contigs",
    )
    parser.add_argument(
        "--quiet",
        "-q",
        dest="quiet",
        action="store_true",
        default=False,
        help="Silence most output",
    )
    parser.add_argument(
        "--debug",
        "-d",
        action="store_true",
        default=False,
        help="Debug and thus ignore safety",
    )
    args = parser.parse_args()

    # define logging
    logLevel = logging.INFO
    if args.quiet:
        logLevel = logging.WARNING
    elif args.debug:
        logLevel = logging.DEBUG
    logging.basicConfig(
        format="%(asctime)s %(message)s",
        datefmt="%d-%m-%Y %H:%M:%S: ",
        level=logLevel,
    )

    bindir = args.bindir

    samfile = pysam.AlignmentFile(args.bam, "rb")

    cm = contig_map(bindir)
    bm, contigs_per_bin = bin_map(bindir)

    link_table = defaultdict(lambda: defaultdict(int))
    bin_table = defaultdict(lambda: defaultdict(int))

    # generate link table
    logging.info("Parsing Bam file. This can take a few moments")
    for read, mate in read_pair_generator(samfile):
        if keep_read(read, cm, args.within, min_ANI=args.ANI) and keep_read(
            mate, cm, args.within, min_ANI=args.ANI
        ):
            # fill in the table
            link_table[read.reference_name][mate.reference_name] += 1
            if read.reference_name != mate.reference_name:
                link_table[mate.reference_name][read.reference_name] += 1

    # generate bin table
    for contig_1, dic in link_table.items():
        for contig_2, links in dic.items():
            bin_table[bm[contig_1]][bm[contig_2]] += 1

    out_data = []
    logging.debug("Constructing output dict")
    if args.contigs:
        for contig_1, linked in link_table.items():
            for contig_2, links in linked.items():
                out_data.append(
                    {
                        "bin_1": bm[contig_1],
                        "bin_2": bm[contig_2],
                        "contig_1": contig_1,
                        "contig_2": contig_2,
                        "links": links,
                        "bin_1_contigs": contigs_per_bin[bm[contig_1]],
                        "bin_2_contigs": contigs_per_bin[bm[contig_2]],
                    }
                )
    else:
        for bin_1, dic in bin_table.items():
            for bin_2, links in dic.items():
                out_data.append({"bin_1": bin_1, "bin_2": bin_2, "links": links})

    # results
    logging.info("Writing output")
    with open(args.out, "w") as fout:
        cout = csv.DictWriter(fout, fieldnames=list(out_data[0].keys()))
        cout.writeheader()
        for row in out_data:
            cout.writerow(row)


if __name__ == "__main__":
    main()
import os
import argparse
import subprocess
import logging
import tempfile
import gzip
import glob
import csv


# backup fasta handler, so we can use readonly directories
class fa_class:
    def __init__(self, seq, name, long_name):
        self.seq = seq
        self.name = name
        self.long_name = long_name

    def __str__(self):
        return self.seq

    def __len__(self):
        return len(self.seq)


def Fasta(path):
    """
    Iterator for fasta files
    """
    entry = False

    with open(path) as fin:
        for line in fin:
            if line.startswith(">"):
                if entry is not False:
                    entry.seq = "".join(entry.seq)
                    yield entry
                # define new entry
                long_name = line.strip()[1:]
                name = long_name.split()[0]
                entry = fa_class([], name, long_name)
            else:
                entry.seq.append(line.strip())
        # yield last one
        entry.seq = "".join(entry.seq)
        yield entry


def N50(l, x=0.5):
    l = sorted(l)
    l.reverse()
    t = sum(l) * x
    for i, x in enumerate(l):
        if i == 0:
            cs = x
        else:
            cs = cs + x
        if cs >= t:
            return x


def genome_stats(path):
    lengths = [len(x) for x in Fasta(path)]
    return {"N50": N50(lengths), "bp": sum(lengths), "contigs": len(lengths)}


def eukcc_score(path):
    with open(path) as fin:
        for row in csv.DictReader(fin, delimiter="\t"):
            return row


def busco_score(path):
    d = {
        "BUSCO_C": None,
        "BUSCO_M": None,
        "BUSCO_D": None,
        "BUSCO_S": None,
        "BUSCO_F": None,
    }
    with open(path) as fin:
        for line in fin:
            if line.strip().startswith("C:"):
                elem = line.strip().replace("[", ",").replace("]", "").split(",")
                for e in elem:
                    x = e.split(":")
                    d["BUSCO_{}".format(x[0])] = x[1].replace("%", "")
                # this is the last info we need
                break
            if line.strip().startswith("# The lineage dataset is: "):
                d["BUSCO_lineage"] = line.strip().split()[5]
    return d


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--output", help="path for the output table", default="qc.csv", type=str
    )
    parser.add_argument("--mags", help="path to mag folder", type=str)
    parser.add_argument("--qc_dir", help="path to qc folder", type=str)
    parser.add_argument(
        "--rerun",
        action="store_true",
        help="rerun even if output exists",
        default=False,
    )
    parser.add_argument(
        "--quiet", action="store_true", help="suppress information", default=False
    )
    parser.add_argument(
        "--debug", action="store_true", help="Make it more verbose", default=False
    )
    args = parser.parse_args()

    # define logging
    logLevel = logging.INFO
    if args.quiet:
        logLevel = logging.WARNING
    elif args.debug:
        logLevel = logging.DEBUG
    logging.basicConfig(
        format="%(asctime)s %(message)s", datefmt="%d/%m/%Y %H:%M:%S: ", level=logLevel
    )

    # find mags
    mags = glob.glob("{}/*.fa".format(args.mags))
    # for each mag learn eukcc and BUSCO values
    # also determine N50
    fields = [
        "fasta",
        "completeness",
        "contamination",
        "BUSCO_C",
        "BUSCO_M",
        "BUSCO_D",
        "BUSCO_S",
        "BUSCO_F",
        "BUSCO_n",
        "BUSCO_lineage",
        "N50",
        "bp",
        "contigs",
    ]
    with open(args.output, "w") as outfile:
        cout = csv.DictWriter(outfile, fieldnames=fields, extrasaction="ignore")
        cout.writeheader()
        for mag in mags:
            name = os.path.basename(mag).replace(".fa", "")
            # learn eukcc score
            eukcc_p = os.path.join(args.qc_dir, "eukcc", name, "eukcc.csv")
            busco_p = os.path.join(
                args.qc_dir, "busco", name, "short_summary.specific.txt"
            )
            stat = genome_stats(mag)
            eukcc = eukcc_score(eukcc_p)
            stat = {**stat, **eukcc}
            stat["fasta"] = os.path.basename(mag)
            busco = busco_score(busco_p)
            stat = {**stat, **busco}

            cout.writerow(stat)
import argparse
import sys
import random
import re

bases = {"R": ["A", "G"], "Y": ["C", "T"], "K": ["G", "T"], "M": ["A", "C"],
         "S": ["C", "G"], "W": ["A", "T"], "B": ["C", "G", "T"],
         "D": ["A", "G", "T"], "H": ["A", "C", "T"], "V": ["A", "C", "G"],
         "N": ["A", "G", "C", "T"]}
dna = re.compile("[^AGCT]{1}")
vbases = ["A", "C", "G", "T"]


def ren_fasta(args):
    n = 0
    for line in open(args.fasta_file, "r"):
        if line[0] == ">":
            name = line.strip("\n").replace(">", "")
            n += 1
            if args.clean and args.prefix is not None:
                print(">%s_%i" % (args.prefix, n))
            elif args.clean and args.prefix is None:
                print(">%s" % (name.split()[0]))
            else:
                print(">%s_%i\t%s" % (args.prefix, n, name))
        else:
            print(line.strip("\n"))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Rename multifasta file')
    parser.add_argument('-f', dest='fasta_file', help='Input FASTA file')
    parser.add_argument('-p', dest='prefix', help='Header prefix', default=None)
    parser.add_argument('--clean', help="Retain only new header (default: False)",
                        default=False, action="store_true")
    if len(sys.argv) == 1:
        parser.print_help()
        sys.exit()
    else:
        args = parser.parse_args()
        ren_fasta(args)
shell:
    """
    mkdir -p {output}
    cp {input} {output}
    """
shell:
    """
    touch {output}
    """
run:
    import shutil
    import glob
    import os
    import csv  # needed for the DictWriter/DictReader below

    os.makedirs(output.fa, exist_ok=True)
    with open(output.genomes, "w") as fout:
        for folder in input.dirs:
            for fa in glob.glob("{}/*.fa".format(folder)):
                shutil.copy(fa, output.fa)
                fout.write("{}\n".format(os.path.basename(fa)))
    with open(output.quality, "w") as fout:
        cout = csv.DictWriter(fout, fieldnames=['genome', 'completeness', 'contamination'])
        cout.writeheader()
        for f in input.quality:
            with open(f) as fin:
                for row in csv.DictReader(fin):
                    cout.writerow(row)
shell:
    """
    python3 scripts/rename_multifasta_prefix.py --clean \
        -p {wildcards.MAG} \
        -f {input} > {output}
    """
shell:
    """
    python3 scripts/create_qc_table.py \
        --mags {input.raw} \
        --qc_dir {wildcards.data}/qc \
        --output {output.csv}
    """