EukRecover: Pipeline to recover eukaryotic MAGs using CONCOCT, metaBAT2 and EukCC's merging algorithm.
It needs paired-end shotgun metagenomic reads.
Environment
EukRecover requires an environment with Snakemake and metaWRAP.
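If you set this environment up with conda, a minimal sketch could look like the lines below. This installs Snakemake only; metaWRAP has to be made available separately (for example following its own installation instructions), so treat the exact package set as an assumption rather than a tested recipe:

conda create -n eukrecover -c conda-forge -c bioconda snakemake
conda activate eukrecover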
Quickstart
Define your samples in the file samples.csv. This file needs to have the columns project and run to identify each metagenome.
This pipeline does not support co-binning, but feel free to change that.
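A minimal samples.csv might look like this; the project and run values are placeholders and should be replaced with your own identifiers:

project,run
PRJEB00001,ERR0000001
PRJEB00001,ERR0000002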
Clone this repository wherever you want to run the pipeline:
git clone https://github.com/openpaul/eukrecover/
You can then run Snakemake like so:
snakemake --use-singularity
The pipeline uses Docker Hub to fetch all tools, so make sure you have Singularity installed.
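For a concrete run you can add standard Snakemake options, for example to set the number of cores and to bind a database directory into the containers; the core count and path below are placeholders:

snakemake --use-singularity --cores 16 --singularity-args "--bind /path/to/databases"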
Prepare databases
The pipeline will set up databases for you, but if you already have an EukCC or a BUSCO 5 database, you can use it by specifying its location in the file config/config.yaml.
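As a sketch, the database entries in config/config.yaml could look like the lines below; all values are placeholders. The key names eukcc_db and busco_db are assumptions for illustration (check the shipped config file for the actual names), while mincontiglength is referenced by the binning rule:

eukcc_db: /path/to/eukcc2_db_ver_1.1
busco_db: /path/to/busco_downloads
mincontiglength: 1500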
Output:
In the folder results you will find a folder MAGs, which contains a folder fa with the actual MAG fastas. In addition you will find stats for each MAG in the table QC.csv.
This table contains the following columns:
name,eukcc_compl,eukcc_cont,BUSCO_C,BUSCO_M,BUSCO_D,BUSCO_F,BUSCO_tax,N50,bp
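To pull out high-quality MAGs from this table you can filter on the EukCC columns, for example with awk; the completeness and contamination thresholds below are only an illustration, not part of the pipeline:

awk -F, 'NR==1 || ($2 >= 90 && $3 <= 5)' QC.csv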
Code Snippets
shell:
    """
    gunzip -c {input.fq1} > {output.fq1}
    gunzip -c {input.fq2} > {output.fq2}
    metawrap binning \
        -o {params.out} \
        -t {threads} \
        -l {config[mincontiglength]} \
        -m {resources.mem_GB} \
        -a {input.scaffolds} \
        --concoct --metabat2 --maxbin2 {output.fq1} {output.fq2}

    # rename bins with name
    cd {params.out}concoct_bins
    for f in *.fa ; do mv -- "$f" "{wildcards.project}_{wildcards.sample}_concoct_$f" || true ; done
    cd {params.out}metabat2_bins
    for f in *.fa ; do mv -- "$f" "{wildcards.project}_{wildcards.sample}_metabat2_$f" || true ; done
    cd {params.out}maxbin2_bins
    for f in *.fa ; do mv -- "$f" "{wildcards.project}_{wildcards.sample}_maxbin2_$f" || true ; done

    # remove unbinned bins
    cd {params.out}
    mkdir unbinned
    mv *_bins/*unbinned.fa unbinned/ || true

    # remove temp files
    cd {params.out}
    rm -r work_files
    """
shell:
    """
    #rm -r {params.out}
    #mkdir -p {params.out}
    metawrap bin_refinement -o {params.out} \
        -A {input.concoct} -B {input.metabat2} \
        -m {resources.mem_GB} \
        -t {threads} \
        -c 50 -x 10 \
        --quick || true
    mv {params.d} {output.out}
    cd {params.out}metawrap_bins
    for f in *.fa ; do mv -- "$f" "{wildcards.project}_{wildcards.sample}_metawrap_$f" || true ; done
    """
shell:
    """
    mkdir -p {params.d}
    cd {params.d}
    wget -r -np -R "index.html*" https://busco-data.ezlab.org/v5/data/
    mv busco-data.ezlab.org/v5/data/* .
    rm -r busco-data.ezlab.org
    #extract all gzipped files
    cd lineages
    find . -type f -iname "*tar.gz" -exec rm "{{}}" \;
    cd ../placement_files
    find . -type f -iname "*tar.gz" -exec rm "{{}}" \;
    cd ../information
    find . -type f -iname "*tar.gz" -exec rm "{{}}" \;
    """
shell:
    """
    FNA=$(realpath {input.fa})
    DB=$(realpath {input.db})
    mkdir -p {params.d}
    if [ ! -z "$(ls -A {params.d})" ]; then
        echo "Folder not empty. Deleting"
        rm -rf {params.d}
    fi
    mkdir -p {params.d}
    cd {params.d}
    busco --offline \
        -i $FNA \
        -m genome \
        -o out \
        --auto-lineage-euk \
        --download_path $DB \
        -c {threads} || touch out/short_summary.specific_failed.out.txt
    cp out/short_summary.specific*.out.txt short_summary.specific.txt
    rm -r out
    """
shell:
    """
    minimap2 -ax sr -t {threads} {input.fasta} {input.fq1} {input.fq2} | samtools view -q 20 -Sb - | \
        samtools sort -@ {threads} -O bam - -o {output.bam}
    samtools index {output.bam}

    # cmseq
    breadth_depth.py \
        --combine \
        --mincov {params.mincov} \
        {output.bam} > {output.cov}
    """
shell:
    """
    mkdir -p {params.out}
    rm -r {params.out}
    mkdir -p {params.out}
    BD=$(pwd)
    # prepare drep quality file
    # keep only bins where completeness - 5*contamination >= 50
    echo "genome,completeness,contamination" > {output.quality}
    cat {input.concoct_eukcc} {input.metabat2_eukcc} |\
        awk '{{if($2 - 5*$3 >=50){{print $0}}}}' |\
        sort -k 2,3 -n |\
        tr '\\t' ',' >> {output.quality}
    if [[ $(wc -l <{output.quality}) -lt 2 ]]; then
        touch {output.quality}
        touch {output.genomes}
        mkdir -p {output.input_bins} || true
        mkdir -p {output.bins} || true
        exit 0
    fi
    cat {output.quality}

    mkdir {output.input_bins}
    cd {output.input_bins}
    ln -s {input.concoct}/*.fa . || true
    ln -s {input.concoct_merged}/*.fa . || true
    ln -s {input.metabat2}/*.fa . || true
    ln -s {input.metabat2_merged}/*.fa . || true
    ls
    ls > ../genomes.txt
    cd ..
    cat quality.csv | grep -v "genome,completeness,contaminatio" |\
        cut -f 1 -d , > .f
    grep -f .f genomes.txt | sed -e 's/^/input_genomes\//' > .g
    mv .g genomes.txt
    rm .f
    cd $BD

    mkdir -p {params.out}
    cd {params.out}
    cat genomes.txt
    if [[ $(wc -l <genomes.txt) -lt 2 ]]; then
        echo "No genomes"
        cp -r input_genomes dereplicated_genomes
    else
        dRep dereplicate -p {threads} \
            . -g genomes.txt \
            -pa 0.80 -sa 0.99 -nc 0.40 \
            -cm larger \
            --genomeInfo quality.csv \
            -comp 49 -con 21
    fi
    """
shell:
    """
    mkdir -p {params.out}
    rm -r {params.out}
    mkdir -p {params.out}
    BD=$(pwd)
    cat {input.genomes} | sed -e 's/^/..\/raw\//' > {output.genomes}
    mkdir -p {params.out}
    cd {params.out}
    N=$(wc -l genomes.txt)
    if [[ $(wc -l <genomes.txt) -lt 2 ]]; then
        cp -r {input.fa} dereplicated_genomes
    else
        dRep dereplicate -p {threads} \
            . -g genomes.txt \
            -pa 0.80 -sa 0.95 -nc 0.40 \
            -cm larger \
            --genomeInfo {input.quality} \
            -comp 49 -con 21
    fi
    ls dereplicated_genomes > MAGs.txt
    """
shell:
    """
    minimap2 -ax sr -t {threads} {input.scaffolds} {input.fq1} {input.fq2} | samtools view -q 20 -Sb - | \
        samtools sort -@ {threads} -O bam - -o {output.bam}
    samtools index {output.bam}
    """
shell:
    """
    # if no bins
    if [[ $( ls {input.bindir}/*.fa | wc -l ) -lt 1 ]]; then
        touch {output.links}
    else
        python3 scripts/binlinks.py --ANI 99 --within 1500 \
            --out {output.links} {input.bindir} {input.bam}
    fi
    """
shell:
    """
    wget -O {params.tar} http://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc2_db_ver_1.1.tar.gz
    cd {params.d}
    tar -xzvf eukcc2_db_ver_1.1.tar.gz
    """
shell:
    """
    rm -r {params.out}workdir || true
    # if no bins
    if [[ $( ls {input.bindir} | wc -l ) -lt 1 ]]; then
        touch {output.csv}
        mkdir -p {output.bins}
        exit 0
    fi
    eukcc --debug folder \
        --improve_percent 10 \
        --n_combine 1 \
        --threads {threads} \
        --improve_ratio 5 \
        --links {input.links} \
        --min_links 100 \
        --suffix .fa \
        --db {input.db} \
        --out {params.out} \
        --prefix {wildcards.project}_{wildcards.sample}_{wildcards.binner}_merged. \
        {input.bindir}
    rm -r {params.out}/refine_workdir || true
    """
shell:
    """
    rm -r {params.out}workdir || true
    eukcc --debug single \
        --threads {threads} \
        --db {input.db} \
        --out {params.out} \
        {input.fa}
    cp {output.tsv} {output.csv}
    """
import pysam
from Bio import SeqIO
from collections import defaultdict
import os
import argparse
import logging
import csv


def is_in(read, contig_map, within=1000):
    if read.reference_name not in contig_map.keys():
        return False
    if read.reference_start <= within or read.reference_end <= within:
        return True
    elif read.reference_start > (
        contig_map[read.reference_name] - within
    ) or read.reference_end > (contig_map[read.reference_name] - within):
        return True
    else:
        return False


def keep_read(read, contig_map, within=1000, min_ANI=98, min_cov=0):
    ani = (
        (read.query_alignment_length - read.get_tag("NM"))
        / float(read.query_alignment_length)
        * 100
    )
    cov = read.query_alignment_length / float(read.query_length) * 100

    if ani >= min_ANI and cov >= min_cov and is_in(read, contig_map, within) is True:
        return True
    else:
        return False


def contig_map(bindir, suffix=".fa"):
    m = {}
    for f in os.listdir(bindir):
        if f.endswith(suffix) is False:
            continue
        path = os.path.join(bindir, f)
        with open(path, "r") as handle:
            for record in SeqIO.parse(handle, "fasta"):
                m[record.name] = len(record.seq)
    return m


def bin_map(bindir, suffix=".fa"):
    contigs = defaultdict(str)
    contigs_per_bin = defaultdict(int)
    for f in os.listdir(bindir):
        if f.endswith(suffix) is False:
            continue
        path = os.path.join(bindir, f)
        binname = os.path.basename(f)
        with open(path, "r") as handle:
            for record in SeqIO.parse(handle, "fasta"):
                contigs[record.name] = binname
                contigs_per_bin[binname] += 1
    return contigs, contigs_per_bin


def read_pair_generator(bam):
    """
    Generate read pairs in a BAM file or within a region string.
    Reads are added to read_dict until a pair is found.
    From: https://www.biostars.org/p/306041/
    """
    read_dict = defaultdict(lambda: [None, None])
    for read in bam.fetch():
        if not read.is_paired or read.is_secondary or read.is_supplementary:
            continue
        qname = read.query_name
        if qname not in read_dict:
            if read.is_read1:
                read_dict[qname][0] = read
            else:
                read_dict[qname][1] = read
        else:
            if read.is_read1:
                yield read, read_dict[qname][1]
            else:
                yield read_dict[qname][0], read
            del read_dict[qname]


def main():
    # set arguments
    # arguments are passed to classes
    parser = argparse.ArgumentParser(
        description="Evaluate completeness and contamination of a MAG."
    )
    parser.add_argument("bindir", type=str, help="Run script on these bins")
    parser.add_argument(
        "bam",
        type=str,
        help="Bam with all reads aligned against all contigs making up the bins",
    )
    parser.add_argument(
        "--out",
        "-o",
        type=str,
        required=False,
        help="Path to output table (Default: links.csv)",
        default="links.csv",
    )
    parser.add_argument(
        "--ANI", type=float, required=False, help="ANI of matching read", default=99
    )
    parser.add_argument(
        "--within",
        type=int,
        required=False,
        help="Within this many bp we need the read to map",
        default=1000,
    )
    parser.add_argument(
        "--contigs",
        "-c",
        action="store_true",
        default=False,
        help="Instead of bins print contigs",
    )
    parser.add_argument(
        "--quiet",
        "-q",
        dest="quiet",
        action="store_true",
        default=False,
        help="Silence most output",
    )
    parser.add_argument(
        "--debug",
        "-d",
        action="store_true",
        default=False,
        help="Debug and thus ignore safety",
    )
    args = parser.parse_args()

    # define logging
    logLevel = logging.INFO
    if args.quiet:
        logLevel = logging.WARNING
    elif args.debug:
        logLevel = logging.DEBUG
    logging.basicConfig(
        format="%(asctime)s %(message)s",
        datefmt="%d-%m-%Y %H:%M:%S: ",
        level=logLevel,
    )

    bindir = args.bindir

    samfile = pysam.AlignmentFile(args.bam, "rb")

    cm = contig_map(bindir)
    bm, contigs_per_bin = bin_map(bindir)

    link_table = defaultdict(lambda: defaultdict(int))
    bin_table = defaultdict(lambda: defaultdict(int))

    # generate link table
    logging.info("Parsing Bam file. This can take a few moments")
    for read, mate in read_pair_generator(samfile):
        if keep_read(read, cm, args.within, min_ANI=args.ANI) and keep_read(
            mate, cm, args.within, min_ANI=args.ANI
        ):
            # fill in the table
            link_table[read.reference_name][mate.reference_name] += 1
            if read.reference_name != mate.reference_name:
                link_table[mate.reference_name][read.reference_name] += 1

    # generate bin table
    for contig_1, dic in link_table.items():
        for contig_2, links in dic.items():
            bin_table[bm[contig_1]][bm[contig_2]] += 1

    out_data = []
    logging.debug("Constructing output dict")
    if args.contigs:
        for contig_1, linked in link_table.items():
            for contig_2, links in linked.items():
                out_data.append(
                    {
                        "bin_1": bm[contig_1],
                        "bin_2": bm[contig_2],
                        "contig_1": contig_1,
                        "contig_2": contig_2,
                        "links": links,
                        "bin_1_contigs": contigs_per_bin[bm[contig_1]],
                        "bin_2_contigs": contigs_per_bin[bm[contig_2]],
                    }
                )
    else:
        for bin_1, dic in bin_table.items():
            for bin_2, links in dic.items():
                out_data.append({"bin_1": bin_1, "bin_2": bin_2, "links": links})

    # results
    logging.info("Writing output")
    with open(args.out, "w") as fout:
        cout = csv.DictWriter(fout, fieldnames=list(out_data[0].keys()))
        cout.writeheader()
        for row in out_data:
            cout.writerow(row)


if __name__ == "__main__":
    main()
import os
import argparse
import subprocess
import logging
import tempfile
import gzip
import glob
import csv


# backup fasta handler, so we can use readonly directories
class fa_class:
    def __init__(self, seq, name, long_name):
        self.seq = seq
        self.name = name
        self.long_name = long_name

    def __str__(self):
        return self.seq

    def __len__(self):
        return len(self.seq)


def Fasta(path):
    """
    Iterator for fasta files
    """
    entry = False

    with open(path) as fin:
        for line in fin:
            if line.startswith(">"):
                if entry is not False:
                    entry.seq = "".join(entry.seq)
                    yield entry
                # define new entry
                long_name = line.strip()[1:]
                name = long_name.split()[0]
                entry = fa_class([], name, long_name)
            else:
                entry.seq.append(line.strip())
        # yield last one
        entry.seq = "".join(entry.seq)
        yield entry


def N50(l, x=0.5):
    l = sorted(l)
    l.reverse()
    t = sum(l) * x
    for i, x in enumerate(l):
        if i == 0:
            cs = x
        else:
            cs = cs + x
        if cs >= t:
            return x


def genome_stats(path):
    lengths = [len(x) for x in Fasta(path)]
    return {"N50": N50(lengths), "bp": sum(lengths), "contigs": len(lengths)}


def eukcc_score(path):
    with open(path) as fin:
        for row in csv.DictReader(fin, delimiter="\t"):
            return row


def busco_score(path):
    d = {
        "BUSCO_C": None,
        "BUSCO_M": None,
        "BUSCO_D": None,
        "BUSCO_S": None,
        "BUSCO_F": None,
    }
    with open(path) as fin:
        for line in fin:
            if line.strip().startswith("C:"):
                elem = line.strip().replace("[", ",").replace("]", "").split(",")
                for e in elem:
                    x = e.split(":")
                    d["BUSCO_{}".format(x[0])] = x[1].replace("%", "")
                # this is the last info we need
                break
            if line.strip().startswith("# The lineage dataset is: "):
                d["BUSCO_lineage"] = line.strip().split()[5]
    return d


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--output", help="path for the output table", default="qc.csv", type=str
    )
    parser.add_argument("--mags", help="path to mag folder", type=str)
    parser.add_argument("--qc_dir", help="path to qc folder", type=str)
    parser.add_argument(
        "--rerun",
        action="store_true",
        help="rerun even if output exists",
        default=False,
    )
    parser.add_argument(
        "--quiet", action="store_true", help="suppress information", default=False
    )
    parser.add_argument(
        "--debug", action="store_true", help="Make it more verbose", default=False
    )
    args = parser.parse_args()

    # define logging
    logLevel = logging.INFO
    if args.quiet:
        logLevel = logging.WARNING
    elif args.debug:
        logLevel = logging.DEBUG
    logging.basicConfig(
        format="%(asctime)s %(message)s", datefmt="%d/%m/%Y %H:%M:%S: ", level=logLevel
    )

    # find mags
    mags = glob.glob("{}/*.fa".format(args.mags))
    # for each mag learn eukcc and BUSCO values
    # also determine N50
    fields = [
        "fasta",
        "completeness",
        "contamination",
        "BUSCO_C",
        "BUSCO_M",
        "BUSCO_D",
        "BUSCO_S",
        "BUSCO_F",
        "BUSCO_n",
        "BUSCO_lineage",
        "N50",
        "bp",
        "contigs",
    ]
    with open(args.output, "w") as outfile:
        cout = csv.DictWriter(outfile, fieldnames=fields, extrasaction="ignore")
        cout.writeheader()
        for mag in mags:
            name = os.path.basename(mag).replace(".fa", "")
            # learn eukcc score
            eukcc_p = os.path.join(args.qc_dir, "eukcc", name, "eukcc.csv")
            busco_p = os.path.join(
                args.qc_dir, "busco", name, "short_summary.specific.txt"
            )
            stat = genome_stats(mag)
            eukcc = eukcc_score(eukcc_p)
            stat = {**stat, **eukcc}
            stat["fasta"] = os.path.basename(mag)
            busco = busco_score(busco_p)
            stat = {**stat, **busco}

            cout.writerow(stat)
import argparse
import sys
import random
import re

bases = {"R": ["A", "G"], "Y": ["C", "T"], "K": ["G", "T"], "M": ["A", "C"],
         "S": ["C", "G"], "W": ["A", "T"], "B": ["C", "G", "T"],
         "D": ["A", "G", "T"], "H": ["A", "C", "T"], "V": ["A", "C", "G"],
         "N": ["A", "G", "C", "T"]}
dna = re.compile("[^AGCT]{1}")
vbases = ["A", "C", "G", "T"]


def ren_fasta(args):
    n = 0
    for line in open(args.fasta_file, "r"):
        if line[0] == ">":
            name = line.strip("\n").replace(">", "")
            n += 1
            if args.clean and args.prefix is not None:
                print(">%s_%i" % (args.prefix, n))
            elif args.clean and args.prefix is None:
                print(">%s" % (name.split()[0]))
            else:
                print(">%s_%i\t%s" % (args.prefix, n, name))
        else:
            print(line.strip("\n"))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Rename multifasta file')
    parser.add_argument('-f', dest='fasta_file', help='Input FASTA file')
    parser.add_argument('-p', dest='prefix', help='Header prefix', default=None)
    parser.add_argument('--clean', help="Retain only new header (default: False)",
                        default=False, action="store_true")
    if len(sys.argv) == 1:
        parser.print_help()
        sys.exit()
    else:
        args = parser.parse_args()
        ren_fasta(args)
shell:
    """
    mkdir -p {output}
    cp {input} {output}
    """
shell:
    """
    touch {output}
    """
run:
    import shutil
    import glob
    import os
    import csv  # needed for the DictWriter/DictReader below

    os.makedirs(output.fa, exist_ok=True)
    with open(output.genomes, "w") as fout:
        for folder in input.dirs:
            for fa in glob.glob("{}/*.fa".format(folder)):
                shutil.copy(fa, output.fa)
                fout.write("{}\n".format(os.path.basename(fa)))
    with open(output.quality, "w") as fout:
        cout = csv.DictWriter(fout, fieldnames=['genome', 'completeness', 'contamination'])
        cout.writeheader()
        for f in input.quality:
            with open(f) as fin:
                for row in csv.DictReader(fin):
                    cout.writerow(row)
shell:
    """
    python3 scripts/rename_multifasta_prefix.py --clean \
        -p {wildcards.MAG} \
        -f {input} > {output}
    """
shell:
    """
    python3 scripts/create_qc_table.py \
        --mags {input.raw} \
        --qc_dir {wildcards.data}/qc \
        --output {output.csv}
    """