# metax

Metagenomic taxonomic classification benchmarking
## Overview

This package contains snakemake workflows, a jupyter notebook, and a python package. This directory can be used as the snakemake workflow directory (the easiest setup), but you can also symlink in the `Snakefile`, `rules/`, and `config.yaml` paths to keep it separate from the data.
## Organization

- The `fastq/` folder should contain input fastq datasets, compressed in `.bz2` format.
- The `db/` folder contains metagenomic databases, which can be downloaded from `gs://metax-bakeoff-2019` using `gsutil`.
- The `src/` folder contains code for classifiers that don't have a conda package available. `src/local/` is a local install prefix akin to `/usr/local/` and contains compiled code (compatible with Ubuntu 16.04). Separate packages are installed under `src/local/stow` if they are not installed to that prefix directly. Packages installed here may need to be recompiled on a different OS. All other packages are installed via `conda`. There is one conda environment, `metax`, based on Python 3, which serves as the default environment; the `metax_py2` environment is based on Python 2, is needed by some packages, and is only used to run specific classifiers.
- The `envs/` directory contains environment yaml files that fully specify these two conda environments, while the `conda_environment*.txt` files are their human-editable counterparts, similar to `requirements.txt` files, with only a subset of critical packages version-pinned and no transitive dependencies included.
- The `rules/` directory contains snakemake rules for each method.
- The `tmp/` folder is the default local folder for storing temporary files needed by various methods.
- The `log/` folder contains stdout/stderr logs.
- The `time/` folder contains `/usr/bin/time -v` logs.
- The `benchmark/` folder contains snakemake benchmarking tsvs.
- The `data/` folder is intended for storing the raw outputs of classifiers, typically with a filename like `<sample_name>.<classifier>.<database>.tsv.gz`. These raw outputs are typically per-read classifications, so they are O(n) in the size of the input fastqs, and compressed.
- The `reports/` folder is intended for storing the taxonomy reports, which typically contain one line per taxon with its abundance or number of classified reads per sample. These files are much smaller, roughly O(1) in size relative to the inputs, and follow a filename format similar to `data/`.
- The `info/` and `summary/` folders contain output summary data for the datasets, such as the `*.total_reads.txt` files with the number of reads in each input fastq, as well as files like `summary/classified_count.tsv` with the number of classified reads when it is not determinable from the output reports.
- The `plotting/` folder contains the `compiled_reports.tsv` master tsv of all classification results, generated by `metax compile`, as well as a few other summary files. It also contains the jupyter notebook that produces the final plots. A sketch of the resulting layout follows this list.
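For orientation, the resulting working directory might look like this (illustrative only; a few files shown, with hypothetical sample and classifier names):

```
.
├── Snakefile
├── config/config.yaml
├── rules/                              # snakemake rules, one file per method
├── envs/                               # conda environment specs
├── fastq/mock1_1.fastq.bz2             # input reads
├── db/refseqc/kraken/                  # databases
├── data/mock1.kraken.refseqc.tsv.gz    # raw per-read classifications
├── reports/mock1.kraken.refseqc.txt    # per-taxon reports
└── plotting/compiled_reports.tsv       # master compiled results
```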
## Configuration

A default `config.yaml` exists in `config/config.yaml`. This config assumes that everything in the workflow is placed relative to this git directory root, including all of the databases, inputs, outputs, code, and non-conda software. The config also specifies the locations of the various databases, so databases that are already present don't have to be downloaded from Google Cloud Storage; this also means that downloading databases and running classifiers are separate `snakemake` invocations. The most practical setting to change is `TMPDIR`, which is `tmp/` by default but can be pointed at another physical drive to improve performance.
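For example, to put temporary files on a separate fast drive (a sketch; `TMPDIR` is the key documented above, the path is illustrative):

```
# config/config.yaml (excerpt)
TMPDIR: /mnt/fast-scratch/tmp    # defaults to tmp/
```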
## Install

- Install miniconda (https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html). Install gsutil (https://cloud.google.com/storage/docs/gsutil_install) for downloading the already hosted databases.
- Create the two conda environments from the `envs/` directory: `conda env create -n metax --file envs/py3.yaml` and `conda env create -n metax_py2 --file envs/py2.yaml`.
- Activate the `metax` conda environment.
- Install this package itself as a python package: `pip install .`.
- Install the remaining software into `src/` using `snakemake download_src`. These packages may need to be recompiled.
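Taken together, a fresh setup might look like this (using `source activate`, as the workflow's own rules do):

```
conda env create -n metax --file envs/py3.yaml
conda env create -n metax_py2 --file envs/py2.yaml
source activate metax
pip install .
snakemake download_src
```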
## Execution

- Download any desired prerequisite databases using a snakemake rule. To download all databases, use `snakemake download_db_all`. To download a specific database, use the general form `snakemake db/[default|refseqc]/[method]`, such as `snakemake db/refseqc/kraken` or `snakemake db/default/kraken`.
- Download existing fastq datasets using `snakemake download_fastq_all`.
- Run snakemake rules to classify samples, using rules such as `snakemake kraken_default_all`. Alternatively, all classifiers can be run on all samples with `snakemake all`, but note the caveats. [^1]
- Run the snakemake rule to aggregate reports, read counts, and classified counts: `snakemake compile_reports`. This generates the `plotting/compiled_reports.tsv` file with all of the samples and classifications.
- Run the jupyter notebook `plotting/Metagenomics Bench.ipynb` to generate all plots.
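Putting the steps together, an end-to-end run restricted to one method might look like:

```
snakemake db/default/kraken      # database for one method
snakemake download_fastq_all     # input datasets
snakemake kraken_default_all     # classify
snakemake compile_reports        # aggregate into plotting/compiled_reports.tsv
```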
## Adding Classifiers

To compare a new classifier against the existing results, the easiest way is to append the new method's classification results to the master `plotting/compiled_reports.tsv` file and then simply run the jupyter notebook. Note that this will not add the compute time/resource benchmarks. The columns in this file are:

- `sample`: Sample name.
- `classifier`: Classifier name.
- `database`: Name of the database, usually 'default' or 'refseqc'.
- `taxid`: NCBI Taxonomy ID identified.
- `cum_abundance`: Cumulative abundance (out of 1) of classified reads assigned to this taxid and its children.
- `cls_cum_abundance`: Cumulative abundance (out of total classified abundance).
- `unique_abundance`: Specific abundance classified to this taxid, not including children.
- `cls_unique_abundance`: Specific abundance (out of total classified abundance).
- `classrank`: The taxonomy rank profiling level setting of the classifier: 'species', 'genus', or 'all' if the classifier assigns abundances to all levels of the taxonomy simultaneously.
- `rank`: Taxonomy rank, e.g. 'species', 'genus'.

Because the NCBI taxonomy tree and IDs are constantly changing, you should use the same taxonomy database used for all of the other methods, which can be acquired with `snakemake db/taxonomy`.
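As an illustration, a few rows of `compiled_reports.tsv` might look like this (values invented for illustration; taxids 561 and 562 are the genus *Escherichia* and the species *E. coli*, and the `cls_*` columns here assume 80% of reads were classified):

```
sample	classifier	database	taxid	cum_abundance	cls_cum_abundance	unique_abundance	cls_unique_abundance	classrank	rank
mock1	kraken	refseqc	561	0.15	0.1875	0.03	0.0375	all	genus
mock1	kraken	refseqc	562	0.12	0.15	0.12	0.15	all	species
```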
To add an additional classifier method end to end, the best way is to mimic an existing method. Create a new snakemake file `rules/newmethod.smk` and add `include: "rules/newmethod.smk"` to the master `Snakefile`. For examples, see the other method snakemake files such as `kaiju.smk`. Ideally you will create snakemake rules so that `snakemake newmethod_default_all` and `snakemake newmethod_refseqc_all` work, as in the sketch below.
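A minimal sketch of what `rules/newmethod.smk` might contain. The `newmethod` and `newmethod-report` commands are hypothetical placeholders; the output paths follow the filename conventions described under Organization:

```
# rules/newmethod.smk -- hypothetical sketch, not a real classifier
rule newmethod_default:
    input:
        'fastq/{sample}.fastq.bz2'
    output:
        data='data/{sample}.newmethod.default.tsv.gz',
        report='reports/{sample}.newmethod.default.txt'
    shell:
        '''
        newmethod --db db/default/newmethod <(lbzip2 -dc {input}) | pigz > {output.data}
        newmethod-report {output.data} > {output.report}
        '''

rule newmethod_default_all:
    input:
        expand('reports/{sample}.newmethod.default.txt',
               sample=config['samples_pe'] + config['samples_se'])
```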
## Parsing Taxonomic Reports

The parser for the raw classifier reports is part of the `metax` python package, in the command `metax compile`. This command processes all of the heterogeneous report files in the `reports/` folder and works by matching filenames to parser classes. This allows custom report-parsing code for different classifiers, but ultimately each parser class has a `parse()` method that yields a python dict for each taxon identified in a fastq file. These dicts contain mandatory keys like `classifier`, `rank`, `taxid`, etc., and are eventually turned into tsv rows, so that all of the rows (taxa × samples) are aggregated into a single harmonized output tsv for ingestion by the jupyter notebook.

In `metax/reporter.py`, create an additional `class NewmethodParser()`. The easiest approach is to subclass an existing parser if your report output has a similar enough format, such as Kraken-like outputs or simple tsvs. For more complicated formats you will need specialized parsing code, which may require a full read of the report if things like abundance cannot be calculated via streaming. For example, here is the `MegablastParser` and the corresponding line where it is registered with the `Matchers` object:
```
class MegablastParser(KrakenParser):
    MATCH_GLOB = '*.megablast.*.txt'
    MATCH_RE = re.compile(r'(?P<fn_prefix>.*)\.megablast\.(?P<database>[^.]*)\.txt')
    CLASSIFIER = 'megablast'
    is_paired_count = False
```

```
class Matchers:
    @classmethod
    def create(cls, processor):
        ...
        m.add('megablast', MegablastParser(processor))
```
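Following that pattern, a new parser for a Kraken-like report might look like this sketch (the attributes mirror the Megablast example above; whether your format can reuse `KrakenParser.parse()` unchanged is an assumption to verify):

```
class NewmethodParser(KrakenParser):
    MATCH_GLOB = '*.newmethod.*.txt'
    MATCH_RE = re.compile(r'(?P<fn_prefix>.*)\.newmethod\.(?P<database>[^.]*)\.txt')
    CLASSIFIER = 'newmethod'
    is_paired_count = False

# and register it in Matchers.create():
#     m.add('newmethod', NewmethodParser(processor))
```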
## Jupyter Notebook Plots

After the `snakemake compile_reports` step, the `Metagenomics Bench.ipynb` notebook also needs to be modified to incorporate additional methods. The `classifiers` global variable in the notebook needs an additional `<classifier_codename>: <formal_classifier_name>` entry for your method. Similarly, the method needs to be added to `refseqc_classifiers` if it has an appropriate refseqc database, and kept in `main_classifiers` if it is to be included on all plots.
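For example, assuming `classifiers` is a dict and the other two globals are lists (a sketch; the names follow the conventions above):

```
classifiers['newmethod'] = 'NewMethod'    # <classifier_codename>: <formal_classifier_name>
refseqc_classifiers.append('newmethod')   # only if a refseqc database exists
main_classifiers.append('newmethod')      # to include it on all plots
```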
## Creating refseqc database

The source fastas for creating the refseqc databases are located at `gs://metax-bakeoff-2019/refseqc/fasta/genomic.fna.gz` and `gs://metax-bakeoff-2019/refseqc/fasta/protein.faa.gz` for the DNA and protein databases, respectively. Precomputed mappings of accessions to taxonomy IDs are located at `gs://metax-bakeoff-2019/refseqc/fasta/genomic.taxids` and `gs://metax-bakeoff-2019/refseqc/fasta/protein.taxids`.
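To fetch these with `gsutil` (destination directory illustrative):

```
gsutil cp gs://metax-bakeoff-2019/refseqc/fasta/genomic.fna.gz gs://metax-bakeoff-2019/refseqc/fasta/genomic.taxids db/refseqc/fasta/
gsutil cp gs://metax-bakeoff-2019/refseqc/fasta/protein.faa.gz gs://metax-bakeoff-2019/refseqc/fasta/protein.taxids db/refseqc/fasta/
```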
## Adding FASTQ datasets

Compress and place the fastqs as `<sample_name>_1.fastq.bz2` and `<sample_name>_2.fastq.bz2` for the two paired ends, and add the sample name to `config.yaml` under `samples_pe`. If the file is single-ended, place it as `<sample_name>.fastq.bz2` instead and add the sample name under `samples_se`. Also add an entry to `config/compile_config.yaml` under `samples`, mapping the `<sample_name>` to a dict of metadata. The only required keys are `sample:`, indicating the desired output name in `compiled_reports.tsv`, and `paired: True` if the sample is paired-end; otherwise it is assumed to be single-ended.
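For a hypothetical paired-end sample `mock1`, the two entries might look like this sketch (check the exact layout against the existing entries in `config/compile_config.yaml`):

```
# config.yaml
samples_pe:
  - mock1

# config/compile_config.yaml
samples:
  mock1:
    sample: mock_community_1
    paired: True
```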
[^1]: Snakemake doesn't easily manage batching of jobs that share a common resource, such as a large metagenomic database whose dependent jobs should all run serially so that the database doesn't have to be reloaded into memory. To alleviate this and improve performance, run separate executions of snakemake for each method, e.g. `snakemake kraken_default_all; snakemake kraken_refseqc_all; snakemake diamond_default_all`, etc.
## Code Snippets

The following rule bodies are excerpted from the workflow's `rules/*.smk` files.
```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(AC_DIAMOND_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
```

```
shell:
    '''
    export TMPDIR={TMPDIR}
    metax blast-report --total-reads $(cat {input.total_reads}) --blast-report {output.report} --tax-dir {params.db} <(zstd -dc {input.data})
    '''
```
```
shell:
    '''
    export LC_ALL=C
    TMPDIR={TMPDIR}
    echo {input}
    env PICARD="{PICARD}" ~/efs/src/metax/scripts/bam_to_fastq_paired.sh {output[0]} {output[1]} {input}
    '''
```

```
shell:
    '''
    export LC_ALL=C TMPDIR={TMPDIR}
    picard -Xmx100g -Djava.io.tmpdir={TMPDIR} FastqToSam {params.inargs} OUTPUT={output} SAMPLE_NAME={wildcards.seq} QUALITY_FORMAT=Standard MAX_RECORDS_IN_RAM=5000000
    '''
```
```
run:
    shell('{DROPCACHE}')
    shell('makeblastdb -in {input.fna} -dbtype nucl -out {params.out} -taxid_map <(cut -f1-2 {input.taxid_map}) -parse_seqids', bench_record=bench_record)
```

```
shell:
    '''
    metax blast-report --total-reads $(cat {input.total_reads}) --blast-lca {output.lca} --blast-report {output.report} --tax-dir {params.db} {input.data}
    '''
```
```
shell:
    '''
    export PS1=
    source activate metax_py2
    echo {input}
    /usr/bin/time -v -o {log.time} \
    {params.exe} -i {input} -k {params.db} -o {output} -l G 2>&1 | tee {log.log}
    '''
```

```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(BRACKEN_SHELL, bench_record=bench_record)
```

```
run:
    shell('''\
    {DROPCACHE}
    ''')
    shell(r'''\
    export PS1=
    source activate metax_py2
    /usr/bin/time -v -o {log.time} \
    kraken --db {params.kraken_dir} --fasta-input --threads 32 <( find -L {params.kraken_dir}/library \( -name "*.fna" -o -name "*.fa" -o -name "*.fasta" \) -exec cat {{}} + ) > {params.dir}/database.kraken
    /usr/bin/time -v -a -o {log.time} \
    perl ~/efs/metax/Bracken/src/count-kmer-abundances.pl --db {params.kraken_dir} --read-length 150 {params.dir}/database.kraken > {params.dir}/database150mers.kraken_cnts
    /usr/bin/time -v -a -o {log.time} \
    python ~/efs/metax/Bracken/src/generate_kmer_distribution.py -i {params.dir}/database150mers.kraken_cnts -o {params.dir}/KMER_DISTR.150.txt
    ''', bench_record=bench_record)
```
```
shell:
    '''
    export PS1=
    source activate metax_py2
    {params.exe} -x {params.db} <(pigz -dc {input}) > {output.kreport}
    '''
```

```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(CENTRIFUGE_SHELL + CENTRIFUGE_REPORT_SHELL, bench_record=bench_record)
```

```
run:
    # shell('''\
    # metax create-centrifuge-map {NCBI_GENOMES}
    # cat all-*.map > all.map
    # ''')
    shell('''\
    {DROPCACHE}
    ''')
    shell('''\
    export TMPDIR={TMPDIR}
    TMP_DIR=$(mktemp -d)
```
```
shell:
    '''
    : > {output.aux_input_se}
    FILES=({input.se})
    for (( i=0; i<${{#FILES[@]}} ; i+=1 )) ; do
        echo "${{FILES[i]}}" >> {output.aux_input_se}
    done
    : > {output.aux_output}
    for i in {params.output_files_se}; do
        echo "$i" >> {output.aux_output}
    done
    /usr/bin/time -v -o {log.time} \
    {params.exe} {params.db} -n {threads} -O {output.aux_input_se} -R {output.aux_output} 2>&1 | tee {log.log}

    : > {output.aux_input_pe[0]}
    : > {output.aux_input_pe[1]}
    FILES=({input.pe})
    for (( i=0; i<${{#FILES[@]}} ; i+=2 )) ; do
        echo "${{FILES[i]}}" >> {output.aux_input_pe[0]}
        echo "${{FILES[i+1]}}" >> {output.aux_input_pe[1]}
    done
    : > {output.aux_output}
    for i in {params.output_files_pe}; do
        echo "$i" >> {output.aux_output}
    done
    /usr/bin/time -av -o {log.time} \
    {params.exe} {params.db} -n {threads} -P {output.aux_input_pe[0]} {output.aux_input_pe[1]} -R {output.aux_output} 2>&1 | tee -a {log.log}
    '''
```

```
shell:
    '''
    {params.exe} -F {input} {params.db} > {output}
    '''
```

```
shell:
    '''
    {params.exe} -F <(zcat {input}) {params.db} > {output}
    '''
```

```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(CLARK_SHELL + CLARK_BENCHMARK_ABUNDANCE_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
```

```
run:
    shell('{DROPCACHE}')
    shell('''\
    /usr/bin/time -v -o {log.time} \
    {config[CLARK_OPT]}/set_targets.sh /mnt/metax/workflow/db/refseqc/clark custom
    {config[CLARK]} -D {params.dir}/custom_0 -T {params.dir}/targets.txt -O /mnt/metax/workflow/tmp/empty.fastq -R empty.results
    ''', bench_record=bench_record)
```

```
run:
    shell('{DROPCACHE}')
    shell('''\
    echo "$(readlink -f {params.dir}/custom_0)" > {config[CLARK_OPT]}/.dbAddress
    # Must be in the CLARK install directory
    cd {config[CLARK_OPT]}
    {config[CLARK_OPT]}/buildSpacedDB.sh
    ''', bench_record=bench_record)
```
```
shell:
    '''
    /usr/bin/time --verbose --append -o {log.time} \
    {DIAMOND} blastx -q {input} --verbose -p {threads} -d {params.dmnd}{params.tmpdir}{params.blocksize}{params.index_chunks} --taxonmap {params.taxonmap} --taxonnodes {params.taxonnodes} --outfmt 102 -o {params.tax} 2>&1 | tee {log.log}
    {PIGZ} -9 {params.tax}
    '''
```

```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell('''\
    {params.input_args} | \
    /usr/bin/time --verbose --append -o {log.time} \
    {DIAMOND} blastx --verbose --log -p {threads} -d {params.dmnd}{params.tmpdir}{params.blocksize}{params.index_chunks} --taxonmap {params.taxonmap} --taxonnodes {params.taxonnodes} --outfmt 102 -o {output.tax} 2>&1 | tee {log.log}
    ''', bench_record=bench_record)
    shell('''\
    metax taxid-report --output {output.report} --tax-dir {config[TAXONOMY_DB]} {output.tax}
    ''')
```

```
shell:
    '''
    metax taxid-report --output {output} --tax-dir {config[TAXONOMY_DB]} <({PIGZ} -dc {input})
    '''
```

```
run:
    shell('{DROPCACHE}')
    shell('''\
    {DIAMOND} makedb --in {input.faa} -d db/refseqc/diamond/refseqc
    ''', bench_record=bench_record)
```
```
shell:
    '''
    mkdir {output}
    gsutil cp {input} - | pigz -d | tar x -C {output}
    '''
```

```
shell:
    '''
    mkdir {output}
    gsutil cp {input} - | zstd -d | tar x -C {output}
    '''
```

```
shell:
    '''
    gsutil cp {input} {output}
    '''
```

```
shell:
    '''
    gsutil cp {input} - | zstd -d | tar x
    '''
```
```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(GOTTCHA_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
```

```
shell:
    '''
    {params.kaiju_table} -r species -e -p -v -t {params.db_dir}/nodes.dmp -n {params.db_dir}/names.dmp -o {output} <(pigz -dc {input})
    '''
```

```
shell:
    '''
    {params.kaiju_table} -r genus -e -p -v -t {params.db_dir}/nodes.dmp -n {params.db_dir}/names.dmp -o {output} <(pigz -dc {input})
    '''
```

```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KAIJU_SHELL + KAIJU_REPORT_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
```

```
run:
    shell(r'''\
    export LC_ALL=C
    join -t$'\t' <(zcat {input.faa} | fasta_formatter -t | sed 's/ /\t/' ) {input.a2t} | \
    awk 'BEGIN {{FS=OFS="\t"}} {{printf(">%s\n%s\n", $4, $2)}}' > protein.faa
    ''')
    shell('''\
    /mnt/metax/src/kaiju/bin/mkbwt -n {threads} -l 100000 -a protein -o proteins protein.faa
    /mnt/metax/src/kaiju/bin/mkfmi proteins
    ''', bench_record=bench_record)
    shell('rm proteins.sa proteins.bwt')
```
```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KARP_SHELL, bench_record=bench_record)
```

```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KRAKEN2_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output.reads}')
```

```
run:
    shell('''\
    mkdir {params.dir}/taxonomy {params.dir}/library
    ln -sf {params.tax_db}/names.dmp {params.tax_db}/nodes.dmp {params.dir}/taxonomy
    {PIGZ} -fdc {params.tax_db}/accession2taxid/nucl_gb.accession2taxid.gz > {params.dir}/taxonomy/nucl_gb.accession2taxid
    {DROPCACHE}
    ''')
    shell('''\
    kraken2-build --db {params.dir} --add-to-library {input.fna}
    kraken2-build --db {params.dir} --build --threads {threads}
    ''', bench_record=bench_record)
```

```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KRAKENHLL_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output.reads}')
```

```
run:
    shell('''\
    mkdir {params.dir}/taxonomy {params.dir}/library
    ln -s {TAXONOMY_DB}/names.dmp {TAXONOMY_DB}/nodes.dmp {params.dir}/taxonomy
    {PIGZ} -dc {TAXONOMY_DB}/accession2taxid/nucl_gb.accession2taxid.gz > {params.dir}/taxonomy/nucl_gb.accession2taxid
    cp {input.seqmap} {params.dir}/library/
    {DROPCACHE}
    ''')
    shell('''\
    {KRAKENHLL_BUILD} --db {params.dir} --add-to-library {input.fna} > {log.log} || true # For some reason returns 255
    /usr/bin/time -v -o {log.time} \
    {KRAKENHLL_BUILD} --db {params.dir} --build --threads {threads} --jellyfish-hash-size 20000000000 >> {log.log}
    ''', bench_record=bench_record)
```

```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KRAKEN_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
```

```
run:
    shell('''\
    mkdir {params.dir}/taxonomy {params.dir}/library
    ln -sf {params.tax_db}/names.dmp {params.tax_db}/nodes.dmp {params.dir}/taxonomy
    {PIGZ} -fdc {params.tax_db}/accession2taxid/nucl_gb.accession2taxid.gz > {params.dir}/taxonomy/nucl_gb.accession2taxid
    {DROPCACHE}
    ''')
    shell('''\
    kraken-build --db {params.dir} --add-to-library {input.fna}
    kraken-build --db {params.dir} --build --threads {threads} # --jellyfish-hash-size
    ''', bench_record=bench_record)
```
```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KSLAM_SHELL, bench_record=bench_record)
```

```
shell:
    '''
    echo "$(zcat {input} | awk '$2 != 0' | wc -l) * {params.paired}" | bc -l > {output}
    '''
```

```
run:
    shell('''\
    mkdir -p db/refseqc/kslam
    {DROPCACHE}
    ''')
    shell('''\
    /usr/bin/time -v -o {log.time} \
    {KSLAM} --parse-taxonomy {params.tax_db}/names.dmp {params.tax_db}/nodes.dmp --output-file taxDB | tee {log.log}
    /usr/bin/time -v -a -o {log.time} \
    {KSLAM} --output-file {output.db} --parse-genbank {input.gbff} | tee -a {log.log}
    ''', bench_record=bench_record)
```

```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(METAOTHELLO_SHELL, bench_record=bench_record)
```

```
shell:
    '''
    metax metaothello-report --tax-dir {params.db} <({PIGZ} -dc {input}) {output}
    '''
```
```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(METAPHLAN2_SHELL, bench_record=bench_record)
```

```
shell:
    '''
    cut -f1 {input} | uniq | wc -l > {output}
    '''
```

```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    else:
        shell('rm {params.query_lca_db} {params.query_db}')
    shell(MMSEQS2_SHELL, bench_record=bench_record)
```

```
shell:
    '''
    metax mmseqs-report {input.data} {output} --total-reads $(cat {input.total_reads}) --tax-dir {params.db}
    '''
```

```
run:
    shell('''\
    /usr/bin/time -v -o {log.time} \
    {MMSEQS2} createdb {input.faa} {params.dir}/refseqc.db
    ''', bench_record=bench_record)
    # shell('''
    # metax create-mmseqs-taxonomy -o {params.dir}/refseqc.tsv --tax-db {TAXONOMY_DB} --accession2taxid {input.prot} {input.dead_prot} -- {params.dir}/refseqc.db_h
    # ''')
```

```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(MOTUS_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
```
```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(PATHSEQ_SHELL, bench_record=bench_record)
```

```
run:
    if params.is_paired:
        shell('expr "$(samtools view -c -f 0x41 -F 0xf04 {input})" + "$(samtools view -c -f 0x81 -F 0xf04 {input})" > {output}')
    else:
        shell('samtools view -c -F 0xf04 {input} > {output}')
```

```
shell:
    '''
    lbzip2 -dc {input} | sed -e '/^[^>]/s/X/N/g' | lbzip2 > {output}
    '''
```

```
run:
    if fastq_pairing == 'interleaved':
        shell("""
        echo {input[0]}
        echo {input[1]}
        paste <(paste - - - - < <(lbzcat {input[0]})) \
              <(paste - - - - < <(lbzcat {input[1]})) \
            | tr '\t' '\n' \
            | lbzip2 > {output}
        """)
    elif fastq_pairing == 'interleaved_add_suffix':
        shell(r"""
        echo {input[0]}
        echo {input[1]}
        paste <(paste - - - - < <(awk '{{if (NR % 4 == 1) {{printf "%s/1\n", $1}} \
            else if (NR % 4 == 3){{print "+"}} else {{print $0}} }}' <(lbzcat {input[0]}))) \
              <(paste - - - - < <(awk '{{if (NR % 4 == 1) {{printf "%s/2\n", $1}} \
            else if (NR % 4 == 3){{print "+"}} else {{print $0}} }}' <(lbzcat {input[1]}))) \
            | tr '\t' '\n' \
            | lbzip2 > {output}
        """)
    elif fastq_pairing == 'cat':
        shell("""
        echo {input[0]}
        echo {input[1]}
        lbzcat {input[0]} {input[1]} | lbzip2 > {output}
        """)
```

```
shell:
    """
    echo {input}
    lbzip2 --decompress --force --keep {input[0]}
    # If original file is empty
    touch {output}
    """
```

```
shell:
    """
    echo {input}
    lbzip2 --decompress --force --keep {input[0]}
    # If original file is empty
    touch {output}
    """
```

```
shell:
    '''
    fastqc -o qc -f fastq <(zcat {input})
    '''
```

```
shell:
    '''
    seqtk seq -a {input} > {output}
    '''
```

```
shell:
    '''
    seqtk seq -F F {input} > {output}
    '''
```

```
shell:
    '''
    /usr/bin/time -v -o {log.time} \
    {params.exe} {params.args} 2>&1 | tee {log}
    rm -f {params.ofastq1} {params.ofastq2}
    '''
```
```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(PROPHYLE_SHELL)
```

```
shell:
    '''
    metax prophyle-report{params.paired} --total-reads $(cat {input.total_reads}) --output {output.report} --tax-dir {params.db} {input.data}
    '''
```

```
run:
    shell('''
    mkdir -p db/refseqc/build_prophyle
    python -c "from ete3 import ncbi_taxonomy; ncbi_taxonomy.NCBITaxa(taxdump_file='{params.tax_db}/taxdump.tar.gz')"
    {DROPCACHE}
    ''')
    shell('''\
    awk -F "\t" -v OFS="\t" '$12=="Complete Genome" && $11=="latest" {{print $1, $6, $20}}' {input.assembly_summary} > ftpselection.tsv
    cut -f 3 ftpselection.tsv | awk 'BEGIN{{FS=OFS="/";filesuffix="genomic.fna.gz"}} {{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}}' > ftpfilepaths.tsv
    cut -f 1,2 ftpselection.tsv | sed 's/\.[0-9]*//' > acc2taxid.tsv
    prophyle_ncbi_tree.py archaea {NCBI_GENOMES}/refseq archaea.nw acc2taxid.tsv -l archaea.log
    prophyle_ncbi_tree.py bacteria {NCBI_GENOMES}/refseq bacteria.nw acc2taxid.tsv -l bacteria.log
    prophyle_ncbi_tree.py viral {NCBI_GENOMES}/refseq viral.nw acc2taxid.tsv -l viral.log
    ln -s {NCBI_GENOMES}/refseq/archaea .
    ln -s {NCBI_GENOMES}/refseq/bacteria .
    ln -s {NCBI_GENOMES}/refseq/viral .
    /usr/bin/time -v -o {log.time} \
    prophyle index -k 31 archaea.nw bacteria.nw viral.nw refseqc | tee {log.log}
    ''', bench_record=bench_record)
    shell('mv refseqc ../prophyle')
```
```
shell:
    '''
    ncbi-genome-download -F fasta,protein-fasta,genbank archaea,bacteria,viral
    touch {output}
    '''
```

```
shell:
    '''
    CWD=$(pwd)
    : > {output}
    cd {NCBI_GENOMES}/refseq
    DOMAINS=(archaea bacteria viral)
    for DOMAIN in ${{DOMAINS[@]}}; do
        pushd $DOMAIN
        awk -F "\t" '$12=="Complete Genome" && $11=="latest" {{print $1}}' assembly_summary.txt > ${{DOMAIN}}_selection.txt
        for i in $(awk -F "\t" '$12=="Complete Genome" {{print $1}}' assembly_summary.txt); do
            cat $i/*_genomic.fna.gz >> $CWD/{output.fna}
            cat $i/*_protein.faa.gz >> $CWD/{output.faa} || true
            {PIGZ} -dc $i/*_genomic.gbff.gz >> $CWD/{output.gbff}
        done
        popd
    done
    '''
```

```
shell:
    '''
    {PIGZ} -dk {input}
    '''
```

```
shell:
    '''
    export LC_ALL=C
    cat <({PIGZ} -dc {input.prot} | tail -n +2) <({PIGZ} -dc {input.dead_prot} | tail -n +2) \
        | cut -f2,3 | sort -S {resources.mem}G -k1,1 -t$'\t' > {output}
    '''
```

```
shell:
    '''
    export LC_ALL=C
    cat <({PIGZ} -dc {input.nucl} | tail -n +2) <({PIGZ} -dc {input.dead_nucl} | tail -n +2) \
        | cut -f2,3,4 | sort -S {resources.mem}G -k1,1 -t$'\t' > {output}
    '''
```

```
shell:
    '''
    export LC_ALL=C
    zcat {input} | fasta_formatter -t | cut -f1 | sed 's/ /\t/' | sort -k1,1 -s -S {resources.mem}G -t $'\t' > {output}
    '''
```

```
shell:
    '''
    export LC_ALL=C
    join -t$'\t' <(sed 's/ /\t/' {input.headers}) {input.a2t} | cut -f1,3,2 > {output}
    '''
```

```
shell:
    '''
    export LC_ALL=C
    join -t$'\t' {input.a2t} <({PIGZ} -dc {input.fna} | fasta_formatter -t | sed 's/ /\t/' | sort -k1,1 -S {resources.mem}G -t $'\t') | \
    awk 'BEGIN {{FS=OFS="\t"}} {{printf(">gi|%s|%s %s\n%s\n", $3, $1, $4, $5)}}' > {output}
    '''
```
```
run:
    shell('echo {input}')
    if wildcards.seq in samples_fastq:
        shell('''\
        TOTAL_READS=$(expr $(wc -l <(lbzcat {input}) | cut -f1 -d' ') / 4)
        echo $TOTAL_READS > {output}
        ''')
    elif wildcards.seq in samples_fasta:
        shell('''\
        TOTAL_READS=$(fasta_formatter -i <(lbzcat {input}) -t | wc -l)
        echo $TOTAL_READS > {output}
        ''')
```

```
shell:
    '''
    for i in {input}; do
        NAME="$(echo $(basename $i) | sed 's/.total_reads.txt//' )"
        echo -e "$NAME\t$(cat $i)" >> {output}
    done
    '''
```

```
shell:
    '''
    cat {input} > {output}
    '''
```

```
shell:
    '''
    OUTPUT=$(readlink -f {output})
    pushd db/refseqc
    cd blast
    echo -e "megablast\trefseqc\t$(du -c genomic* taxdb* | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../bracken
    echo -e "bracken\trefseqc\t$(du -c KMER_DISTR.150* database150* | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../centrifuge
    echo -e "centrifuge\trefseqc\t$(du -c refseqc.*.cf | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../clark
    echo -e "clark\trefseqc\t$(du -c custom_0/db_central* | tail -n 1 | cut -f1)" >> "$OUTPUT"
    echo -e "clark_s\trefseqc\t$(du -c custom_0/T** | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../diamond
    echo -e "diamond\trefseqc\t$(du -c * | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../kaiju
    echo -e "kaiju\trefseqc\t$(du -c * | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../karp
    echo -e "karp\trefseqc\t$(du -c karp.tax karp.fasta rrna_karp.index | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../kraken
    echo -e "kraken\trefseqc\t$(du -c database.idx database.kdb taxonomy | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../kraken2
    echo -e "kraken2\trefseqc\t$(du -c *.k2d | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../krakenhll
    echo -e "krakenhll\trefseqc\t$(du -c database.idx database.kdb taxDB | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../kslam
    echo -e "kslam\trefseqc\t$(du -c database taxDB | tail -n 1 | cut -f1)" >> "$OUTPUT"
    # cd ../metaothello
    # echo -e "metaothello\trefseqc\t$(du -c id2tax names output.index | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../mmseqs2
    echo -e "mmseqs2\trefseqc\t$(du -c * | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../prophyle
    echo -e "prophyle\trefseqc\t$(du -c abv | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../taxmaps
    echo -e "taxmaps\trefseqc\t$(du -c refseqc.gem refseqc.len taxonomy.tbl | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd
    pushd db/gottcha
    echo -e "gottcha\tdefault\t$(du -c GOTTCHA_BACTERIA_c4937_k24_u30_xHUMAN3x.species.* GOTTCHA_VIRUSES_c5900_k24_u30_xHUMAN3x.species.* *.dmp | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd
    pushd db/metaphlan2
    echo -e "metaphlan2\tdefault\t$(du -c db_v20 | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd
    pushd ~/efs/miniconda3/envs/metax/share/motus-2.0.1
    echo -e "motus2\tdefault\t$(du -c db_mOTU | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd
    pushd db/pathseq
    echo -e "pathseq\tdefault\t$(du -c pathseq_microbe* pathseq_taxonomy.db | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd
    pushd db/metaothello
    echo -e "metaothello\tdefault\t$(du -c 20161108 | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd
    '''
```
*From `rules/reports.smk`.*
```
shell:
    '''
    for i in {input}; do
        NAME="$(echo $(basename $i) | sed 's/.txt//' )"
        echo -e "$NAME\t$(cat $i)" >> {output}
    done
    '''
```

```
shell:
    '''
    metax compile --tax-dir {config[TAXONOMY_DB]} --classified-counts {input.classified_counts} --read-counts {input.total_reads} --dir reports --config config/compile_config.yaml --rank-abundances {output.ranksum} {output.reports}
    '''
```

```
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(taxmaps_rule, bench_record=bench_record)
```

```
run:
    shell('{DROPCACHE}')
    shell('''\
    /usr/bin/time -v -o {log.time} \
    env PATH={GEMPATH}:$PATH \
    {TAXMAPS_TBL} -n {TAXONOMY_DB}/names.dmp -t {TAXONOMY_DB}/nodes.dmp > {params.dir}/taxonomy.tbl 2>&1 | tee {log.log}
    /usr/bin/time -a -v -o {log.time} \
    env PATH={GEMPATH}:$PATH \
    {TAXMAPS_INDEX} -i {input.fna} -c {TAXONOMY_DB}/gi_taxid_nucl.dmp -t taxonomy.tbl -p {params.dir}/refseqc | tee -a {log.log}
    ''', bench_record=bench_record)
```