Struo
Struo: a pipeline for building custom databases for common metagenome profilers
"Struo" --> from the Latin: “I build” or “I gather”
Version: 0.1.7

Authors:
- Nick Youngblut nyoungb2@gmail.com
- Jacobo de la Cuesta jacobo.delacuesta@tuebingen.mpg.de

Maintainers:
- Nick Youngblut nyoungb2@gmail.com
- Jacobo de la Cuesta jacobo.delacuesta@tuebingen.mpg.de
Previous name
- Ley Lab MetaGenome Profiler DataBase generator (LLMGP-DB)
Citation
Cuesta-Zuluaga, Jacobo de la, Ruth E. Ley, and Nicholas D. Youngblut. 2019. "Struo: A Pipeline for Building Custom Databases for Common Metagenome Profilers." Bioinformatics, November. https://doi.org/10.1093/bioinformatics/btz899
Struo2
- Faster than Struo and allows for efficient database updating
Pre-built custom databases
Custom GTDB databases are available at the Struo data FTP server.
GTDB releases available:
- Release 86 (14.03.2019)
  - Number of genomes included: 21,276
  - NCBI taxonomy/taxIDs used
- Release 89 (30.08.2019)
  - Number of genomes included: 23,361
  - GTDB taxdump (taxIDs assigned with gtdb_to_taxdump)
  - Genome phylogeny: GTDB ar122_r89.tree & bac120_r89.tree grafted together
  - Genome phenotypes: inferred with the py3 implementation of Traitar
- Release 95 (13.07.2020)
  - Number of genomes included: 30,989
  - GTDB taxdump (taxIDs assigned with gtdb_to_taxdump)
  - Genome phylogeny: GTDB ar122_r95.tree & bac120_r95.tree grafted together
Notes/warnings
- The taxdump taxIDs are NOT stable! Do not mix and match among GTDB releases!
- You can use ncbi-gtdb_map.py from the gtdb_to_taxdump repo to convert between NCBI and GTDB taxonomies.
Tutorial
For a step-by-step example of how to prepare and execute Struo, see the notebook in the ./tutorial/ folder.
Description
Struo's workflow encompasses the steps from genome download to database construction.
Setup
Download
To download the pipeline, clone the Git repository:
git clone git@github.com:leylabmpi/Struo.git
conda env setup
Versions listed are those that have been tested:
- python=3.6
- snakemake=5.7.0
- r-base=3.6
- r-argparse=2.0.1
- r-curl=4.2
- r-data.table=1.12.4
- r-dplyr=0.8.3
- ncbi-genome-download=0.2.10
- newick_utils=1.6
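For reference, a minimal sketch of creating a matching conda environment (assuming the packages are available via the bioconda and conda-forge channels; if the repo ships its own environment file, prefer that instead):

conda create -n struo -c bioconda -c conda-forge \
    python=3.6 snakemake=5.7.0 r-base=3.6 r-argparse=2.0.1 \
    r-curl=4.2 r-data.table=1.12.4 r-dplyr=0.8.3 \
    ncbi-genome-download=0.2.10 newick_utils=1.6
conda activate struo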
UniRef diamond database(s)
You will need a UniRef diamond database for the humann2 database construction (e.g., UniRef90). See the "Download a translated search database" section of the humann2 docs.
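For example, a hedged sketch of fetching the UniRef90 diamond database with the humann2_databases utility (database/build names and sizes depend on your humann2 version, so check the available listing first):

# list downloadable databases (names vary by humann2 version)
humann2_databases --available
# download the UniRef90 diamond database to ./humann2_dbs/ (large download)
humann2_databases --download uniref uniref90_diamond ./humann2_dbs/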
Getting reference genomes for the custom databases
Downloading genomes
- If using GTDB genomes, run GTDB_metadata_filter.R to select genomes.
- If downloading genomes from GenBank/RefSeq, you can use genome_download.R.

Example:
# Filtering GTDB metadata to certain genomes
./GTDB_metadata_filter.R -o gtdb-r89_bac-arc.tsv https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/bac120_metadata_r89.tsv https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/ar122_metadata_r89.tsv
# Downloading all genomes (& creating tab-delim table of genome info)
./genome_download.R -o genomes -p 8 gtdb-r89_bac-arc.tsv > genomes.txt
# Note: the output of ./genome_download.R can be directly used for running the `Struo` pipeline (see below)
User-provided databases
Users can also provide genomes as compressed fasta files (.fna.gz). This also requires adding the corresponding information to the samples.txt file (see below).
Input data (samples.txt file)
The table of input files/data can be created using the helper scripts described above.
The pipeline requires a tab-delimited table that includes the following columns (column names specified in the config.yaml file):
- Sample ID
  - This will usually just be the species/strain name
- Path to the genome assembly fasta file
  - NOTE: these must be gzip'ed
- Taxonomy ID
  - This should be the NCBI taxonomy ID at the species/strain level
  - Needed for Kraken
- Taxonomy
  - This should at least include g__<genus>;s__<species>
  - The taxonomy can include higher levels, as long as levels 6 & 7 are genus and species
  - Any taxonomy lacking genus and/or species levels will be labeled g__unclassified (if no genus) or s__unclassified (if no species)
  - This is needed for humann2

Other columns in the file will be ignored. The path to the samples file should be specified in the config.yaml file (see below).
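A minimal sketch of such a table (tab-delimited; the column names shown here are illustrative, so use whatever names are set in your config.yaml):

ncbi_organism_name	fasta_file_path	ncbi_species_taxid	ncbi_taxonomy
Escherichia coli K-12	genomes/GCF_000005845.2.fna.gz	511145	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli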
Using the GTDB taxonomy instead of NCBI taxIDs
kraken2 & humann2 databases use NCBI taxIDs, and thus the NCBI taxonomy is used by default for Struo. You can instead create custom taxIDs from the GTDB taxonomy with gtdb_to_taxdump. Given the resulting names.dmp and nodes.dmp files, along with a genome metadata file that includes the gtdb_taxids, you can modify the Struo pipeline to fully use the GTDB taxonomy & taxIDs. You will need to modify the config.yaml file (see "If using GTDB taxIDs" below).
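As a rough sketch, the conversion might look like the following; the script arguments, taxonomy file names, and output locations shown are assumptions, so check the gtdb_to_taxdump README for the exact usage:

# hypothetical invocation; adjust the taxonomy files to your GTDB release
./gtdb_to_taxdump.py \
    ar122_taxonomy_r95.tsv bac120_taxonomy_r95.tsv \
    > taxID_info.tsv
# the resulting names.dmp and nodes.dmp are then referenced in config.yaml (see below)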
Running the pipeline
Edit the config.yaml
- Specify the input/output paths.
- Modify parameters as needed.
- Make sure to add the path to the UniRef diamond database for HUMAnN2 (see above for instructions on retrieving this file).
- The samples_col: column specified should contain unique values.
  - The default ncbi_organism_name column in the GTDB metadata does not contain unique values.
  - For the pre-built Struo databases, we merged assembly accessions with the original ncbi_organism_name, e.g., GB_GCA_001784635.1_Candidatus Micrarchaeota archaeon RBG_16_49_10
- Modify temp_folder: if needed. This folder is used just for read/write of temporary files.
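A hedged sketch of the relevant config.yaml entries; key names not shown elsewhere in this README (e.g. the samples table path and the UniRef diamond DB key) are hypothetical, so match them to the keys actually present in your config.yaml:

samples_file: samples.txt          # hypothetical key name; path to the samples table
samples_col: 'ncbi_organism_name'  # must contain unique values
db_name: custom_gtdb_r95           # example value; name of the database to build
temp_folder: /tmp/struo/           # scratch space for temporary files
# uniref_dmnd_db: /path/to/uniref90_diamond/uniref90.dmnd   # hypothetical key; UniRef diamond DB for HUMAnN2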
If using GTDB taxIDs
If you have followed "Using the GTDB taxonomy instead of NCBI taxIDs" above, then
make the following modifications to the
config.yaml
file:
## column names in samples table
taxID_col: 'gtdb_taxid'
taxonomy_col: 'gtdb_taxonomy'
#-- if custom NCBI taxdump files --#
names_dmp: /YOUR/PATH/TO/names.dmp
nodes_dmp: /YOUR/PATH/TO/nodes.dmp
Running locally
snakemake --use-conda
Running on a cluster
If using SGE, then you can use the snakemake_sge.sh script. You can create a similar bash script for other cluster architectures; see the Snakemake documentation on cluster execution for help.
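For other schedulers, a minimal sketch using Snakemake's generic --cluster submission (the qsub resource flags and parallel environment name are assumptions and should be adapted to your cluster):

snakemake --use-conda --jobs 50 --latency-wait 60 \
    --cluster "qsub -cwd -V -pe parallel {threads} -l h_vmem=8G"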
General info on using snakemake
Snakemake allows for easy re-running of the pipeline on just the genomes that have not yet been processed. You can simply add more genomes to the input table and re-run the pipeline (test first with --dryrun). Snakemake should then process only the new genomes and re-create the combined database files (this last step must be done each time). Make sure not to modify the files in the nuc_filtered and prot_filtered directories! Otherwise, snakemake may try to run all genomes again through the computationally expensive gene annotation process.
Using the resulting databases
Set the database paths in humann2, kraken2, etc. to the new, custom database files:
- humann2
  - nucleotide: all_genes_annot.fna.gz
  - amino acid: all_genes_annot.dmnd
- kraken2
  - database*mers.kraken
Example of a humann2 run
Run humann2 with custom databases created by Struo. Change the PATHs as necessary.
STRUO_OUT_DIR=./struo_output/
NUC_DB=`dirname $STRUO_OUT_DIR"/all_genes_annot.fna.gz"`
PROT_DB=`dirname $STRUO_OUT_DIR"/all_genes_annot.dmnd"`
MTPHLN_BT2_DB=`dirname ./metaphlan2_db/mpa_v20_m200/mpa_v20_m200.1.bt2`
MTPHLN_PKL_DB=/ebio/abt3_projects2/databases_no-backup/metaphlan2/mpa_v20_m200/mpa_v20_m200.pkl
humann2 --gap-fill on --bypass-nucleotide-index \
--nucleotide-database $NUC_DB \
--protein-database $PROT_DB \
--metaphlan-options "--mpa_pkl $MTPHLN_PKL_DB --bowtie2db $MTPHLN_BT2_DB" \
--tmp-dir /dev/shm/humann2_temp/ \
--threads 12 \
--input-format fastq \
--output-basename SRS018656 \
--input SRS018656_R1.fq
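Similarly, a hedged sketch of classifying reads against the custom kraken2 database (the database directory and read file names are placeholders):

KRAKEN2_DB=./struo_output/kraken2/
kraken2 --db $KRAKEN2_DB --threads 12 \
    --output SRS018656.kraken --report SRS018656.kreport \
    SRS018656_R1.fq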
Adding more samples (genomes) to an existing custom DB
If you set keep_intermediate: True for your initial run, then the intermediate files from the computationally intensive steps are kept, so those genomes don't have to be reprocessed. Only new genomes will be processed, and then the database(s) will be re-created with the old + new genomes.

To create a database with more genomes:
- Add new genomes to the input table.
- If you want to overwrite your old databases:
  - DO NOT change the db_name: parameter in the config.yaml file.
- OR if you want to create a new database:
  - Change the db_name: parameter in the config.yaml file.
- Re-run the snakemake pipeline.
  - Snakemake should skip the genomes that have already been processed.
  - Use --dryrun to see what snakemake is going to do before actually running the pipeline.
  - You may need to set use_ancient: True in order to have snakemake skip the diamond mapping for humann2. This is needed if the timestamps on the genome gene files have been (accidentally) modified since the last run.
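A minimal sketch of the config.yaml settings mentioned above (the values shown are examples, not defaults):

keep_intermediate: True   # keep per-genome intermediate files so they are not recomputed
use_ancient: True         # only needed if gene-file timestamps were accidentally modified
db_name: custom_gtdb_r95  # keep unchanged to overwrite, or change to build a new database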
Adding existing gene sequences to humann2 databases
If you have gene sequences already formatted for creating a humann2 custom DB, and you'd like to include them with the gene sequences generated from the input genomes, then just provide the file paths to the nuc/prot fasta files (humann2_nuc_seqs and humann2_prot_seqs in the config.yaml file). All genes (from genomes & user-provided) will be clustered together with vsearch. See the vsearch_all: setting in the config.yaml for the default clustering parameters used. You can use vsearch_all: Skip to skip the clustering, in which case all of the sequences will simply be combined without removing redundancies.
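A hedged config.yaml sketch for this option (the file paths are placeholders and the vsearch parameter string is an assumption; keep the values already in your config.yaml unless you need to change them):

humann2_nuc_seqs: /path/to/extra_genes.fna.gz    # user-provided nucleotide gene sequences
humann2_prot_seqs: /path/to/extra_genes.faa.gz   # user-provided protein gene sequences
vsearch_all: --id 0.97    # example clustering parameters; set to Skip to just concatenate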
Utilities
GTDB_metadata_filter.R
This tool is useful for selecting which GTDB genomes to include in a custom database.
Filter >=1 genome assembly metadata file (e.g., bac120_metadata_r89.tsv) by assembly quality or other parameters.
genome_download.R
This tool is useful for downloading genomes from NCBI.
Download a set of genomes based on NCBI assembly accessions provided in a table. The file paths of the downloaded genome fasta files will be appended to the input table.
tree_prune.py
This tool is useful for creating a GTDB archaea/bacteria phylogeny of all genomes in your custom database. The phylogeny can be used for phylogenetic analyses of metagenomes (e.g., Faith's PD or Unifrac).
Prune >=1 phylogeny to just certain taxa. If >1 phylogeny provided, then the phylogenies are merged.
gtdb_to_taxdump
This is a separate repo. It is useful for creating an NCBI-style taxdump (names.dmp and nodes.dmp) from the GTDB taxonomy. Note that the taxIDs are arbitrary and don't match anything in the NCBI taxonomy!
TODO
- Create a diamond DB using diamond >=0.9 so that users can run humann2 with the most up-to-date version of diamond.
  - Note: this will require creating an updated UniRef50 DB.
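For reference, a hedged sketch of building such a DB with a newer diamond release (the UniRef fasta path is a placeholder, and humann2 compatibility with the resulting DB format would still need to be verified):

# build a diamond database from a UniRef fasta with diamond >=0.9
diamond makedb --in uniref50.fasta -d uniref50_v0.9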
Code Snippets
shell:
    """
    # location of the kraken2 db files
    DB=`dirname {input.kraken2_db}`
    # removing existing files possibly created by bracken
    TMP_FILE=$DB"/database.kraken"
    rm -f $TMP_FILE
    {params.exe} -t {threads} -d $DB \
      -k {params.kmer} -l {params.read_len} \
      2> {log} 1>&2
    """

shell:
    """
    gunzip -c {input.fasta} | \
      prodigal {params.params} \
      -o {output.gbk} -d {output.fna} -a {output.faa} \
      2> {log} 1>&2
    """

shell:
    """
    DB="{input.dmnd_db}"
    TMPDIR="{params.tmp_dir}"
    mkdir -p $TMPDIR 2>> {log}
    # diamond run
    diamond blastp {params.params} \
      -q {input.faa} -d $DB -o {output.hits} \
      --tmpdir $TMPDIR --threads {threads} \
      --outfmt 6 qseqid sseqid pident length qstart qend qlen sstart send slen evalue \
      2>> {log} 1>&2
    """

shell:
    """
    OUTDIR=`dirname {output.fna}`
    {params.exe} --gzip --outdir $OUTDIR \
      --columns qseqid,sseqid,pident,length,qstart,qend,qlen,sstart,send,slen,evalue \
      {input.hits} {input.fna} {input.faa} "{params.tax}" {params.taxID} \
      2> {log} 1>&2
    """

shell:
    """
    OUTDIR=`dirname {output.fna}`
    {params.exe} --outdir $OUTDIR \
      --columns qseqid,sseqid,pident,length,qstart,qend,qlen,sstart,send,slen,evalue \
      {input.hits} {input.fna} {input.faa} "{params.tax}" {params.taxID} \
      2> {log} 1>&2
    """

shell:
    """
    # cluster
    if [ "{params.keep_int}" == "True" ]; then
      vsearch {params.params} \
        --threads {threads} \
        --cluster_fast <(gunzip -c {input.fna}) \
        --centroids {output.reps} \
        2> {log} 1>&2
      # filter to just reps (duplicates removed)
      {params.exe} --gzip {output.reps} {input.faa} \
        {output.fna} {output.faa} 2> {log} 1>&2
    else
      vsearch {params.params} \
        --threads {threads} \
        --cluster_fast {input.fna} \
        --centroids {output.reps} \
        2> {log} 1>&2
      # filter to just reps (duplicates removed)
      {params.exe} --gzip {output.reps} {input.faa} \
        {output.fna} {output.faa} 2> {log} 1>&2
    fi
    """

run:
    # concatenate the gzip'ed per-genome files into one uncompressed output file
    import os,gzip
    with open(output[0], 'w') as outF:
        for F in input:
            with gzip.open(F, 'rb') as inF:
                for line in inF:
                    outF.write(line.decode('utf-8'))

run:
    # concatenate the gzip'ed per-genome files into one uncompressed output file
    import os,gzip
    with open(output[0], 'w') as outF:
        for F in input:
            with gzip.open(F, 'rb') as inF:
                for line in inF:
                    outF.write(line.decode('utf-8'))

shell:
    """
    # cluster
    vsearch {params.params} \
      --threads {threads} \
      --cluster_fast {input.fna} \
      --centroids {output.reps} \
      2> {log} 1>&2
    # filter AA seqs to just reps (duplicates removed)
    {params.exe} --gzip {output.reps} {input.faa} \
      {output.fna} {output.faa} 2> {log} 1>&2
    """

shell:
    """
    pigz -p {threads} -c {input.fna} > {output.fna} 2> {log}
    pigz -p {threads} -c {input.faa} > {output.faa} 2>> {log}
    """

shell:
    """
    bowtie2-build --threads {threads} \
      {input} {params.prefix} 2> {log} 1>&2
    # check that output exists
    OUTDIR=`dirname {params.prefix}`
    IDX_FILES=`find $OUTDIR -maxdepth 1 -name "*.bt2*"`
    IDX_FILES=`echo $IDX_FILES | perl -pe 's/ +/\n/g' | wc -l`
    if [ $IDX_FILES -lt 1 ]; then
      echo "ERROR: no bowtie2 index files found!"
      exit 1
    fi
    """

shell:
    """
    PREF=`echo {output} | perl -pe 's/\.[^.]+$//'`
    diamond makedb --in {input} -d $PREF 2> {log} 1>&2
    """

shell:
    """
    cp -f {input.names} {output.names} 2> {log}
    cp -f {input.nodes} {output.nodes} 2>> {log}
    """

shell:
    """
    OUTDIR=`dirname {output.gb}`
    OUTDIR=`dirname $OUTDIR`
    rm -rf $OUTDIR 2> {log}
    mkdir -p $OUTDIR 2>> {log}
    echo "# Downloading NCBI taxonomy to $OUTDIR" >> {log}
    {params.exe} --use-ftp --download-taxonomy --db $OUTDIR 2>> {log} 1>&2
    """

shell:
    """
    {params.exe} {input.fasta} {params.taxID} > {output} 2> {log}
    """

shell:
    """
    DB=`dirname {input.names}`
    DB=`dirname $DB`
    kraken2-build --db $DB --add-to-library {input.fasta} 2> {log} 1>&2
    touch {output} 2>> {log}
    """

shell:
    """
    DB=`dirname {output.hash}`
    kraken2-build --build --threads {threads} --db $DB 2> {log} 1>&2
    """