Struo
Struo: a pipeline for building custom databases for common metagenome profilers
"Struo" --> from the Latin: “I build” or “I gather”
Version: 0.1.7

Authors:
- Nick Youngblut nyoungb2@gmail.com
- Jacobo de la Cuesta jacobo.delacuesta@tuebingen.mpg.de

Maintainers:
- Nick Youngblut nyoungb2@gmail.com
- Jacobo de la Cuesta jacobo.delacuesta@tuebingen.mpg.de
Previous name
- Ley Lab MetaGenome Profiler DataBase generator (LLMGP-DB)
Citation
Cuesta-Zuluaga, Jacobo de la, Ruth E. Ley, and Nicholas D. Youngblut. 2019. "Struo: A Pipeline for Building Custom Databases for Common Metagenome Profilers." Bioinformatics, November. https://doi.org/10.1093/bioinformatics/btz899
Struo2
- Faster than Struo and allows for efficient database updating
Pre-built custom databases
Custom GTDB databases are available at the Struo data FTP server.
GTDB releases available:
- Release 86 (14.03.2019)
  - Number of genomes included: 21,276
  - NCBI taxonomy/taxIDs used
- Release 89 (30.08.2019)
  - Number of genomes included: 23,361
  - GTDB taxdump (taxIDs assigned with gtdb_to_taxdump)
  - Genome phylogeny: GTDB ar122_r89.tree & bac120_r89.tree grafted together
  - Genome phenotypes: inferred with the py3 implementation of Traitar
- Release 95 (13.07.2020)
  - Number of genomes included: 30,989
  - GTDB taxdump (taxIDs assigned with gtdb_to_taxdump)
  - Genome phylogeny: GTDB ar122_r95.tree & bac120_r95.tree grafted together
Notes/warnings
- The taxdump taxIDs are NOT stable! Do not mix and match among GTDB releases!
- You can use ncbi-gtdb_map.py from the gtdb_to_taxdump repo to convert between NCBI and GTDB taxonomies.
Tutorial
For a step-by-step example of how to prepare and execute Struo, see the notebook in the ./tutorial/ folder.
Description
Struo's workflow encompasses the steps from genome download to database construction.
Setup
Download
To download the pipeline, clone the Git repository:
git clone git@github.com:leylabmpi/Struo.git
conda env setup
Versions listed are those that have been tested:
- python=3.6
- snakemake=5.7.0
- r-base=3.6
- r-argparse=2.0.1
- r-curl=4.2
- r-data.table=1.12.4
- r-dplyr=0.8.3
- ncbi-genome-download=0.2.10
- newick_utils=1.6
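For reference, a minimal sketch of creating a matching conda environment (assuming the packages are available via the bioconda and conda-forge channels; if the repo ships its own environment file, prefer that instead):

conda create -n struo -c bioconda -c conda-forge \
    python=3.6 snakemake=5.7.0 r-base=3.6 r-argparse=2.0.1 \
    r-curl=4.2 r-data.table=1.12.4 r-dplyr=0.8.3 \
    ncbi-genome-download=0.2.10 newick_utils=1.6
conda activate struo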
UniRef diamond database(s)
You will need a UniRef diamond database for the humann2 database construction (e.g., UniRef90). See the "Download a translated search database" section of the humann2 docs.
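For example, a hedged sketch of fetching the UniRef90 diamond database with the humann2_databases utility (database/build names and sizes depend on your humann2 version, so check the available listing first):

# list downloadable databases (names vary by humann2 version)
humann2_databases --available
# download the UniRef90 diamond database to ./humann2_dbs/ (large download)
humann2_databases --download uniref uniref90_diamond ./humann2_dbs/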
Getting reference genomes for the custom databases
Downloading genomes
- If using GTDB genomes, run GTDB_metadata_filter.R to select genomes.
- If downloading genomes from GenBank/RefSeq, you can use genome_download.R.

Example:
# Filtering GTDB metadata to certain genomes
./GTDB_metadata_filter.R -o gtdb-r89_bac-arc.tsv https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/bac120_metadata_r89.tsv https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/ar122_metadata_r89.tsv
# Downloading all genomes (& creating tab-delim table of genome info)
./genome_download.R -o genomes -p 8 gtdb-r89_bac-arc.tsv > genomes.txt
# Note: the output of ./genome_download.R can be directly used for running the `Struo` pipeline (see below)
User-provided databases
Users can also provide genomes as compressed fasta files (.fna.gz). This also requires adding the corresponding information to the samples.txt file (see below).
Input data (samples.txt file)
The table of input files/data can be created using the helper scripts described above.
The pipeline requires a tab-delimited table that includes the following columns (column names specified in the config.yaml file):
- Sample ID
  - This will usually just be the species/strain name
- Path to the genome assembly fasta file
  - NOTE: these must be gzip'ed
- Taxonomy ID
  - This should be the NCBI taxonomy ID at the species/strain level
  - Needed for Kraken
- Taxonomy
  - This should at least include g__<genus>;s__<species>
  - The taxonomy can include higher levels, as long as levels 6 & 7 are genus and species
  - Any taxonomy lacking genus and/or species levels will be labeled g__unclassified (if no genus) or s__unclassified (if no species)
  - This is needed for humann2

Other columns in the file will be ignored. The path to the samples file should be specified in the config.yaml file (see below).
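A minimal sketch of such a table (tab-delimited; the column names shown here are illustrative, so use whatever names are set in your config.yaml):

ncbi_organism_name	fasta_file_path	ncbi_species_taxid	ncbi_taxonomy
Escherichia coli K-12	genomes/GCF_000005845.2.fna.gz	511145	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli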
Using the GTDB taxonomy instead of NCBI taxIDs
kraken2 & humann2 databases use NCBI taxIDs, and thus the NCBI taxonomy is used by default for Struo. You can instead create custom taxIDs from the GTDB taxonomy with gtdb_to_taxdump. Given the resulting names.dmp and nodes.dmp files, along with a genome metadata file that includes the gtdb_taxids, you can modify the Struo pipeline to fully use the GTDB taxonomy & taxIDs. You will need to modify the config.yaml file (see "If using GTDB taxIDs" below).
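As a rough sketch, the conversion might look like the following; the script arguments, taxonomy file names, and output locations shown are assumptions, so check the gtdb_to_taxdump README for the exact usage:

# hypothetical invocation; adjust the taxonomy files to your GTDB release
./gtdb_to_taxdump.py \
    ar122_taxonomy_r95.tsv bac120_taxonomy_r95.tsv \
    > taxID_info.tsv
# the resulting names.dmp and nodes.dmp are then referenced in config.yaml (see below)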
Running the pipeline
Edit the config.yaml
- Specify the input/output paths.
- Modify parameters as needed.
- Make sure to add the path to the UniRef diamond database for HUMAnN2 (see above for instructions on retrieving this file).
- The samples_col: column specified should contain unique values.
  - The default ncbi_organism_name column in the GTDB metadata does not contain unique values.
  - For the pre-built Struo databases, we merged assembly accessions with the original ncbi_organism_name, e.g., GB_GCA_001784635.1_Candidatus Micrarchaeota archaeon RBG_16_49_10
- Modify temp_folder: if needed. This folder is used just for read/write of temporary files.
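A hedged sketch of the relevant config.yaml entries; key names not shown elsewhere in this README (e.g. the samples table path and the UniRef diamond DB key) are hypothetical, so match them to the keys actually present in your config.yaml:

samples_file: samples.txt          # hypothetical key name; path to the samples table
samples_col: 'ncbi_organism_name'  # must contain unique values
db_name: custom_gtdb_r95           # example value; name of the database to build
temp_folder: /tmp/struo/           # scratch space for temporary files
# uniref_dmnd_db: /path/to/uniref90_diamond/uniref90.dmnd   # hypothetical key; UniRef diamond DB for HUMAnN2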
If using GTDB taxIDs
If you have followed "Using the GTDB taxonomy instead of NCBI taxIDs" above, then
make the following modifications to the
config.yaml
file:
## column names in samples table
taxID_col: 'gtdb_taxid'
taxonomy_col: 'gtdb_taxonomy'
#-- if custom NCBI taxdump files --#
names_dmp: /YOUR/PATH/TO/names.dmp
nodes_dmp: /YOUR/PATH/TO/nodes.dmp
Running locally
snakemake --use-conda
Running on a cluster
If using SGE, then you can use the snakemake_sge.sh script. You can create a similar bash script for other cluster architectures; see the Snakemake documentation on cluster execution for help.
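For other schedulers, a minimal sketch using Snakemake's generic --cluster submission (the qsub resource flags and parallel environment name are assumptions and should be adapted to your cluster):

snakemake --use-conda --jobs 50 --latency-wait 60 \
    --cluster "qsub -cwd -V -pe parallel {threads} -l h_vmem=8G"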
General info on using snakemake
Snakemake allows for easy re-running of the pipeline on just the genomes that have not yet been processed. You can simply add more genomes to the input table and re-run the pipeline (test first with --dryrun). Snakemake should then process only the new genomes and re-create the combined database files (this last step must be done each time). Make sure not to modify the files in the nuc_filtered and prot_filtered directories! Otherwise, snakemake may try to run all genomes again through the computationally expensive gene annotation process.
Using the resulting databases
Set the database paths in humann2, kraken2, etc. to the new, custom database files:
- humann2
  - nucleotide: all_genes_annot.fna.gz
  - amino acid: all_genes_annot.dmnd
- kraken2
  - database*mers.kraken
Example of a humann2 run
Run humann2 with custom databases created by Struo. Change the PATHs as necessary.
STRUO_OUT_DIR=./struo_output/
NUC_DB=`dirname $STRUO_OUT_DIR"/all_genes_annot.fna.gz"`
PROT_DB=`dirname $STRUO_OUT_DIR"/all_genes_annot.dmnd"`
MTPHLN_BT2_DB=`dirname ./metaphlan2_db/mpa_v20_m200/mpa_v20_m200.1.bt2`
MTPHLN_PKL_DB=/ebio/abt3_projects2/databases_no-backup/metaphlan2/mpa_v20_m200/mpa_v20_m200.pkl
humann2 --gap-fill on --bypass-nucleotide-index \
--nucleotide-database $NUC_DB \
--protein-database $PROT_DB \
--metaphlan-options "--mpa_pkl $MTPHLN_PKL_DB --bowtie2db $MTPHLN_BT2_DB" \
--tmp-dir /dev/shm/humann2_temp/ \
--threads 12 \
--input-format fastq \
--output-basename SRS018656 \
--input SRS018656_R1.fq
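Similarly, a hedged sketch of classifying reads against the custom kraken2 database (the database directory and read file names are placeholders):

KRAKEN2_DB=./struo_output/kraken2/
kraken2 --db $KRAKEN2_DB --threads 12 \
    --output SRS018656.kraken --report SRS018656.kreport \
    SRS018656_R1.fq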
Adding more samples (genomes) to an existing custom DB
If you set keep_intermediate: True for your initial run, then the intermediate files from the computationally intensive steps are kept, so those genomes don't have to be reprocessed. Only new genomes will be processed, and then the database(s) will be re-created with the old + new genomes.

To create a database with more genomes:
- Add new genomes to the input table.
- If you want to overwrite your old databases:
  - DO NOT change the db_name: parameter in the config.yaml file.
- OR if you want to create a new database:
  - Change the db_name: parameter in the config.yaml file.
- Re-run the snakemake pipeline.
  - Snakemake should skip the genomes that have already been processed.
  - Use --dryrun to see what snakemake is going to do before actually running the pipeline.
  - You may need to set use_ancient: True in order to have snakemake skip the diamond mapping for humann2. This is needed if the timestamps on the genome gene files have been (accidentally) modified since the last run.
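A minimal sketch of the config.yaml settings mentioned above (the values shown are examples, not defaults):

keep_intermediate: True   # keep per-genome intermediate files so they are not recomputed
use_ancient: True         # only needed if gene-file timestamps were accidentally modified
db_name: custom_gtdb_r95  # keep unchanged to overwrite, or change to build a new database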
Adding existing gene sequences to humann2 databases
If you have gene sequences already formatted for creating a humann2 custom DB, and you'd like to include them with the gene sequences generated from the input genomes, then just provide the file paths to the nuc/prot fasta files (humann2_nuc_seqs and humann2_prot_seqs in the config.yaml file). All genes (from genomes & user-provided) will be clustered together with vsearch. See the vsearch_all: setting in the config.yaml for the default clustering parameters used. You can use vsearch_all: Skip to skip the clustering, in which case all of the sequences will simply be combined without removing redundancies.
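A hedged config.yaml sketch for this option (the file paths are placeholders and the vsearch parameter string is an assumption; keep the values already in your config.yaml unless you need to change them):

humann2_nuc_seqs: /path/to/extra_genes.fna.gz    # user-provided nucleotide gene sequences
humann2_prot_seqs: /path/to/extra_genes.faa.gz   # user-provided protein gene sequences
vsearch_all: --id 0.97    # example clustering parameters; set to Skip to just concatenate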
Utilities
GTDB_metadata_filter.R
This tool is useful for selecting which GTDB genomes to include in a custom database.
Filter >=1 genome assembly metadata file (e.g., bac120_metadata_r89.tsv) by assembly quality or other parameters.
genome_download.R
This tool is useful for downloading genomes from NCBI.
Download a set of genomes based on NCBI assembly accessions provided in a table. The file paths of the downloaded genome fasta files will be appended to the input table.
tree_prune.py
This tool is useful for creating a GTDB archaea/bacteria phylogeny of all genomes in your custom database. The phylogeny can be used for phylogenetic analyses of metagenomes (e.g., Faith's PD or Unifrac).
Prune >=1 phylogeny to just certain taxa. If >1 phylogeny provided, then the phylogenies are merged.
gtdb_to_taxdump
This is a separate repo. It is useful for creating an NCBI-style taxdump (names.dmp and nodes.dmp) from the GTDB taxonomy. Note that the taxIDs are arbitrary and don't match anything in the NCBI taxonomy!
TODO
- Create a diamond DB using diamond >=0.9 so that users can run humann2 with the most up-to-date version of diamond.
  - Note: this will require creating an updated UniRef50 DB.
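For reference, a hedged sketch of building such a DB with a newer diamond release (the UniRef fasta path is a placeholder, and humann2 compatibility with the resulting DB format would still need to be verified):

# build a diamond database from a UniRef fasta with diamond >=0.9
diamond makedb --in uniref50.fasta -d uniref50_v0.9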
Code Snippets
shell:
    """
    # location of the kraken2 db files
    DB=`dirname {input.kraken2_db}`
    # removing existing files possibly created by bracken
    TMP_FILE=$DB"/database.kraken"
    rm -f $TMP_FILE
    {params.exe} -t {threads} -d $DB \
      -k {params.kmer} -l {params.read_len} \
      2> {log} 1>&2
    """

shell:
    """
    gunzip -c {input.fasta} | \
      prodigal {params.params} \
      -o {output.gbk} -d {output.fna} -a {output.faa} \
      2> {log} 1>&2
    """

shell:
    """
    DB="{input.dmnd_db}"
    TMPDIR="{params.tmp_dir}"
    mkdir -p $TMPDIR 2>> {log}
    # diamond run
    diamond blastp {params.params} \
      -q {input.faa} -d $DB -o {output.hits} \
      --tmpdir $TMPDIR --threads {threads} \
      --outfmt 6 qseqid sseqid pident length qstart qend qlen sstart send slen evalue \
      2>> {log} 1>&2
    """

shell:
    """
    OUTDIR=`dirname {output.fna}`
    {params.exe} --gzip --outdir $OUTDIR \
      --columns qseqid,sseqid,pident,length,qstart,qend,qlen,sstart,send,slen,evalue \
      {input.hits} {input.fna} {input.faa} "{params.tax}" {params.taxID} \
      2> {log} 1>&2
    """

shell:
    """
    OUTDIR=`dirname {output.fna}`
    {params.exe} --outdir $OUTDIR \
      --columns qseqid,sseqid,pident,length,qstart,qend,qlen,sstart,send,slen,evalue \
      {input.hits} {input.fna} {input.faa} "{params.tax}" {params.taxID} \
      2> {log} 1>&2
    """

shell:
    """
    # cluster
    if [ "{params.keep_int}" == "True" ]; then
      vsearch {params.params} \
        --threads {threads} \
        --cluster_fast <(gunzip -c {input.fna}) \
        --centroids {output.reps} \
        2> {log} 1>&2
      # filter to just reps (duplicates removed)
      {params.exe} --gzip {output.reps} {input.faa} \
        {output.fna} {output.faa} 2> {log} 1>&2
    else
      vsearch {params.params} \
        --threads {threads} \
        --cluster_fast {input.fna} \
        --centroids {output.reps} \
        2> {log} 1>&2
      # filter to just reps (duplicates removed)
      {params.exe} --gzip {output.reps} {input.faa} \
        {output.fna} {output.faa} 2> {log} 1>&2
    fi
    """

run:
    # concatenate the gzip'ed per-genome files into one uncompressed output file
    import os,gzip
    with open(output[0], 'w') as outF:
        for F in input:
            with gzip.open(F, 'rb') as inF:
                for line in inF:
                    outF.write(line.decode('utf-8'))

run:
    # concatenate the gzip'ed per-genome files into one uncompressed output file
    import os,gzip
    with open(output[0], 'w') as outF:
        for F in input:
            with gzip.open(F, 'rb') as inF:
                for line in inF:
                    outF.write(line.decode('utf-8'))

shell:
    """
    # cluster
    vsearch {params.params} \
      --threads {threads} \
      --cluster_fast {input.fna} \
      --centroids {output.reps} \
      2> {log} 1>&2
    # filter AA seqs to just reps (duplicates removed)
    {params.exe} --gzip {output.reps} {input.faa} \
      {output.fna} {output.faa} 2> {log} 1>&2
    """

shell:
    """
    pigz -p {threads} -c {input.fna} > {output.fna} 2> {log}
    pigz -p {threads} -c {input.faa} > {output.faa} 2>> {log}
    """

shell:
    """
    bowtie2-build --threads {threads} \
      {input} {params.prefix} 2> {log} 1>&2
    # check that output exists
    OUTDIR=`dirname {params.prefix}`
    IDX_FILES=`find $OUTDIR -maxdepth 1 -name "*.bt2*"`
    IDX_FILES=`echo $IDX_FILES | perl -pe 's/ +/\n/g' | wc -l`
    if [ $IDX_FILES -lt 1 ]; then
      echo "ERROR: no bowtie2 index files found!"
      exit 1
    fi
    """

shell:
    """
    PREF=`echo {output} | perl -pe 's/\.[^.]+$//'`
    diamond makedb --in {input} -d $PREF 2> {log} 1>&2
    """

shell:
    """
    cp -f {input.names} {output.names} 2> {log}
    cp -f {input.nodes} {output.nodes} 2>> {log}
    """

shell:
    """
    OUTDIR=`dirname {output.gb}`
    OUTDIR=`dirname $OUTDIR`
    rm -rf $OUTDIR 2> {log}
    mkdir -p $OUTDIR 2>> {log}
    echo "# Downloading NCBI taxonomy to $OUTDIR" >> {log}
    {params.exe} --use-ftp --download-taxonomy --db $OUTDIR 2>> {log} 1>&2
    """

shell:
    """
    {params.exe} {input.fasta} {params.taxID} > {output} 2> {log}
    """

shell:
    """
    DB=`dirname {input.names}`
    DB=`dirname $DB`
    kraken2-build --db $DB --add-to-library {input.fasta} 2> {log} 1>&2
    touch {output} 2>> {log}
    """

shell:
    """
    DB=`dirname {output.hash}`
    kraken2-build --build --threads {threads} --db $DB 2> {log} 1>&2
    """