VDB Shotgun Pipeline
A shotgun metagenomic sequencing processing pipeline.
Prerequisites
- Apptainer/Singularity: while we do provide conda envs in many cases, the only method of execution we support is via containers.
- (optional) A Snakemake Profile: this coordinates the execution of jobs on whatever hardware you are using.
Important Notes:
- Set the location of your profile in the environment variable $SNAKEMAKE_PROFILE (e.g. export SNAKEMAKE_PROFILE=/path/to/your/profile/).
- For the purposes of the examples, we added the --dry-run flag so you can preview the rules to be executed. Remove this flag to execute the commands.
- All database paths are configured in config/config.yaml. Change the paths to reflect where the databases can be found on your machine. For a uniform way to fetch and build all the databases, see https://github.com/vdblab/resources
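Putting the notes together, a typical session might look like the sketch below. The database key shown after --config is illustrative; the real key names are in config/config.yaml.

# Point every invocation below at your Snakemake profile:
export SNAKEMAKE_PROFILE=/path/to/your/profile/

# Any key in config/config.yaml can also be overridden per run via --config
# (the database key name here is illustrative; check config/config.yaml):
snakemake --directory tmpout/ \
--config sample=473 stage=all \
metaphlan_db=/path/to/your/metaphlan/db \
--dry-run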
Main Pipeline
Usage
snakemake \
--directory tmpout/ \
--config \
sample=473 \
R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
nshards=4 \
stage=all \
--dry-run
Outputs
- MultiQC-ready reports
- Microbe relative abundances (MetaPhlAn3, Kraken2)
- Metabolic pathway relative abundances (HUMAnN3)
- Metagenome assemblies (MetaSPAdes)
- AMR profiles with ABRicate and RGI
- MAGs with MetaWRAP (MetaBAT2, CONCOCT, MaxBin2)
- Gene prediction and annotation (MetaErg)
- Secondary metabolite gene clusters (antiSMASH)
- Antimicrobial resistance and virulence genes (ABRicate, AMRFinderPlus)
- Carbohydrate active enzyme (CAZyme) annotation (dbCAN3)
Workflow
The rule DAG for a single sample looks like this:
Different modules of the workflow can be run independently using the stage config entry; the sections below give one example invocation per stage.
MultiQC
Just run MultiQC on a directory; there is no need to use Snakemake:
cp -r tmppre/reports tmpreports
cp tmpassembly/quast/quast_473/report.tsv ./tmpreports/
ver="v1.12"
docker run -v $PWD:$PWD -w $PWD ewels/multiqc:${ver} multiqc \
--config vdb_shotgun/multiqc_config.yaml --force \
--title "a multiqc report for some test data" \
-b "generated by ${ver}" --filename multiqc_report.html \
tmpreports/ --interactive
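Since containers via Apptainer/Singularity are the supported execution route, an equivalent invocation pulling the same image from Docker Hub would be:

singularity exec --bind $PWD docker://ewels/multiqc:${ver} multiqc \
--config vdb_shotgun/multiqc_config.yaml --force \
--title "a multiqc report for some test data" \
-b "generated by ${ver}" --filename multiqc_report.html \
tmpreports/ --interactive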
Preprocessing
snakemake \
--directory tmppreprocess/ \
--config \
sample=473 \
R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
nshards=4 \
dedup_platform=NovaSeq \
stage=preprocess \
--dry-run
Tools used
- SNAP (site | paper: https://doi.org/10.1101/2021.11.23.469039)
- FastQC (site)
- 2-step host removal as described here, extended to use both human and mouse genomes; see the sketch below
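The idea behind the 2-step removal, as a minimal sketch (illustrative commands assuming prebuilt SNAP indexes, not the pipeline's exact rules):

# Step 1: align to the human genome, then keep only pairs where both mates
# are unmapped (-f 12), excluding secondary alignments (-F 256).
snap-aligner paired /path/to/snap/human_index in_R1.fastq in_R2.fastq -o human.bam
samtools fastq -f 12 -F 256 -1 nohuman_R1.fastq -2 nohuman_R2.fastq human.bam

# Step 2: repeat against the mouse genome on the surviving reads.
snap-aligner paired /path/to/snap/mouse_index nohuman_R1.fastq nohuman_R2.fastq -o mouse.bam
samtools fastq -f 12 -F 256 -1 clean_R1.fastq -2 clean_R2.fastq mouse.bam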
Biobakery
snakemake \
--directory tmpbiobakery/ \
--config \
sample=473 \
R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
stage=biobakery \
--dry-run
Tools used
- MetaPhlAn3
- HUMAnN3
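A minimal sketch of the underlying calls (file names are illustrative; note the -s flag, which writes the sam.bz2 alignment that the StrainPhlAn pipeline below consumes):

# MetaPhlAn taxonomic profiling; -s saves the sam.bz2 used later by StrainPhlAn.
metaphlan 473_R1.fastq.gz,473_R2.fastq.gz --input_type fastq \
--bowtie2out 473.bowtie2.bz2 -s 473.sam.bz2 -o 473_metaphlan.txt
# HUMAnN takes a single (concatenated) read file and can reuse the
# MetaPhlAn profile to skip re-profiling.
humann --input 473_concat.fastq.gz --output humann_out/ \
--taxonomic-profile 473_metaphlan.txt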
Kraken2/Bracken
snakemake \
--directory tmpkraken/ \
--config \
sample=473 \
R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
dedup_platform=NovaSeq \
stage=kraken \
--dry-run
Tools used
- Kraken2
- Bracken
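For reference, Bracken's re-estimation step looks roughly like this (database path, report name, and read length are illustrative):

# Re-estimate species-level abundances from a Kraken2 report
# (-r is the read length the database was built with, -l the taxonomic level).
bracken -d /path/to/kraken2_db -i 473.kreport -o 473.bracken -r 150 -l S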
Assembly
snakemake \
--directory tmpassembly/ \
--config \
sample=473 \
R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
stage=assembly \
--dry-run
Tools used
- MetaSPAdes
- QUAST
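The QUAST report consumed by the MultiQC step above comes from an invocation along these lines (a sketch; the output path mirrors the tmpassembly/quast/quast_473/ path used earlier):

# Assembly QC; writes report.tsv (among other files) into the output directory.
quast.py tmpassembly/473.contigs.fasta -o tmpassembly/quast/quast_473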
Annotation
snakemake \
--directory tmpannotate/ \
--config \
sample=473 \
R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
assembly=tmpassembly/473.contigs.fasta \
stage=annotate \
--dry-run
Tools used
- ABRicate (site)
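As a sketch of what this stage runs, an ABRicate screen of the assembled contigs looks like this (the database name is illustrative; abricate --list shows what is installed):

# Screen contigs for resistance/virulence genes against one ABRicate database.
abricate --db ncbi tmpassembly/473.contigs.fasta > 473.abricate.tab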
Binning
snakemake \
--directory tmpbinning/ \
--config \
sample=473 \
R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
assembly=tmpassembly/473.contigs.fasta \
stage=binning \
--dry-run
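This stage wraps the MetaWRAP binning module named in the outputs above; a bare-bones illustrative invocation (not the pipeline's exact rule) would be:

# MetaWRAP runs MetaBAT2, MaxBin2, and CONCOCT on the same assembly;
# note that it requires read files named *_1.fastq / *_2.fastq.
metawrap binning -a tmpassembly/473.contigs.fasta -o binning_out/ \
--metabat2 --maxbin2 --concoct 473_1.fastq 473_2.fastq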
RGI
snakemake \
--directory tmprgi/ \
--config \
sample=473 \
R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
stage=rgi \
--dry-run
Tools used
- RGI (the CARD Resistance Gene Identifier)
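A minimal sketch of an RGI run against CARD (illustrative; rgi main is the contig-level entry point, and a read-level rgi bwt mode also exists):

# Predict resistance genes on assembled contigs using the CARD database.
rgi main --input_sequence tmpassembly/473.contigs.fasta \
--output_file 473_rgi --input_type contig --clean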
Strainphlan Pipeline
This pipeline runs StrainPhlAn for each specified species. StrainPhlAn requires two inputs: sample-level marker pickle files, and species-level markers extracted from the main database. These are stored in a central subdirectory of the MetaPhlAn database directory to aid re-running: if you provide the .sam.bz2 file for a sample that has already been processed into a .pkl file, the pregenerated result is reused.
This workflow accepts as input a list of samples' MetaPhlAn sam.bz2 alignment files and a list of species of interest. The config argument strainphlan_markers_dir serves as a central place for storing both the species- and sample-level marker files; these are specific to a version of the MetaPhlAn database, so we recommend placing that directory within the MetaPhlAn database directory.
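Under the hood, these steps correspond roughly to the standard StrainPhlAn commands (a sketch with an illustrative SGB clade name; the exact rules live in workflow/strainphlan.smk):

# 1. Per-sample consensus markers (the cached .pkl files in strainphlan_markers_dir):
sample2markers.py -i sample1.sam.bz2 -o consensus_markers/ -n 4
# 2. Species-level markers extracted from the MetaPhlAn database:
extract_markers.py -c t__SGB1877 -o db_markers/
# 3. Per-species alignment and tree, keeping markers present in at least the
#    configured number of samples:
strainphlan -s consensus_markers/*.pkl -m db_markers/t__SGB1877.fna \
-o strainphlan_out/ -c t__SGB1877 --marker_in_n_samples 2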
Usage
snakemake \
--snakefile workflow/strainphlan.smk \
--directory tmpstrain/ \
--config \
sams=[path/to/sample1.sam.bz2,path/to/sample2.sam.bz2] \
strainphlan_markers_dir=/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/marker_outputs/ \
metaphlan_db=/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/ \
marker_in_n_samples=2 \
--dry-run
Outputs
For each input species:
- Multiple sequence alignment of strains detected in samples
- Phylogenetic tree of strains detected in samples
Workflow
The rule DAG for two example input species looks like this:
Testing and Development
Please see development.md.
Code Snippets
Helper scripts invoked via Snakemake's script: directive include:
- ../scripts/parse_antismash_gbk.py
- ../scripts/merge_logs.py
- ../scripts/plot_RGI_heatmap.R