Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Table of Contents
Background and hypothesis
The recent advances in high-throughput sequencing have given us increasing evidence of associations between the microbiome and human cancers [1] . While many standalone platforms and R packages such as QIIME2 , DADA2 , phyloseq , and vegan have been published and widely used to conduct microbiome analysis, limited workflows allow integration of these tools. This pipeline aims to use Snakemake , QIIME2 and phyloseq to build a flexible and automated workflow to perform reproducible, in-depth analysis on any paired-end V4 16S rRNA microbiome data with two groups of interest and another variable that could account for variance in the dataset.
The dataset used in this pipeline was first described by Tsementzi et al. (2020)
[2]
. The objective of this study was to compare vaginal microbiota in gynecologic cancers (endometrial/cervical) patients pre- and post- radiation therapy and healthy women. For this project, the dataset was randomly subsampled to include five pre- and five post- radiation therapy (two groups of interest) from cervical cancer patients of different ethnicity (another variable). The hypothesis to test is that the microbiome composition is different within the two groups of interest,nd a proportion of the variance is explained by an additional variable (ethnicity). The workflow aims to provide a list of plots that describe rarefaction curves, alpha and diversity, the relative abundance of organisms in the two groups, hierarchical clustering, and phylogenetic trees. A complete list of accession IDs and relevant metadata of the dataset is available in
00-helperfiles/metadata.txt
.
Dependencies
The pipeline is built using snakemake and conda and it will be assumed that users have git , conda , and Snakemake installed. The main dependencies of this workflow include:
-
sra-toolkit=2.11.0
-
qiime2=2021.8.0
-
fastqc=0.11.9
-
multiqc=1.11
-
r-base=4.0.5
-
phyloseq=1.34.0
-
r-vegan=2.5_7
Other minor dependencies are listed in
spec-file.txt
.
Workflow
An overview of the workflow could be represented using a Directed Acyclic graph as shown below:
The entire workflow with the described dataset takes ~1.5 hours to run. The main steps of the workflow include:
-
Download raw data : This step needs a list of accession numbers to download raw paired-end data from the Sequence Read Archive (SRA) using
pre-fetch
andfasterq-dump
commands from thesra-toolkit
. -
Initial quality control :
FastQC
andMultiQC
analyze fastq files for quality checks. -
Import data into QIIME2 : The fastq files will then be imported into a QIIME artifact (qza). After importing data, the pipeline will remove all raw data files downloaded from the SRA to optimize memory usage. Therefore, these raw fastq files will not be accessible after QIIME2 imports the data.
-
Quality control using QIIME2 : The paired-end data artifact generated is then summarized using a QIIME2 visualization (qzv).
-
Generation of ASVs : Using the DADA2 plugin in QIIME2, the paired-end data is denoised and generates an amplicon sequence variant (ASV) table.
-
Taxonomic classification : The ASV table from DADA2 is passed into a QIIME2 pre-built Naive Bayes classifier to assign taxonomic labels to the generated ASVs. This pre-built classifier is trained on Greengenes 13_8 99% OTUs full-length sequences.
-
Phylogenetics : The representative sequences for the ASVs then undergo sequence alignment using then MAFFT plugin in QIIME2 to generate aligned sequences, unrooted, and rooted trees in Newick tree format.
-
Visual analysis : R packages such as Phyloseq and vegan use the ASV table, representative sequences, rooted tree, and taxonomic classification table to demonstrate differences/similarities in microbiome composition between samples.
Usage
- Clone this repository move into the cloned directory.
git clone https://github.com/dollinad/BIOF501TermProject.git
cd BIOF501termProject
- Create and activate a conda environment named microbiome using a specification file. This environment will consist of the required software and packages for the pipeline to run.
conda create --name microbiome --file spec-file.txt
conda activate microbiome
- Finally, run the pipeline using the following command that allows parallelization of tasks by assigning four cores for the workflow to use.
snakemake --use-conda --cores 4
In one chunk, this is:
git clone https://github.com/dollinad/BIOF501TermProject.git
cd BIOF501termProject
conda create --name microbiome --file spec-file.txt
conda activate microbiome
snakemake --use-conda --cores 4
Input
The workflow needs the following files in the
00-helperfiles
as input:
-
accessionList.txt
: A list of run accessions to download from SRA.SRR6920043 SRR6920044 ''' -
metadata.txt
: A tab-delimited text file containing three columns: SRA run ID, condition 1, and variable 1.id gynecologic_disord ethnicity SRR6920043 post-radiotherapy African-american SRR6920044 pre-radiotherapy African-american ''' ''' ''' -
manifest.txt
: A tab-delimited text file containing three columns: SRA run ID, the relative path for forward and reverse reads.sample-id forward-absolute-filepath reverse-absolute-filepath SRR6920043 01-data/SRR6920043_1.fastq 01-data/SRR6920043_2.fastq SRR6920044 01-data/SRR6920044_1.fastq 01-data/SRR6920044_2.fastq ''' ''' ''' -
gg-13-8-99-nb-classifier.qza
: A pre-built classifier from QIIME2.
***Note:****Even though QIIME2 and the column name in
manifest.txt
need absolute paths to forward and reverse reads, it is sufficient to provide a relative path as the pipeline will automatically change this file to include an absolute path.
Expected output
The visual results from the pipeline can be found in the folder
05-visuals/
and are generated using the
R_analysis.R
script under the
scripts/
folder. The pipeline will generate the following results:
-
Quality control: Quality checks were performed using FastQC and MultiQC and the results from these tools could be found in
02/fastqc/
and03/multiqc/
, respectively. Both FastQC and MultiQC generate an html report that could be viewed through a browser. Additional quality control was performed using QIIME2 and a corresponding QIIME visualization was generated which could be found in05-visuals/
. To view a QIIME2 visual, use the following command:
qiime tools view paired-end-data.qzv
The visualization should open in the default browser. As shown in the example below, this visual helps users to trim reads keepng only high quality bases by remving primers and other sequencing artifacts.
-
05-visuals/01-rarefaction.png
: Shows the number of species observed at various sequence depths.
-
05-visuals/02-alpha-diversity.png
: Generated using raw ASV table (not rarefied) and displays differences between the two groups (pre- and post- radiation). The figure shows that the number of species (species richess) is different for pre- and post- radiation groups.
-
05-visuals/relative_abundance.png
: Generated using rarefied data at 90% of minimum sample depth and shows the relative abundance of phylum in the two groups. The figure shows that there is an enrichment of Tenericutes and Fusobacteria in pre-radiation and an increase of Spirochaetes post-radiation.
-
05-visuals/04-beta_diversity.png
: Multidimensional scaling (MDS) on various distance metrics to measure the dissimilarity between the pre- and post- radiation samples.
-
05-visuals/05-heirarchal_clustering.png
: Generated using the Bray-Curtis distance method and ard's linkage method for heirarchal cluster analysis. The figure was generated using teh vegan package in R.
-
05-visuals/06-phylogenetics_tree.png
: A phylogenetic tree using fifty most abundant taxa.
Other intermediate output files that could be used for additional downstream analysis include
04-qiime2/table.qza
(ASV table),
04-qiime2/rep-seqs.qza
(representative sequences for ASVs),
04-qiime2/aligned-rep-seqs.qza
,
04-qiime2/rooted-tree.qza
,
04-qiime2/unrooted-tree.qza
.
Since the above-mentioned files are QIIME2 artifacts, they need to be exported from the compressed object before analysis. For example, to export the ASV table use the following command.
qiime tools export --input-path 04-qiime2/table.qza --output-path outputDir/
References
-
Yang, Jiqiao, et al. "Gastrointestinal microbiome and breast cancer: correlations, mechanisms and potential clinical implications." Breast Cancer 24.2 (2017): 220-228.
-
Tsementzi, Despina, et al. "Comparison of vaginal microbiota in gynecologic cancer patients pre‐and post‐radiation therapy and healthy women." Cancer medicine 9.11 (2020): 3714-3724.
Code Snippets
71 72 | shell: "fastqc {input} -o {fastqcOut} -q -t 4" |
84 85 | wrapper: "v0.80.1/bio/multiqc" |
104 105 106 107 108 109 110 | shell: """ qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \ --input-path {input.manFile} --output-path {output} \ --input-format PairedEndFastqManifestPhred33V2 rm -r '01-data/' && touch rawDataRemoved.txt """ |
121 122 123 124 125 | shell: """ qiime demux summarize --i-data {input} \ --o-visualization {output} """ |
148 149 150 151 152 153 154 155 | shell: "qiime dada2 denoise-paired --i-demultiplexed-seqs {input.data} \ --p-n-threads 8 \ --p-trim-left-f 20 --p-trim-left-r 20 \ --p-trunc-len-f 275 --p-trunc-len-r 275 \ --o-table {output.table} \ --o-representative-sequences {output.rep_seqs} \ --o-denoising-stats {output.denoise_stats}" |
169 170 171 172 173 174 175 176 | shell: "qiime phylogeny align-to-tree-mafft-fasttree \ --p-n-threads 8 \ --i-sequences {input} \ --o-alignment {output.aligned_rep_seqs} \ --o-masked-alignment {output.masked_aligned_rep_seqs} \ --o-tree {output.unrooted_tree} \ --o-rooted-tree {output.rooted_tree}" |
189 190 191 192 193 | shell: "qiime feature-classifier classify-sklearn \ --i-classifier {input.inClassifier} \ --i-reads {input.reads} \ --o-classification {output}" |
206 207 208 209 210 211 212 213 214 215 216 | shell: """ qiime tools export --input-path {input.table} --output-path {output} biom convert -i {output}/feature-table.biom -o {output}/otu_table.tsv --to-tsv && cd {output} sed -i .bak '1d' otu_table.tsv sed -i .bak 's/#OTU ID//' otu_table.tsv cd ../ qiime tools export --input-path {input.rep_seqs} --output-path {output} qiime tools export --input-path {input.tax} --output-path {output} qiime tools export --input-path {input.tree} --output-path {output} """ |
232 233 | shell: "Rscript {input.script} > Rlog.txt" |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from os import path from snakemake.shell import shell input_dirs = set(path.dirname(fp) for fp in snakemake.input) output_dir = path.dirname(snakemake.output[0]) output_name = path.basename(snakemake.output[0]) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "multiqc" " {snakemake.params}" " --force" " -o {output_dir}" " -n {output_name}" " {input_dirs}" " {log}" ) |
Support
- Future updates
Related Workflows





