This is the Cancer Genomics Research Laboratory's (CGR) microbiome analysis pipeline. The pipeline uses QIIME2 to classify sequence data, calculate relative abundance, and perform alpha- and beta-diversity analyses.
How to run
Input requirements
- Manifest file
  - For external (non-CGR-produced data) runs, the following columns are required:
    #SampleID Run-ID Project-ID fq1 fq2
  - For internal runs, the following columns are required:
    #SampleID Run-ID Project-ID
  - See the template manifest files in this repo for examples; a hypothetical example also follows this list
- config.yaml
- (for production runs) run_pipeline.sh
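For illustration, a minimal external-run manifest might look like the following; every sample name and path is an invented placeholder, and the columns are tab-separated in the actual file:
#SampleID    Run-ID    Project-ID    fq1                               fq2
Sample-001   RUN001    NP0001        /path/to/Sample-001_R1.fastq.gz   /path/to/Sample-001_R2.fastq.gz
Water-01     RUN001    NP0001        /path/to/Water-01_R1.fastq.gz     /path/to/Water-01_R2.fastq.gz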
Options to run the pipeline (choose one)
A. Production run: copy run_pipeline.sh and config.yaml to your directory, edit as needed, then execute the script (see the sketch after option B).
B. For dev/testing only: copy and edit config.yaml, then run the snakefile directly, e.g.:
module load slurm singularity perl/5.36 bbmap python3/3.10.2
conf=${PWD}/config.yaml snakemake -s /path/to/pipeline/Snakefile
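For option A, the setup might look like this (a minimal sketch; /path/to/pipeline is a placeholder):
# copy the launcher and config into your working directory
cp /path/to/pipeline/run_pipeline.sh /path/to/pipeline/config.yaml .
# edit config.yaml and run_pipeline.sh as needed, then:
bash run_pipeline.sh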
Configuration details
- metadata_manifest: full path to the manifest file
- out_dir: full path to the desired output directory (note that CGR production runs are stored at /DCEG/Projects/Microbiome/Analysis/)
- exec_dir: full path to the pipeline (e.g. the Snakefile location)
- fastq_abs_path: full path to the fastqs
- temp_dir: full path to temp/scratch space
- qiime2_sif: singularity image for qiime2_2019.1
- reference_db: one or more classifiers to be used for taxonomic classification; be sure to match trained classifiers to the correct QIIME2 version
- cluster_mode: options are a cluster submission command ('sbatch ...', etc.), 'local', 'dryrun', and 'unlock'
  - Example for ccad2:
    'sbatch -p bigmemq --mem=100g --time=24:00:00 --output=/path/to/project/directory/logs/slurm-%j.out --cpus-per-task={threads}'
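Putting the options above together, a config.yaml might look like the following sketch; every path and classifier below is a placeholder, and the exact key set should be taken from the template config.yaml in this repo:
metadata_manifest: /path/to/manifest.txt
out_dir: /path/to/output
exec_dir: /path/to/pipeline
fastq_abs_path: /path/to/fastqs
temp_dir: /path/to/scratch
qiime2_sif: /path/to/qiime2_2019.1.sif
reference_db:
  - /path/to/gg-13-8-99-515-806-nb-classifier.qza
  - /path/to/silva-132-99-515-806-nb-classifier.qza
cluster_mode: 'sbatch -p bigmemq --mem=100g --time=24:00:00 --output=/path/to/project/directory/logs/slurm-%j.out --cpus-per-task={threads}'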
Workflow summary
- Manifest management:
  - The manifest provided in config.yaml is checked for compliance with QIIME2 specifications
  - Per-flow-cell manifests are created
  - Fastqs are symlinked to be viewable by DCEG PIs
- Import and demultiplex fastqs
- For external data only, fix any unmatched paired reads (these can be generated by the fastq QC process)
- Denoise
- Merge feature and sequence tables across flow cells; drop samples with zero reads
- Build a multiple sequence alignment, then build rooted and unrooted phylogenetic trees
- Perform alpha- and beta-diversity analysis, rarefaction, and taxonomic classification
Example output directory structure
- Within the parent directory <out_dir>/ defined in config.yaml (TODO: needs updating!):
.
├── LICENSE.md
├── README.md
├── bacteria_only
│ ├── feature_tables
│ └── sequence_tables
├── config
│ └── config.yaml
├── config.yaml
├── denoising
│ ├── feature_tables
│ ├── sequence_tables
│ └── stats
├── diversity_core_metrics
├── fastqs
├── import_and_demultiplex
├── logs
│ ├── slurm-136008.out
│ ├── ...
│ └── slurm-136610.out
├── manifests
│ └── manifest_qiime2.tsv
├── phylogenetics
│ ├── masked_msa.qza
│ ├── msa.qza
│ ├── rooted_tree.qza
│ └── unrooted_tree.qza
├── read_feature_and_sample_filtering
│ ├── feature_tables
│ └── sequence_tables
├── report
├── run_pipeline.sh
├── run_times
├── snakemake-136007.err
├── snakemake-136007.out
├── taxonomic_classification
│ ├── gg-13-8-99-515-806-nb-classifier
│ │ ├── barplots.qzv
│ │ ├── barplots_data_files
│ │ │ └── taxonomy.tsv
│ │ ├── taxa.qza
│ │ └── taxa.qzv
│ └── silva-132-99-515-806-nb-classifier
│ ├── barplots.qzv
│ ├── barplots_data_files
│ │ └── taxonomy.tsv
│ ├── taxa.qza
│ └── taxa.qzv
├── taxonomic_classification_bacteria_only
│ ├── gg-13-8-99-515-806-nb-classifier
│ │ ├── barplots.qzv
│ │ ├── barplots_data_files
│ │ │ └── taxonomy.tsv
│ │ └── taxa.qza
│ └── silva-132-99-515-806-nb-classifier
│ ├── barplots.qzv
│ ├── barplots_data_files
│ │ └── taxonomy.tsv
│ └── taxa.qza
└── workflow
    ├── Snakefile
    ├── rules
    └── scripts
        ├── Q2Manifest.pl
        ├── Q2Manifest.t
        └── qQ2_wrapper.sh
QC report
Input requirements
The manifest requirements for the report are the same as for the pipeline itself; however, including the following in your manifest will result in a more informative QC report:
- To be detected as blanks, water blanks and no-template control sample IDs must contain the string "water" or "ntc" in the "#SampleID" column; matching is not case sensitive (see the sanity-check sketch after this list).
- "Source PCR Plate" column: the header is case-insensitive and may be written with or without spaces. For the report, only the characters preceding the first underscore in this column are preserved; this strips CGR's well ID from the string.
- "Sample Type" column: the header is case-insensitive and may be written with or without spaces. Populate with any string values that define useful categories, e.g. water, NTC_control, blank, study sample, qc, etc.
- "External ID" column: the header is case-insensitive and may be written with or without spaces. This column is used at CGR to map the IDs of received samples to the LIMS-generated IDs used internally. For technical replicates, the sample ID will be unique (as required by QIIME2), but the external ID can be the same to link the samples for comparison in the report.
- Note that the report assumes the sequencer ID is the second underscore-delimited field in the run ID; this may not be meaningful if your run ID does not conform.
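As a quick command-line sanity check of the conventions above (this is not part of the report itself; the manifest.tsv filename and column positions are assumptions for illustration):
# Count water/NTC blanks: case-insensitive match on the #SampleID column (field 1)
awk -F'\t' 'NR>1 && tolower($1) ~ /water|ntc/' manifest.tsv | wc -l
# Preview plate IDs as the report derives them: the text before the first underscore
# in the "Source PCR Plate" column, assumed here to be field 6
awk -F'\t' 'NR>1 {split($6, a, "_"); print a[1]}' manifest.tsv | sort -u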
Running the report
For CGR production runs, after the pipeline completes, generate a QC report using the Jupyter notebook in the report/ directory.
- Open the notebook (see below for details) and change the proj_dir variable in section 1.1
- Run the complete notebook
- Save the report as html with the code hidden (see below for details)
Running jupyter notebooks at CGR
To run jupyter notebooks on the CGR HPC, log in to the cluster, navigate to the notebook, then run the following. You can set the port to anything above 8000.
module load python3
jupyter notebook --no-browser --port=8080
Then, on your local machine, run the following to open up an ssh tunnel:
ssh -N -L 8080:localhost:8080 <username@cluster.domain>
Finally, open your browser to the URL given by the jupyter notebook command above (http://localhost:8080/?token=<token>).
To save the report
- Clone the following repo: https://github.com/ihuston/jupyter-hide-code-html
- Run in terminal:
  jupyter nbconvert --to html --template jupyter-hide-code-html/clean_output.tpl path/to/CGR_16S_Microbiome_QC_Report.ipynb
- Name the resulting file NP###_pipeline_run_folder_QC_report.html and place it in the directory with the pipeline output; the full sequence is sketched below
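Taken together, the save steps might look like this in a shell (run from the directory containing the notebook; the NP### run-folder name is a placeholder as above):
git clone https://github.com/ihuston/jupyter-hide-code-html
jupyter nbconvert --to html --template jupyter-hide-code-html/clean_output.tpl CGR_16S_Microbiome_QC_Report.ipynb
# nbconvert writes CGR_16S_Microbiome_QC_Report.html next to the notebook
mv CGR_16S_Microbiome_QC_Report.html NP###_pipeline_run_folder_QC_report.html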
For development
Jupyter notebooks are stored as JSON with metadata, which can result in very messy diffs/merge requests in version control. Please do the following to create a clean Python script version of the notebook to be used in diffs.
- After making and testing changes, Kernel > Restart & Clear Output
- Run in terminal:
  jupyter nbconvert --to script CGR_16S_Microbiome_QC_Report.ipynb
- Add and commit both CGR_16S_Microbiome_QC_Report.ipynb AND CGR_16S_Microbiome_QC_Report.py, as sketched below
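A shell sketch of the last two steps (the commit message is illustrative):
jupyter nbconvert --to script CGR_16S_Microbiome_QC_Report.ipynb
git add CGR_16S_Microbiome_QC_Report.ipynb CGR_16S_Microbiome_QC_Report.py
git commit -m "Update QC report notebook"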
Notes
- Samples are processed at the flow-cell level due to DADA2's requirements: the algorithm that DADA2 uses includes an error model that assumes a single sequencing run. The pitfall of merging runs before DADA2 is that a lower-quality run (one that still passes threshold) may have samples thrown out because their quality is significantly lower than that of a high-performing run.
- After creating the QIIME2 manifest file, Keemei (https://keemei.qiime2.org) can be used from Google Chrome to verify that the manifest is in the correct format.
Code Snippets
# Normalize manifest line endings, then check it for QIIME2 compliance
shell:
    'dos2unix -n {input} {output};'
    'perl {params.e}workflow/scripts/Q2Manifest.pl {output}'
# Symlink the paired fastqs into the project directory
shell:
    'ln -s {input.fq1} {output.sym1};'
    'ln -s {input.fq2} {output.sym2}'
# Rewrite fastq headers that contain spaces, keeping the original header in an orig_header= tag
shell:
    'if [[ $(zcat {input} | head -n1 | cut -f1 -d\":\") =~ " " ]]; then \
    zcat {input} | awk \'{{if (NR % 4 == 1) {{n=split($0, arr, " "); split(arr[2],tag,":"); printf "@%s ", arr[2]; for (i=3; i<=n; i++) printf "%s ",arr[i]; printf "orig_header=@%s %s\\n",substr(arr[1],2,length(arr[1])-1),tag[1]}} else {{print $0}}}}\' | gzip -c > {output}; \
    else \
    ln -s {input} {output}; \
    fi'
# Rewrite fastq headers that contain spaces (same command, separate rule)
shell:
    'if [[ $(zcat {input} | head -n1 | cut -f1 -d\":\") =~ " " ]]; then \
    zcat {input} | awk \'{{if (NR % 4 == 1) {{n=split($0, arr, " "); split(arr[2],tag,":"); printf "@%s ", arr[2]; for (i=3; i<=n; i++) printf "%s ",arr[i]; printf "orig_header=@%s %s\\n",substr(arr[1],2,length(arr[1])-1),tag[1]}} else {{print $0}}}}\' | gzip -c > {output}; \
    else \
    ln -s {input} {output}; \
    fi'
# Re-pair any unmatched reads with bbmap's repair.sh, then compress the outputs
shell:
    'repair.sh in1={input.fq1} in2={input.fq2} out1={params.fq1} out2={params.fq2} outs={params.single} repair;'
    'gzip {params.fq1};'
    'gzip {params.fq2};'
    'gzip {params.single}'
# Write the per-sample QIIME2 manifest lines for forward and reverse reads
shell:
    'echo "{wildcards.sample},{input.fq1},forward,{params.runID}" > {output};'
    'echo "{wildcards.sample},{input.fq2},reverse,{params.runID}" >> {output}'
# Concatenate the per-sample manifests
shell:
    'find {params} -maxdepth 1 -name \'*Q2_manifest_by_sample.txt\' | xargs cat > {output}'
# Split the combined manifest into per-flow-cell manifests with QIIME2 headers
shell:
    'awk \'BEGIN{{FS=OFS=","; print "sample-id,absolute-filepath,direction"}}$4=="{wildcards.runID}"{{print $1,$2,$3}}\' {input} > {output}'
# Import fastqs into a QIIME2 artifact
shell:
    'qiime tools import \
    --type {params.in_type} \
    --input-path {input} \
    --output-path {output} \
    --input-format {params.in_format}'
# Summarize the demultiplexed reads
shell:
    'qiime demux summarize \
    --i-data {input} \
    --o-visualization {output}'
# Denoise paired-end reads per flow cell with DADA2
shell:
    'qiime dada2 denoise-paired \
    --verbose \
    --p-n-threads {threads} \
    --i-demultiplexed-seqs {input.qza} \
    --o-table {output.features} \
    --o-representative-sequences {output.seqs} \
    --o-denoising-stats {output.stats} \
    --p-trim-left-f {params.trim_l_f} \
    --p-trim-left-r {params.trim_l_r} \
    --p-trunc-len-f {params.trun_len_f} \
    --p-trunc-len-r {params.trun_len_r} \
    --p-min-fold-parent-over-abundance {params.min_fold}'
# Render a metadata file (e.g. denoising stats) as a QIIME2 visualization
shell:
    'qiime metadata tabulate \
    --m-input-file {input} \
    --o-visualization {output}'
# Merge per-flow-cell feature tables, or copy the single table when there is only one run
shell:
    """
    total_runs=`awk -F "\\t" -v col=Run-ID 'NR==1{{for(i=1;i<=NF;i++){{if($i==col){{c=i;break}}}} print $c}} NR>1{{print $c}}' {input.q2_manifest} \
    |sed 1d |sort|uniq|wc -l`;
    if [ $total_runs -eq 1 ]; then
        cp {input.feature_tables} {output}
    else
        tables=`awk -F "\\t" -v col=Run-ID 'NR==1{{for(i=1;i<=NF;i++){{if($i==col){{c=i;break}}}} print $c}} \
        NR>1{{print "--i-tables denoising/feature_tables/"$c".qza"}}' {input.q2_manifest} |sed 1d |sort|uniq`;
        qiime feature-table merge $tables --o-merged-table {output}
    fi
    """
# Merge per-flow-cell sequence tables, or copy the single table when there is only one run
shell:
    """
    total_runs=`awk -F "\\t" -v col=Run-ID 'NR==1{{for(i=1;i<=NF;i++){{if($i==col){{c=i;break}}}} print $c}} NR>1{{print $c}}' {input.q2_manifest} \
    |sed 1d |sort|uniq|wc -l`;
    echo $total_runs;
    if [ $total_runs -eq 1 ]; then
        cp {input.seqs} {output}
    else
        tables=`awk -F "\\t" -v col=Run-ID 'NR==1{{for(i=1;i<=NF;i++){{if($i==col){{c=i;break}}}} print $c}} \
        NR>1{{print "--i-data denoising/sequence_tables/"$c".qza"}}' {input.q2_manifest} |sed 1d |sort|uniq`;
        echo $tables
        qiime feature-table merge-seqs $tables --o-merged-data {output}
    fi
    """
# Drop samples below a minimum total frequency
shell:
    'qiime feature-table filter-samples \
    --i-table {input} \
    --p-min-frequency {params.f} \
    --o-filtered-table {output}'
# Drop features below a minimum total frequency
shell:
    'qiime feature-table filter-features \
    --i-table {input} \
    --p-min-frequency {params.f} \
    --o-filtered-table {output}'
# Drop features observed in fewer than a minimum number of samples
shell:
    'qiime feature-table filter-features \
    --i-table {input} \
    --p-min-samples {params.f} \
    --o-filtered-table {output}'
# Drop samples containing fewer than a minimum number of features
shell:
    'qiime feature-table filter-samples \
    --i-table {input} \
    --p-min-features {params.f} \
    --o-filtered-table {output}'
# Summarize each of the four filtered feature tables
shell:
    'qiime feature-table summarize \
    --i-table {input.qza1} \
    --o-visualization {output.qzv1} \
    --m-sample-metadata-file {input.q2_manifest} && \
    qiime feature-table summarize \
    --i-table {input.qza2} \
    --o-visualization {output.qzv2} \
    --m-sample-metadata-file {input.q2_manifest} && \
    qiime feature-table summarize \
    --i-table {input.qza3} \
    --o-visualization {output.qzv3} \
    --m-sample-metadata-file {input.q2_manifest} && \
    qiime feature-table summarize \
    --i-table {input.qza4} \
    --o-visualization {output.qzv4} \
    --m-sample-metadata-file {input.q2_manifest}'
# Subset the sequence table to match each filtered feature table
shell:
    'qiime feature-table filter-seqs --i-data {input.seq_table} --i-table {input.feat1} --o-filtered-data {output.seq1} && \
    qiime feature-table filter-seqs --i-data {input.seq_table} --i-table {input.feat2} --o-filtered-data {output.seq2} && \
    qiime feature-table filter-seqs --i-data {input.seq_table} --i-table {input.feat3} --o-filtered-data {output.seq3} && \
    qiime feature-table filter-seqs --i-data {input.seq_table} --i-table {input.feat4} --o-filtered-data {output.seq4}'
# Tabulate each filtered sequence table
shell:
    'qiime feature-table tabulate-seqs \
    --i-data {input.qza1} \
    --o-visualization {output.qzv1} && \
    qiime feature-table tabulate-seqs \
    --i-data {input.qza2} \
    --o-visualization {output.qzv2} && \
    qiime feature-table tabulate-seqs \
    --i-data {input.qza3} \
    --o-visualization {output.qzv3} && \
    qiime feature-table tabulate-seqs \
    --i-data {input.qza4} \
    --o-visualization {output.qzv4}'
# Tabulate the representative sequences
shell:
    'qiime feature-table tabulate-seqs \
    --i-data {input} \
    --o-visualization {output}'
# Summarize a feature table alongside the sample metadata
shell:
    'qiime feature-table summarize \
    --i-table {input.qza} \
    --o-visualization {output} \
    --m-sample-metadata-file {input.q2_manifest}'
# Classify sequences taxonomically with a trained classifier
shell:
    'qiime feature-classifier {params.c_method} \
    --p-n-jobs {threads} \
    --i-classifier {input.ref} \
    --i-reads {input.seqs} \
    --o-classification {output}'
# Classify sequences taxonomically with a trained classifier (same command, separate rule)
shell:
    'qiime feature-classifier {params.c_method} \
    --p-n-jobs {threads} \
    --i-classifier {input.ref} \
    --i-reads {input.seqs} \
    --o-classification {output}'
# Export the taxonomy, strip stray spaces before tabs, and re-import the cleaned file
shell:
    "qiime tools export --input-path {input} --output-path {params} && \
    sed 's/ \t/\t/' {output.o1} > {output.o2} && \
    qiime tools import --type 'FeatureData[Taxonomy]' --input-path {output.o2} --output-path {output.o3}"
# Render the taxonomy table as a visualization
shell:
    'qiime metadata tabulate \
    --m-input-file {input} \
    --o-visualization {output}'
# Build taxonomic bar plots
shell:
    'qiime taxa barplot \
    --i-table {input.seqs} \
    --i-taxonomy {input.tax} \
    --m-metadata-file {input.manifest} \
    --o-visualization {output}'
# Build taxonomic bar plots (same command, separate rule)
shell:
    'qiime taxa barplot \
    --i-table {input.seqs} \
    --i-taxonomy {input.tax} \
    --m-metadata-file {input.manifest} \
    --o-visualization {output}'
# Export the taxonomy artifact and the bar plot data
shell:
    'qiime tools export --input-path {input.taxonomy_qza} --output-path {params.d}; \
    qiime tools export --input-path {input.taxonomy_bar_plots} --output-path {params.d}'
# Keep bacterial features (the include list matches both SILVA- and Greengenes-style labels)
shell:
    'qiime taxa filter-table \
    --i-table {input.seqs} \
    --i-taxonomy {input.tax} \
    --p-include "D_0__Bacteria;D_1,k__Bacteria; p__" \
    --o-filtered-table {output}'
# Remove features classified no deeper than kingdom (exact match on "k__Bacteria; p__")
shell:
    'qiime taxa filter-table \
    --i-table {input.features} \
    --i-taxonomy {input.tax} \
    --p-mode exact \
    --p-exclude "k__Bacteria; p__" \
    --o-filtered-table {output}'
# Subset the sequences to the bacteria-only feature set
shell:
    'qiime feature-table filter-seqs \
    --i-data {input.seqs} \
    --i-table {input.bacterial_features} \
    --o-filtered-data {output}'
# Summarize the bacteria-only feature table and tabulate its sequences
shell:
    'qiime feature-table summarize \
    --i-table {input.features} \
    --o-visualization {output.features} \
    --m-sample-metadata-file {input.q2_manifest} && \
    qiime feature-table tabulate-seqs \
    --i-data {input.seqs} \
    --o-visualization {output.seqs}'
# Build the MSA, masked MSA, and unrooted/rooted trees (MAFFT + FastTree)
shell:
    'qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences {input} \
    --o-alignment {output.msa} \
    --o-masked-alignment {output.masked_msa} \
    --o-tree {output.unrooted_tree} \
    --o-rooted-tree {output.rooted_tree}'
# Compute core alpha- and beta-diversity metrics at the configured sampling depth
shell:
    'qiime diversity core-metrics-phylogenetic \
    --i-phylogeny {input.rooted_tree} \
    --i-table {input.features} \
    --p-sampling-depth {params.samp_depth} \
    --m-metadata-file {input.q2_manifest} \
    --o-rarefied-table {output.rare} \
    --o-faith-pd-vector {output.faith} \
    --o-observed-otus-vector {output.obs} \
    --o-shannon-vector {output.shan} \
    --o-evenness-vector {output.even} \
    --o-unweighted-unifrac-distance-matrix {output.unw_dist} \
    --o-unweighted-unifrac-pcoa-results {output.unw_pcoa} \
    --o-unweighted-unifrac-emperor {output.unw_emp} \
    --o-weighted-unifrac-distance-matrix {output.w_dist} \
    --o-weighted-unifrac-pcoa-results {output.w_pcoa} \
    --o-weighted-unifrac-emperor {output.w_emp} \
    --o-jaccard-distance-matrix {output.jac_dist} \
    --o-jaccard-pcoa-results {output.jac_pcoa} \
    --o-jaccard-emperor {output.jac_emp} \
    --o-bray-curtis-distance-matrix {output.bc_dist} \
    --o-bray-curtis-pcoa-results {output.bc_pcoa} \
    --o-bray-curtis-emperor {output.bc_emp}'
# Combine the alpha-diversity vectors into a single visualization
shell:
    'qiime metadata tabulate \
    --m-input-file {input.obs} \
    --m-input-file {input.shan} \
    --m-input-file {input.even} \
    --m-input-file {input.faith} \
    --o-visualization {output}'
# Alpha rarefaction up to the configured maximum depth
shell:
    'qiime diversity alpha-rarefaction \
    --i-table {input.features} \
    --i-phylogeny {input.rooted} \
    --p-max-depth {params.m_depth} \
    --m-metadata-file {input.q2_manifest} \
    --o-visualization {output}'
# Export the feature table (with a TSV via biom convert) and the representative sequences
shell:
    'qiime tools export --input-path {input.table_dada2_qza} --output-path {params.out1}; \
    biom convert -i {output.table_dada2_biom} -o {output.table_dada2_biom_tsv} --to-tsv; \
    qiime tools export --input-path {input.repseq_dada2_qza} --output-path {params.out2}'
# Export the feature table and representative sequences (same command, separate rule)
shell:
    'qiime tools export --input-path {input.table_dada2_qza} --output-path {params.out1}; \
    biom convert -i {output.table_dada2_biom} -o {output.table_dada2_biom_tsv} --to-tsv; \
    qiime tools export --input-path {input.repseq_dada2_qza} --output-path {params.out2}'
Support
- Future updates