vRAPID: Virus Reference-based Assembly and Identification Pipeline

vRAPID
The Virus Reference-based Assembly Pipeline and IDentification (vRAPID) tool runs the assembly, consensus calling and annotation of viral pathogen genomes.
vRAPID relies on data structures and naming conventions used by the Center for Research on Influenza Pathogenesis and Transmission (CRIPT) and the Mount Sinai Pathogen Surveillance Program (MS-PSP) at the Icahn School of Medicine at Mount Sinai. vRAPID is an expansion of the COVID_pipe pipeline that was written by Mitchell Sullivan in the Bakel Lab.
Prepare the snakemake conda environment
Installation of the required external software packages is largely handled by the pipeline itself; however, a conda environment named 'snakemake' needs to be present on your system. We recommend miniconda, a free minimal installer for conda. Follow the instructions below to start the miniconda installer on Linux. When asked whether the conda environment should automatically be initialized, select 'yes'. Note that Snakemake requires channel_priority to be set to strict; the post-installation commands to apply this setting are included below.
# Start miniconda installation
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
# Post-installation commands to enforce strict channel_priority (required for Snakemake)
conda config --set auto_activate_base false
conda config --set channel_priority strict
On the Mount Sinai Minerva high-performance computing cluster we additionally recommend pointing the conda pkgs and envs directories to your work folder, which has a much larger storage allocation (100-200 GB) than the home folders (5 GB). The conda configuration commands to set these directories are as follows:
mkdir -p /sc/arion/work/$USER/conda/pkgs
mkdir -p /sc/arion/work/$USER/conda/envs
conda config --add pkgs_dirs "/sc/arion/work/$USER/conda/pkgs"
conda config --add envs_dirs "/sc/arion/work/$USER/conda/envs"
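You can optionally verify that the new locations are active by inspecting the conda configuration:
# Optional check: confirm the work-folder locations were registered
conda config --show pkgs_dirs
conda config --show envs_dirs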
After installation and configuration of conda/miniconda, the following 'conda create' command can be used to set up the required 'snakemake' environment.
conda create -c conda-forge -c bioconda -n snakemake 'snakemake=6.8.0' 'mamba=0.24' 'tabulate=0.8'
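After the environment is created, a quick sanity check can confirm the pinned Snakemake version (assuming the installation above completed without errors):
# Activate the new environment and confirm the Snakemake version
conda activate snakemake
snakemake --version   # expected output: 6.8.0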
Running the vRAPID pipeline
The following folder structure should exist in the analysis directory to run the pipeline:
<Run-ID>
├── <virus1>
│ ├── <expect_subtype>
│ │ ├── <batch_ID1>
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
│ │ ├── <batch_ID2>
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
├── <virus2>
│ ├── <expect_subtype>
│ │ ├── <batch_ID1>
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
│ │ ├── <batch_ID2>
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
Then, within each batch subdirectory, create the following two files:
- samples.csv: contains a single column named Sample_ID, holding the list of samples within each virus-batch subdirectory (the new working directory).
- multibamqc_input.txt: a tab-separated file that holds the list of samples from samples.csv in the first column and the path to each sample's bam file in the second column. Note that the bam files are always written to <sample_ID>/02_assembly/<sample_ID>_ref.bam. The multibamqc_input.txt file does not require column names.
An example of both files is shown below:
samples.csv
Sample_ID
sample_ID1
sample_ID2
sample_ID3
multibamqc_input.txt
sample_ID1 sample_ID1/02_assembly/sample_ID1_ref.bam
sample_ID2 sample_ID2/02_assembly/sample_ID2_ref.bam
sample_ID3 sample_ID3/02_assembly/sample_ID3_ref.bam
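For batches with many samples, the two files can also be generated with a small shell loop. The sketch below is illustrative only; it assumes it is run from inside a <batch_ID> directory in which every subdirectory is a sample:
# Illustrative sketch: build samples.csv and multibamqc_input.txt
# from inside a <batch_ID> directory (assumes all subdirectories are samples)
printf 'Sample_ID\n' > samples.csv
: > multibamqc_input.txt
for d in */; do
    s="${d%/}"
    printf '%s\n' "$s" >> samples.csv
    printf '%s\t%s/02_assembly/%s_ref.bam\n' "$s" "$s" "$s" >> multibamqc_input.txt
done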
The final structure should be as follows:
<Run-ID>
├── <virus1>
│ ├── <expect_subtype>
│ │ ├── <batch_ID1>
│ │ │ ├── samples.csv
│ │ │ ├── multibamqc_input.txt
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
│ │ ├── <batch_ID2>
│ │ │ ├── samples.csv
│ │ │ ├── multibamqc_input.txt
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
├── <virus2>
│ ├── <expect_subtype>
│ │ ├── <batch_ID1>
│ │ │ ├── samples.csv
│ │ │ ├── multibamqc_input.txt
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
│ │ ├── <batch_ID2>
│ │ │ ├── samples.csv
│ │ │ ├── multibamqc_input.txt
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
Lastly, depending on the virus, the config.yaml file within the pipeline repository might need to be edited before the pipeline can be run. Note that the current default settings are for SARS-CoV-2. The following fields are required within the file:
- samples: name of the file containing the list of samples being run. Currently set to samples.csv.
- run_id: run ID name, typically identified as TD###### in the sequencing core's data release e-mail.
- reference: path to the reference within the db directory found in the repository. Currently set to sars-cov-2/COVID.fa.
- genbank: path to the GenBank file within the db directory found in the repository. Currently set to sars-cov-2/COVID.gbk.
- reverse_primer: path to the 3-prime primer file within the db directory found in the repository. Currently set to sars-cov-2/SARS-CoV-2_primers_3prime_NI.fa.
- forward_primer: path to the 5-prime primer file within the db directory found in the repository. Currently set to sars-cov-2/SARS-CoV-2_primers_5prime_NI.fa.
- primer_set_2kb: path to the 2 kb primer set file within the db directory found in the repository. Currently set to sars-cov-2/SARS-CoV-2_primers_2kb_set.tsv.
- primer_set_1_5kb: path to the 1.5 kb primer set file within the db directory found in the repository. Currently set to sars-cov-2/SARS-CoV-2_primers_1.5kb_set.tsv.
- virus: the virus name for the samples being run. Currently set to SARS-CoV-2. Other options include: Influenza-A, Influenza-B, sCoV, MPX.
- path: the full path to where the samples being run are located. See above for the proper structure.
- length: length of the reference genome(s). For multi-segmented viruses like influenza, this can be a list of lengths, ordered with respect to the FASTA headers.
- ref_fasta_headers: the name of the FASTA header(s) in the reference genome. For multi-segmented viruses like influenza, this can be a list of headers, ordered with respect to the lengths.
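Putting these fields together, a config.yaml for the default SARS-CoV-2 setup might look like the sketch below. The values are taken from the defaults listed above; run_id, path, length, and ref_fasta_headers are placeholders that must be adjusted to your run and reference FASTA:
samples: samples.csv
run_id: TD000000                  # placeholder; use the ID from the data release e-mail
reference: sars-cov-2/COVID.fa
genbank: sars-cov-2/COVID.gbk
reverse_primer: sars-cov-2/SARS-CoV-2_primers_3prime_NI.fa
forward_primer: sars-cov-2/SARS-CoV-2_primers_5prime_NI.fa
primer_set_2kb: sars-cov-2/SARS-CoV-2_primers_2kb_set.tsv
primer_set_1_5kb: sars-cov-2/SARS-CoV-2_primers_1.5kb_set.tsv
virus: SARS-CoV-2
path: /path/to/<Run-ID>           # placeholder; full path to the run directory
length: 29903                     # placeholder; must match the reference genome length
ref_fasta_headers: COVID          # placeholder; must match the FASTA header(s) in the reference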
Note for users from the Bakel lab only
A typical data release from the sequencing core has the following starting directory structure:
<Run-ID>
└───<sample_ID1>
│ │ <sample_ID1>_1.fastq.gz
│ │ <sample_ID1>_2.fastq.gz
│
└───<sample_ID2>
│ <sample_ID2>_1.fastq.gz
│ <sample_ID2>_2.fastq.gz
In order to create the required directory structure for the pipeline, run:
python ~/opt/vRAPID/workflow/scripts/organize_run_samples.py
This script queries PathogenDB and separates the samples into subdirectories based on virus type, expected subtype, and batch ID. If no subtype is expected, the script places the samples in a None subdirectory. This creates the final directory structure within the sequencing run directory as shown above. You can then proceed to generate the samples.csv and multibamqc_input.txt files as described earlier.
Running vRAPID
Once the files are generated, the pipeline can be run using the following command:
submitjob 12 -c 4 -m 64 -q private -P acc_PVI ~/opt/vRAPID/run-vRAPID-analysis -i <run-ID> -p <path> -s <samples-run-csv>
Note that the <run-ID>, <path>, and <samples-run-csv> arguments in the command above are optional; if they are not supplied, they are pulled from the config.yaml file.
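For example, with placeholder values for the run ID and path, an invocation could look like:
# Placeholder run ID and path, shown for illustration only
submitjob 12 -c 4 -m 64 -q private -P acc_PVI ~/opt/vRAPID/run-vRAPID-analysis \
    -i TD001234 -p /sc/arion/projects/PVI/<path-to-run>/TD001234 -s samples.csv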
vRAPID output
Once the pipeline has completed, the following output files are created:
<Run-ID>
├── <virus1>
│ ├── <expect_subtype>
│ │ ├── <batch_ID1>
│ │ │ ├── samples.csv
│ │ │ ├── multibamqc_input.txt
│ │ │ ├── pipeline.log
│ │ │ ├── file_movement_message.txt
│ │ │ ├── <Run-ID>_cleaned.txt
│ │ │ ├── <Run-ID>_data_release.txt
│ │ │ ├── <Run-ID>_software_version.txt
│ │ │ ├── <Run-ID>_<batch-ID>_assemblies.csv
│ │ │ ├── <Run-ID>_<batch-ID>.tar.gz
│ │ │ ├── logs
│ │ │ ├── multi_bamqc
│ │ │ ├── multiqc_report.html
│ │ │ ├── <sample_ID>
│ │ │ │ ├── 01_fastqs
│ │ │ │ │ ├── <sample_ID>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID>_2.fastq.gz
│ │ │ │ ├── 02_assembly
│ │ │ │ ├── 03_qualityControl
│ │ │ │ ├── 04_variants
│ │ │ │ ├── 05_status
Accessory vRAPID pipeline scripts
The following is a brief description of the scripts that are called within the Snakefile. For more information, see the code snippets from the Snakefile below.
- run_pipeline.py: takes in Illumina paired-end FASTQ files and assembles them against a reference genome, outputting a final FASTA consensus genome sequence, an aligned-read bam file, and other intermediary output files.
- variant_analysis.py: takes in the consensus FASTA file generated by run_pipeline.py and performs an intra-host variant analysis on the virus library.
- run_QC.py: takes in the bam file generated by the run_pipeline.py script and performs a quality-control analysis on the virus library.
- run-genome-annotation.py: takes in the FASTA file generated by run_pipeline.py and annotates the consensus according to the virus type. For Influenza samples, this is done using NCBI's Influenza Annotation Tool; for SARS-CoV-2, it is done using vadr.
- all-virus_assembly_push.py: uploads viral genome assembly data to the Mount Sinai Pathogen Surveillance Program database (PathogenDB).
- move_QC_PDB.py: uploads the output obtained by running multiqc and qualimap to PathogenDB.
- cleanup.py: removes large intermediary output files once the pipeline has completed, to conserve space before archival.
- generate_run_report.R: creates a final assembly run status and quality report for the samples processed within the run, outputting a CSV file.
- data-release.py: prepares vRAPID pipeline outputs for data release.
Code Snippets
Snakefile lines 36-39:
shell:
    """
    mv {input.sample_folder} {output.renamed_file} > {log.rename} 2>&1
    """

Snakefile lines 63-66:
shell:
    """
    python {params.pipeline} -rd {params.repo_dir} -i {input.sample_folder} -r1 _R1_001.fastq.gz -r2 _R2_001.fastq.gz -r {params.reference} -pf {params.forward_primer} -pr {params.reverse_primer} -g {params.genbankfile} -l {params.length} -headers {params.ref_headers} > {log.assembly} 2>&1
    """

Snakefile lines 86-89:
shell:
    """
    python {params.var_analysis} -rd {params.repo_dir} -i {input.sample_folder} -r {params.reference} -ps_2 {params.ps_2kb} -ps_1_5 {params.ps_1_5kb} > {log.varanalysis} 2>&1
    """

Snakefile lines 114-117:
shell:
    """
    python {params.QC} -rd {params.repo_dir} -i {input.sample_folder} -kdb /sc/arion/projects/PVI/db/minikraken_8GB_20200312 -r {params.reference} -pf {params.forward_primer} -pr {params.reverse_primer} -v {params.virus} -pc {params.plotcoverage} -tb {params.breakdown} > {log.QC} 2>&1
    """

Snakefile lines 133-136:
shell:
    """
    python {params.annotation_script} -i {input.consensus_fasta_file} -v {params.virus} > {log.annotation} 2>&1
    """

Snakefile lines 155-159:
shell:
    """
    python {params.pdbpush} -s {params.sample_name} -r {params.run_ID} -c {params.config} -p {params.path} > {log.PDBPush} 2>&1
    touch {output.pdb_upload_check}
    """

Snakefile lines 173-177:
shell:
    """
    module load qualimap
    qualimap multi-bamqc -d multibamqc_input.txt -r > {log.qualimap} 2>&1
    """

Snakefile lines 190-193:
shell:
    """
    multiqc . > {log.multiqc} 2>&1
    """

Snakefile lines 209-212:
shell:
    """
    python {params.push_script} -r {params.run_ID} > {log.pushQC} 2>&1
    """

Snakefile lines 231-272:
shell:
    """
    echo "{params.run_ID} was run using the following versions \n" > {output.version_report}
    echo "\n Name Version Build Channel \n SAMTOOLS: \n" &>> {output.version_report}
    conda list | grep -w "^samtools" &>> {output.version_report}
    echo "\n CUTADAPT: \n" &>> {output.version_report}
    conda list | grep -w "^cutadapt" &>> {output.version_report}
    echo "\n MINIMAP2: \n" &>> {output.version_report}
    conda list | grep "minimap2" &>> {output.version_report}
    echo "\n PILON: \n" &>> {output.version_report}
    conda list | grep "pilon" &>> {output.version_report}
    echo "\n QUALIMAP: \n" &>> {output.version_report}
    conda list | grep "qualimap" &>> {output.version_report}
    echo "\n SHOVILL: \n" &>> {output.version_report}
    conda list | grep "shovill" &>> {output.version_report}
    echo "\n PROKKA: \n" &>> {output.version_report}
    conda list | grep "prokka" &>> {output.version_report}
    echo "\n KRAKEN2: \n" &>> {output.version_report}
    conda list | grep "kraken2" &>> {output.version_report}
    echo "\n PYTHON: \n" &>> {output.version_report}
    conda list | grep "^python" &>> {output.version_report}
    echo "\n MULTIQC: \n" &>> {output.version_report}
    conda list | grep "multiqc" &>> {output.version_report}
    echo "\n FASTQC: \n" &>> {output.version_report}
    conda list | grep "fastqc" &>> {output.version_report}
    echo "\n PICARD Alignment Summary Metrics: \n" &>> {output.version_report}
    conda list | grep "picard" &>> {output.version_report}
    echo "\n {params.run_ID} was aligned against the following reference: \n {params.reference} \n Using the following genbank file: \n {params.genbankfile} \n " &>> {output.version_report}
    echo "\n primer sets used were: \n (1) {params.ps1} \n (2) {params.ps2}" &>> {output.version_report} > {log.versioncheck} 2>&1
    """

Snakefile lines 289-293:
shell:
    """
    python {params.cleanup} -p {params.samples_run} > {log.cleanup} 2>&1
    touch {output.cleanup_report}
    """

Snakefile lines 309-312:
shell:
    """
    Rscript {params.report_script} {params.run_ID} > {log.reporting} 2>&1
    """

Snakefile lines 331-335:
shell:
    """
    python {params.status_script} -p {params.sample_mapping} -r {params.run_ID} -v {params.virus} -f {params.ref_header} > {log.datarelease} 2>&1
    touch {output.data_release_report}
    """
Support
- Future updates