vRAPID: Virus Reference-based Assembly and Identification Pipeline

vRAPID
The Virus Reference-based Assembly Pipeline and IDentification (vRAPID) tool runs the assembly, consensus calling and annotation of viral pathogen genomes.
vRAPID relies on data structures and naming conventions used by the Center for Research on Influenza Pathogenesis and Transmission (CRIPT) and the Mount Sinai Pathogen Surveillance Program (MS-PSP) at the Icahn School of Medicine at Mount Sinai. vRAPID is an expansion of the COVID_pipe pipeline that was written by Mitchell Sullivan in the Bakel Lab.
Prepare the snakemake conda environment
Installation of the required external software packages is largely handled by the pipeline itself; however, a conda environment named 'snakemake' needs to be present on your system. We recommend miniconda, a free minimal installer for conda. Follow the instructions below to start the miniconda installer on Linux. When asked whether the conda environment should automatically be initialized, select 'yes'. Note that Snakemake requires channel_priority to be set to strict; the post-installation commands to apply this setting are included below.
# Start miniconda installation
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
# Post-installation commands to enforce strict channel_priority (required for Snakemake)
conda config --set auto_activate_base false
conda config --set channel_priority strict
On the Mount Sinai Minerva high-performance computing cluster we additionally recommend pointing the conda pkgs and envs directories to your work folder, which has a much larger storage allocation (100-200 GB) than the home folders (5 GB). The conda configuration commands to set these directories are as follows:
mkdir -p /sc/arion/work/$USER/conda/pkgs
mkdir -p /sc/arion/work/$USER/conda/envs
conda config --add pkgs_dirs "/sc/arion/work/$USER/conda/pkgs"
conda config --add envs_dirs "/sc/arion/work/$USER/conda/envs"
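You can optionally verify that the new locations are active by inspecting the conda configuration:
# Optional check: confirm the work-folder locations were registered
conda config --show pkgs_dirs
conda config --show envs_dirs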
After installation and configuration of conda/miniconda, the following 'conda create' command can be used to set up the required 'snakemake' environment.
conda create -c conda-forge -c bioconda -n snakemake 'snakemake=6.8.0' 'mamba=0.24' 'tabulate=0.8'
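After the environment is created, a quick sanity check can confirm the pinned Snakemake version (assuming the installation above completed without errors):
# Activate the new environment and confirm the Snakemake version
conda activate snakemake
snakemake --version   # expected output: 6.8.0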
Running the vRAPID pipeline
The following folder structure should exist in the analysis directory to run the pipeline:
<Run-ID>
├── <virus1>
│ ├── <expect_subtype>
│ │ ├── <batch_ID1>
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
│ │ ├── <batch_ID2>
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
├── <virus2>
│ ├── <expect_subtype>
│ │ ├── <batch_ID1>
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
│ │ ├── <batch_ID2>
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
Then, within each batch subdirectory, create the following two files:
- samples.csv: contains a single column named Sample_ID, holding the list of samples within each virus-batch subdirectory (the new working directory).
- multibamqc_input.txt: a tab-separated file that holds the list of samples from samples.csv in the first column and the path to each sample's bam file in the second column. Note that the bam files are always written to <sample_ID>/02_assembly/<sample_ID>_ref.bam. The multibamqc_input.txt file does not require column names.
An example of both files is shown below:
samples.csv
Sample_ID
sample_ID1
sample_ID2
sample_ID3
multibamqc_input.txt
sample_ID1 sample_ID1/02_assembly/sample_ID1_ref.bam
sample_ID2 sample_ID2/02_assembly/sample_ID2_ref.bam
sample_ID3 sample_ID3/02_assembly/sample_ID3_ref.bam
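For batches with many samples, the two files can also be generated with a small shell loop. The sketch below is illustrative only; it assumes it is run from inside a <batch_ID> directory in which every subdirectory is a sample:
# Illustrative sketch: build samples.csv and multibamqc_input.txt
# from inside a <batch_ID> directory (assumes all subdirectories are samples)
printf 'Sample_ID\n' > samples.csv
: > multibamqc_input.txt
for d in */; do
    s="${d%/}"
    printf '%s\n' "$s" >> samples.csv
    printf '%s\t%s/02_assembly/%s_ref.bam\n' "$s" "$s" "$s" >> multibamqc_input.txt
done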
The final structure should be as follows:
<Run-ID>
├── <virus1>
│ ├── <expect_subtype>
│ │ ├── <batch_ID1>
│ │ │ ├── samples.csv
│ │ │ ├── multibamqc_input.txt
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
│ │ ├── <batch_ID2>
│ │ │ ├── samples.csv
│ │ │ ├── multibamqc_input.txt
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
├── <virus2>
│ ├── <expect_subtype>
│ │ ├── <batch_ID1>
│ │ │ ├── samples.csv
│ │ │ ├── multibamqc_input.txt
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
│ │ ├── <batch_ID2>
│ │ │ ├── samples.csv
│ │ │ ├── multibamqc_input.txt
│ │ │ ├── <sample_ID1>
│ │ │ │ ├── <sample_ID1>
│ │ │ │ │ ├── <sample_ID1>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID1>_2.fastq.gz
│ │ │ ├── <sample_ID2>
│ │ │ │ ├── <sample_ID2>
│ │ │ │ │ ├── <sample_ID2>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID2>_2.fastq.gz
Lastly, depending on the virus, the config.yaml file within the pipeline repository might need to be edited before the pipeline can be run. Note that the current default settings are for SARS-CoV-2. The following fields are required within the file:
- samples: name of the file containing the list of samples being run. Currently set to samples.csv.
- run_id: run ID name, typically identified as TD###### in the sequencing core's data release e-mail.
- reference: path to the reference within the db directory found in the repository. Currently set to sars-cov-2/COVID.fa.
- genbank: path to the GenBank file within the db directory found in the repository. Currently set to sars-cov-2/COVID.gbk.
- reverse_primer: path to the 3-prime primer file within the db directory found in the repository. Currently set to sars-cov-2/SARS-CoV-2_primers_3prime_NI.fa.
- forward_primer: path to the 5-prime primer file within the db directory found in the repository. Currently set to sars-cov-2/SARS-CoV-2_primers_5prime_NI.fa.
- primer_set_2kb: path to the 2 kb primer set file within the db directory found in the repository. Currently set to sars-cov-2/SARS-CoV-2_primers_2kb_set.tsv.
- primer_set_1_5kb: path to the 1.5 kb primer set file within the db directory found in the repository. Currently set to sars-cov-2/SARS-CoV-2_primers_1.5kb_set.tsv.
- virus: the virus name for the samples being run. Currently set to SARS-CoV-2. Other options include: Influenza-A, Influenza-B, sCoV, MPX.
- path: the full path to where the samples being run are located. See above for the proper structure.
- length: length of the reference genome(s). For multi-segmented viruses like influenza, this can be a list of lengths, ordered with respect to the FASTA headers.
- ref_fasta_headers: the name of the FASTA header(s) in the reference genome. For multi-segmented viruses like influenza, this can be a list of headers, ordered with respect to the lengths.
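Putting these fields together, a config.yaml for the default SARS-CoV-2 setup might look like the sketch below. The values are taken from the defaults listed above; run_id, path, length, and ref_fasta_headers are placeholders that must be adjusted to your run and reference FASTA:
samples: samples.csv
run_id: TD000000                  # placeholder; use the ID from the data release e-mail
reference: sars-cov-2/COVID.fa
genbank: sars-cov-2/COVID.gbk
reverse_primer: sars-cov-2/SARS-CoV-2_primers_3prime_NI.fa
forward_primer: sars-cov-2/SARS-CoV-2_primers_5prime_NI.fa
primer_set_2kb: sars-cov-2/SARS-CoV-2_primers_2kb_set.tsv
primer_set_1_5kb: sars-cov-2/SARS-CoV-2_primers_1.5kb_set.tsv
virus: SARS-CoV-2
path: /path/to/<Run-ID>           # placeholder; full path to the run directory
length: 29903                     # placeholder; must match the reference genome length
ref_fasta_headers: COVID          # placeholder; must match the FASTA header(s) in the reference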
Note for users from the Bakel lab only
A typical data release from the sequencing core has the following starting directory structure:
<Run-ID>
└───<sample_ID1>
│ │ <sample_ID1>_1.fastq.gz
│ │ <sample_ID1>_2.fastq.gz
│
└───<sample_ID2>
│ <sample_ID2>_1.fastq.gz
│ <sample_ID2>_2.fastq.gz
In order to create the required directory structure for the pipeline, run:
python ~/opt/vRAPID/workflow/scripts/organize_run_samples.py
This script queries PathogenDB and separates the samples into subdirectories based on virus type, expected subtype, and batch ID. If no subtype is expected, the script places the samples in a None subdirectory. This creates the final directory structure within the sequencing run directory as shown above. You can then proceed to generate the samples.csv and multibamqc_input.txt files as described earlier.
Running vRAPID
Once the files are generated, the pipeline can be run using the following command:
submitjob 12 -c 4 -m 64 -q private -P acc_PVI ~/opt/vRAPID/run-vRAPID-analysis -i <run-ID> -p <path> -s <samples-run-csv>
Note that the <run-ID>, <path>, and <samples-run-csv> arguments in the command above are optional; if they are not supplied, they are pulled from the config.yaml file.
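For example, with placeholder values for the run ID and path, an invocation could look like:
# Placeholder run ID and path, shown for illustration only
submitjob 12 -c 4 -m 64 -q private -P acc_PVI ~/opt/vRAPID/run-vRAPID-analysis \
    -i TD001234 -p /sc/arion/projects/PVI/<path-to-run>/TD001234 -s samples.csv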
vRAPID output
Once the pipeline has completed, the following output files are created:
<Run-ID>
├── <virus1>
│ ├── <expect_subtype>
│ │ ├── <batch_ID1>
│ │ │ ├── samples.csv
│ │ │ ├── multibamqc_input.txt
│ │ │ ├── pipeline.log
│ │ │ ├── file_movement_message.txt
│ │ │ ├── <Run-ID>_cleaned.txt
│ │ │ ├── <Run-ID>_data_release.txt
│ │ │ ├── <Run-ID>_software_version.txt
│ │ │ ├── <Run-ID>_<batch-ID>_assemblies.csv
│ │ │ ├── <Run-ID>_<batch-ID>.tar.gz
│ │ │ ├── logs
│ │ │ ├── multi_bamqc
│ │ │ ├── multiqc_report.html
│ │ │ ├── <sample_ID>
│ │ │ │ ├── 01_fastqs
│ │ │ │ │ ├── <sample_ID>_1.fastq.gz
│ │ │ │ │ ├── <sample_ID>_2.fastq.gz
│ │ │ │ ├── 02_assembly
│ │ │ │ ├── 03_qualityControl
│ │ │ │ ├── 04_variants
│ │ │ │ ├── 05_status
Accessory vRAPID pipeline scripts
The following is a brief description of the scripts that are called within the Snakefile. For more information, see the code snippets from the Snakefile below.
- run_pipeline.py: takes in Illumina paired-end FASTQ files and assembles them against a reference genome, outputting a final FASTA consensus genome sequence, an aligned-read bam file, and other intermediary output files.
- variant_analysis.py: takes in the consensus FASTA file generated by run_pipeline.py and performs an intra-host variant analysis on the virus library.
- run_QC.py: takes in the bam file generated by the run_pipeline.py script and performs a quality-control analysis on the virus library.
- run-genome-annotation.py: takes in the FASTA file generated by run_pipeline.py and annotates the consensus according to the virus type. For Influenza samples, this is done using NCBI's Influenza Annotation Tool; for SARS-CoV-2, it is done using vadr.
- all-virus_assembly_push.py: uploads viral genome assembly data to the Mount Sinai Pathogen Surveillance Program database (PathogenDB).
- move_QC_PDB.py: uploads the output obtained by running multiqc and qualimap to PathogenDB.
- cleanup.py: removes large intermediary output files once the pipeline has completed, to conserve space before archival.
- generate_run_report.R: creates a final assembly run status and quality report for the samples processed within the run, outputting a CSV file.
- data-release.py: prepares vRAPID pipeline outputs for data release.
Code Snippets
Snakefile lines 36-39:
shell:
    """
    mv {input.sample_folder} {output.renamed_file} > {log.rename} 2>&1
    """

Snakefile lines 63-66:
shell:
    """
    python {params.pipeline} -rd {params.repo_dir} -i {input.sample_folder} -r1 _R1_001.fastq.gz -r2 _R2_001.fastq.gz -r {params.reference} -pf {params.forward_primer} -pr {params.reverse_primer} -g {params.genbankfile} -l {params.length} -headers {params.ref_headers} > {log.assembly} 2>&1
    """

Snakefile lines 86-89:
shell:
    """
    python {params.var_analysis} -rd {params.repo_dir} -i {input.sample_folder} -r {params.reference} -ps_2 {params.ps_2kb} -ps_1_5 {params.ps_1_5kb} > {log.varanalysis} 2>&1
    """

Snakefile lines 114-117:
shell:
    """
    python {params.QC} -rd {params.repo_dir} -i {input.sample_folder} -kdb /sc/arion/projects/PVI/db/minikraken_8GB_20200312 -r {params.reference} -pf {params.forward_primer} -pr {params.reverse_primer} -v {params.virus} -pc {params.plotcoverage} -tb {params.breakdown} > {log.QC} 2>&1
    """

Snakefile lines 133-136:
shell:
    """
    python {params.annotation_script} -i {input.consensus_fasta_file} -v {params.virus} > {log.annotation} 2>&1
    """

Snakefile lines 155-159:
shell:
    """
    python {params.pdbpush} -s {params.sample_name} -r {params.run_ID} -c {params.config} -p {params.path} > {log.PDBPush} 2>&1
    touch {output.pdb_upload_check}
    """

Snakefile lines 173-177:
shell:
    """
    module load qualimap
    qualimap multi-bamqc -d multibamqc_input.txt -r > {log.qualimap} 2>&1
    """

Snakefile lines 190-193:
shell:
    """
    multiqc . > {log.multiqc} 2>&1
    """

Snakefile lines 209-212:
shell:
    """
    python {params.push_script} -r {params.run_ID} > {log.pushQC} 2>&1
    """

Snakefile lines 231-272:
shell:
    """
    echo "{params.run_ID} was run using the following versions \n" > {output.version_report}
    echo "\n Name Version Build Channel \n SAMTOOLS: \n" &>> {output.version_report}
    conda list | grep -w "^samtools" &>> {output.version_report}
    echo "\n CUTADAPT: \n" &>> {output.version_report}
    conda list | grep -w "^cutadapt" &>> {output.version_report}
    echo "\n MINIMAP2: \n" &>> {output.version_report}
    conda list | grep "minimap2" &>> {output.version_report}
    echo "\n PILON: \n" &>> {output.version_report}
    conda list | grep "pilon" &>> {output.version_report}
    echo "\n QUALIMAP: \n" &>> {output.version_report}
    conda list | grep "qualimap" &>> {output.version_report}
    echo "\n SHOVILL: \n" &>> {output.version_report}
    conda list | grep "shovill" &>> {output.version_report}
    echo "\n PROKKA: \n" &>> {output.version_report}
    conda list | grep "prokka" &>> {output.version_report}
    echo "\n KRAKEN2: \n" &>> {output.version_report}
    conda list | grep "kraken2" &>> {output.version_report}
    echo "\n PYTHON: \n" &>> {output.version_report}
    conda list | grep "^python" &>> {output.version_report}
    echo "\n MULTIQC: \n" &>> {output.version_report}
    conda list | grep "multiqc" &>> {output.version_report}
    echo "\n FASTQC: \n" &>> {output.version_report}
    conda list | grep "fastqc" &>> {output.version_report}
    echo "\n PICARD Alignment Summary Metrics: \n" &>> {output.version_report}
    conda list | grep "picard" &>> {output.version_report}
    echo "\n {params.run_ID} was aligned against the following reference: \n {params.reference} \n Using the following genbank file: \n {params.genbankfile} \n " &>> {output.version_report}
    echo "\n primer sets used were: \n (1) {params.ps1} \n (2) {params.ps2}" &>> {output.version_report} > {log.versioncheck} 2>&1
    """

Snakefile lines 289-293:
shell:
    """
    python {params.cleanup} -p {params.samples_run} > {log.cleanup} 2>&1
    touch {output.cleanup_report}
    """

Snakefile lines 309-312:
shell:
    """
    Rscript {params.report_script} {params.run_ID} > {log.reporting} 2>&1
    """

Snakefile lines 331-335:
shell:
    """
    python {params.status_script} -p {params.sample_mapping} -r {params.run_ID} -v {params.virus} -f {params.ref_header} > {log.datarelease} 2>&1
    touch {output.data_release_report}
    """
Support
- Future updates