V-pipe (main multi-virus version): a workflow designed for the analysis of next-generation sequencing (NGS) data from viral pathogens.
V-pipe is a workflow designed for the analysis of next generation sequencing (NGS) data from viral pathogens. It produces a number of results in a curated format (e.g., consensus sequences, SNV calls, local/global haplotypes). V-pipe is written using the Snakemake workflow management system.
Usage
Different ways of initializing V-pipe are presented below. We strongly encourage you to deploy it using the quick install script, as this is our preferred method.
To configure V-pipe, refer to the documentation in config/README.md.
V-pipe expects the input samples to be organized in a two-level directory hierarchy, and the sequencing reads must be provided in a sub-folder named raw_data. Further details can be found on the website. Check the utils subdirectory for mass-importer tools that can assist you in generating this hierarchy.
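For illustration, a paired-end layout could look like this (all names below are hypothetical; only the two directory levels, e.g. sample and sampling date, and the raw_data sub-folder are required):

samples
├── patient1
│   └── 20100113
│       └── raw_data
│           ├── patient1_20100113_R1.fastq
│           └── patient1_20100113_R2.fastq
└── patient2
    └── 20110202
        └── raw_data
            ├── patient2_20110202_R1.fastq
            └── patient2_20110202_R2.fastq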
We provide virus-specific base configuration files which contain handy defaults for, e.g., HIV and SARS-CoV-2. Set the virus in the general section of the configuration file:
general:
  virus_base_config: hiv
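The same mechanism selects the other bundled defaults; for example, assuming the SARS-CoV-2 base configuration shipped with V-pipe:

general:
  virus_base_config: sars-cov-2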
Also see Snakemake's documentation to learn more about the command-line options available when executing the workflow.
Tutorials introducing usage of V-pipe are available in the docs/ subdirectory.
Using the quick install script
To deploy V-pipe, use the installation script with the following parameters:
curl -O 'https://raw.githubusercontent.com/cbg-ethz/V-pipe/master/utils/quick_install.sh'
./quick_install.sh -w work
This script will download and install miniconda, check out the V-pipe git repository (use -b to specify which branch/tag) and set up a work directory (specified with -w) with an executable script that will run the workflow:

cd work
# edit config.yaml and provide samples/ directory
./vpipe --jobs 4 --printshellcmds --dry-run
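If the dry run prints the expected jobs, the same wrapper starts the actual analysis; simply drop the --dry-run flag:

cd work
./vpipe --jobs 4 --printshellcmds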
Using Docker
Note: the Docker image is only set up with the components needed to run the workflow for the HIV and SARS-CoV-2 virus base configurations. Using V-pipe with other viruses or configurations might require internet connectivity to fetch additional software components.
Create config.yaml or vpipe.config and then populate the samples/ directory. For example, the following config file could be used:

general:
  virus_base_config: hiv

output:
  snv: true
  local: true
  global: false
  visualization: true
  QA: true
Then execute:
docker run --rm -it -v $PWD:/work ghcr.io/cbg-ethz/v-pipe:master --jobs 4 --printshellcmds --dry-run
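Note that the current directory is mounted as the container's work directory (-v $PWD:/work), so config.yaml (or vpipe.config) and the samples/ directory must be present in the directory from which docker run is invoked; the remaining arguments are passed through to the workflow.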
Using Snakedeploy
First install mamba, then create and activate an environment with Snakemake and Snakedeploy:
mamba create -c bioconda -c conda-forge --name snakemake snakemake snakedeploy
conda activate snakemake
Snakemake's official workflow installer Snakedeploy can now be used:
snakedeploy deploy-workflow https://github.com/cbg-ethz/V-pipe --tag master .
# edit config/config.yaml and provide samples/ directory
snakemake --use-conda --jobs 4 --printshellcmds --dry-run
Code Snippets
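The snippets below are the shell (and run) blocks of the individual Snakemake rules. Placeholders in single braces, such as {input}, {output}, {params.BWA} or {threads}, are substituted by Snakemake when a rule is executed; doubled braces ({{…}}) produce literal braces in the resulting shell script.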
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    source {params.FUNCTIONS}

    ERRFILE=$(basename {log.errfile})
    OUTFILE=$(basename {log.outfile})

    # 1. copy initial reference for bwa
    rm -rf {params.WORK_DIR}/
    mkdir -p {params.WORK_DIR}/
    cp {input.global_ref} {params.WORK_DIR}/consensus.fasta
    cd {params.WORK_DIR}

    # 2. create bwa index
    {params.BWA} index consensus.fasta 2> >(tee $ERRFILE >&2)

    # 3. create initial alignment
    if [[ {params.PAIRED_BOOL} == "true" ]]; then
        {params.BWA} mem -t {threads} consensus.fasta ../preprocessed_data/R{{1,2}}.fastq > first_aln.sam 2> >(tee -a $ERRFILE >&2)
    else
        {params.BWA} mem -t {threads} consensus.fasta ../preprocessed_data/R1.fastq > first_aln.sam 2> >(tee -a $ERRFILE >&2)
    fi
    rm consensus.fasta.*

    # 4. remove unmapped reads
    {params.SAMTOOLS} view -b -F 4 first_aln.sam > mapped.bam 2> >(tee -a $ERRFILE >&2)
    rm first_aln.sam

    # 5. extract reads
    mkdir -p cleaned
    SamToFastq {params.PICARD} I=mapped.bam FASTQ=cleaned/R1.fastq {params.PAIRED} VALIDATION_STRINGENCY=SILENT 2> >(tee -a $ERRFILE >&2)
    rm mapped.bam

    # 6. create config file
    # NOTE: Tabs are required below
    if [[ {params.PAIRED_BOOL} == "true" ]]; then
        cat > vicuna_config.txt <<- _EOF_
		minMSize 9
		maxOverhangSize 2
		Divergence 8
		max_read_overhang 2
		max_contig_overhang 10
		pFqDir cleaned/
		batchSize 100000
		LibSizeLowerBound 100
		LibSizeUpperBound 800
		min_output_contig_len 1000
		outputDIR ./
		_EOF_
    else
        cat > vicuna_config.txt <<- _EOF_
		minMSize 9
		maxOverhangSize 2
		Divergence 8
		max_read_overhang 2
		max_contig_overhang 10
		npFqDir cleaned/
		batchSize 100000
		min_output_contig_len 1000
		outputDIR ./
		_EOF_
    fi

    # 7. VICUNA
    OMP_NUM_THREADS={threads} {params.VICUNA} vicuna_config.txt > $OUTFILE 2> >(tee -a $ERRFILE >&2)
    rm vicuna_config.txt
    rm -r cleaned/

    # 8. fix broken header
    sed -e 's:>dg-\([[:digit:]]\+\)\s.*:>dg-\1:g' contig.fasta > contig_clean.fasta

    # 9. InDelFixer + ConsensusFixer to polish up consensus
    for i in {{1..3}}
    do
        mv consensus.fasta old_consensus.fasta
        indelFixer {params.INDELFIXER} -i contig_clean.fasta -g old_consensus.fasta >> $OUTFILE 2> >(tee -a $ERRFILE >&2)
        sam2bam {params.SAMTOOLS} reads.sam >> $OUTFILE 2> >(tee $ERRFILE >&2)
        consensusFixer {params.CONSENSUSFIXER} -i reads.bam -r old_consensus.fasta -mcc 1 -mic 1 -d -pluralityN 0.01 >> $OUTFILE 2> >(tee $ERRFILE >&2)
    done
    sed -i -e "s/>.*/>${{CONSENSUS_NAME}}/" consensus.fasta
    echo "" >> consensus.fasta

    # 10. finally, move into place
    mkdir -p ../references
    mv {{,../references/vicuna_}}consensus.fasta
    """
shell:
    """
    cat {input} > initial_ALL.fasta
    {params.MAFFT} --nuc --preservecase --maxiterate 1000 --localpair --thread {threads} initial_ALL.fasta > references/initial_aln.fasta 2> >(tee {log.errfile} >&2)
    rm initial_ALL.fasta

    {params.REMOVE_GAPS} references/initial_aln.fasta -o {output} -p 0.5 > {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    mkdir -p {wildcards.dataset}/references/
    {params.EXTRACT_SEQ} {input} -o {output} -s "${{CONSENSUS_NAME}}"
    """
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    mkdir -p {wildcards.dataset}/references/
    cp {input} {output}
    sed -i -e "s/>.*/>${{CONSENSUS_NAME}}/" {output}
    """
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    mkdir -p {wildcards.dataset}/references/
    cp {input} {output}
    sed -i -e "s/>.*/>${{CONSENSUS_NAME}}/" {output}
    """
shell:
    """
    {params.GUNZIP} -c {input} > {output} 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    # 1. clean previous run
    rm -rf {wildcards.dataset}/alignments
    rm -f {wildcards.dataset}/references/ref_ambig.fasta
    rm -f {wildcards.dataset}/references/ref_majority.fasta
    mkdir -p {wildcards.dataset}/alignments
    mkdir -p {wildcards.dataset}/references

    # 2. perform alignment
    # -l = leave temps
    {params.NGSHMMALIGN} -v {params.EXTRA} -R {input.initial_ref} -o {output.good_aln} -w {output.reject_aln} -t {threads} -N "${{CONSENSUS_NAME}}" {params.LEAVE_TEMP} {input.FASTQ} > {log.outfile} 2> >(tee {log.errfile} >&2)

    # 3. move references into place
    mv {wildcards.dataset}/{{alignments,references}}/ref_ambig.fasta
    mv {wildcards.dataset}/{{alignments,references}}/ref_majority.fasta
    """
shell:
    """
    cat {input} > ALL_{wildcards.kind}.fasta
    {params.MAFFT} --nuc --preservecase --maxiterate 1000 --localpair --thread {threads} ALL_{wildcards.kind}.fasta > {output} 2> >(tee {log.errfile} >&2)
    rm ALL_{wildcards.kind}.fasta
    """
shell:
    """
    {params.CONVERT_REFERENCE} -t {params.REF_NAME} -m {input.REF_ambig} -i {input.BAM} -o {output} > {log.outfile} 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    echo "Writing BAM file"
    rm -f '{params.sort_tmp}'.[0-9]*.bam
    {params.SAMTOOLS} sort -T "{params.sort_tmp}" -o "{output.BAM}" "{input}" > {log.outfile} 2> >(tee {log.errfile} >&2)
    {params.SAMTOOLS} index "{output.BAM}" >> {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    {params.BWA} index {input} 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    {params.BWA} mem -t {threads} {params.EXTRA} -o "{output.TMP_SAM}" "{input.REF}" {input.FASTQ} 2> >(tee {log.errfile} >&2)
    # Filter alignments: (1) remove unmapped reads (single-end) or keep only reads mapped in proper pairs (paired-end), (2) remove supplementary alignments
    {params.SAMTOOLS} view -h {params.FILTER} -F 2048 -o "{output.REF}" "{output.TMP_SAM}" 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    # index the reference (presumably bowtie2-build: {input} doubles as the FASTA and the index basename)
    {params.BOWTIE} {input} {input} 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    {params.BOWTIE} -x {input.REF} -1 {input.R1} -2 {input.R2} {params.PHRED} {params.PRESET} -X {params.MAXINS} {params.EXTRA} -p {threads} -S {output.TMP_SAM} 2> >(tee {log.errfile} >&2)
    # Filter alignments: (1) keep only reads mapped in proper pairs, and (2) remove supplementary alignments
    {params.SAMTOOLS} view -h -f 2 -F 2048 -o "{output.REF}" "{output.TMP_SAM}" 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    {params.BOWTIE} -x {input.REF} -U {input.R1} {params.PHRED} {params.PRESET} {params.EXTRA} -p {threads} -S {output.TMP_SAM} 2> >(tee {log.errfile} >&2)
    # Filter alignments: (1) remove unmapped reads, and (2) remove supplementary alignments
    {params.SAMTOOLS} view -h -F 4 -F 2048 -o "{output.REF}" "{output.TMP_SAM}" 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    rm -rf {params.DIR}/*/*/extracted_data
    """
shell:
    """
    rm -rf {params.DIR}/*/*/preprocessed_data
    """
shell:
    """
    rm -rf {params.DIR}/*/*/initial_consensus
    rm -rf {params.DIR}/*/*/references/vicuna_consensus.fasta
    rm -rf {params.DIR}/*/*/references/initial_consensus.fasta
    rm -rf references/initial_aln.fasta
    rm -rf references/initial_aln_gap_removed.fasta
    rm -rf references/MAFFT_initial_aln.*
    """
shell:
    """
    rm -rf references/ALL_aln_*.fasta
    rm -rf references/MAFFT_*_cohort.*
    """
shell:
    """
    rm -rf {params.DIR}/*/*/alignments
    rm -rf {params.DIR}/*/*/QA_alignments
    rm -rf {params.DIR}/*/*/references/ref_ambig.fasta
    rm -rf {params.DIR}/*/*/references/ref_majority.fasta
    rm -rf {params.DIR}/*/*/references/initial_consensus.fasta
    """
shell:
    """
    rm -f {input}
    rm -rf {params.DIR}/*/*/alignments
    rm -rf {params.DIR}/*/*/references/ref_ambig*.fasta
    rm -rf {params.DIR}/*/*/references/ref_majority*.fasta
    """
shell:
    """
    rm -f {input}
    rm -rf {params.DIR}/*/*/alignments
    rm -rf {params.DIR}/*/*/references/ref_ambig*.fasta
    rm -rf {params.DIR}/*/*/references/ref_majority*.fasta
    """
shell:
    """
    rm -rf {params.DIR}/*/*/variants/SNVs
    """
shell:
    """
    rm -rf {params.DIR}/*/*/variants/global/contigs_stage_?.fasta
    rm -rf {params.DIR}/*/*/variants/global/stage_?
    """
shell:
    """
    rm {params.DIR}/*/*/variants/global/quasispecies.*
    """
shell:
    """
    rm -rf {params.DIR}/*/*/variants/global/predicthaplo/
    """
shell:
    """
    rm -rf {params.DIR}/*/*/visualization
    """
shell:
    """
    {params.bcftools} mpileup \
        --threads {threads} \
        -Ou \
        -f {input.fname_ref} \
        --max-depth {params.max_coverage} \
        --max-idepth {params.max_coverage} \
        --annotate FORMAT/AD,FORMAT/DP,INFO/AD \
        {input.fname_bam} \
    | {params.bcftools} call \
        --threads {threads} \
        -Ou \
        -mv \
        --keep-alts \
    | {params.bcftools} norm \
        --threads {threads} \
        -Ou \
        -f {input.fname_ref} \
    | {params.bcftools} filter \
        --threads {threads} \
        -e 'TYPE="INDEL" & INFO/AD[1]<INFO/AD[0]' \
        -Ob \
        --output {output.fname_temp_bcf} \
        2> >(tee {log.errfile} >&2)

    #bcftools csq -f {input.fname_ref} -g wheretogetthis.gff3.gz in.vcf -Ob -o out.bcf

    {params.gunzip} -c {input.fname_cov} | tail -n +2 \
        | awk -v base={params.tsvbased} \
            '$3 < {params.mask_coverage_threshold} {{printf "%s\\t%d\\t%d\\n", $1, $2 - base, $2 - base + 1}}' \
        > {output.fname_mask_lowcoverage} \
        2> >(tee -a {log.errfile} >&2)

    # preparations
    {params.enhance_bcf} \
        {output.fname_temp_bcf} \
        {output.fname_bcf} \
        {params.ambiguous_base_coverage_threshold} \
        2> >(tee -a {log.errfile} >&2)

    {params.bcftools} index {output.fname_bcf} 2> >(tee -a {log.errfile} >&2)

    common_consensus_params="--fasta-ref {input.fname_ref} --mark-del - --mask {output.fname_mask_lowcoverage} --mask-with n"

    # majority bases
    {params.bcftools} consensus \
        --output {output.fname_fasta} \
        --chain {output.fname_chain} \
        $common_consensus_params \
        -H A \
        -i "INFO/AD[0]<INFO/AD[*]" \
        {output.fname_bcf} \
        2> >(tee -a {log.errfile} >&2)

    # ambiguous bases
    {params.bcftools} consensus \
        --output {output.fname_fasta_ambig} \
        --chain {output.fname_chain_ambig} \
        $common_consensus_params \
        -H I --iupac-codes \
        {output.fname_bcf} \
        2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    {params.EXTRACT_CONSENSUS} -i {input.BAM} -f {input.REF} -c {params.MIN_COVERAGE} -n {params.N_COVERAGE} -q {params.QUAL_THRD} -a {params.MIN_FREQ} -N "${{CONSENSUS_NAME}}" -o {params.OUTDIR} > {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    if tail -n +2 {input.REF_majority_dels} | grep -qE '[^n]'; then
        {params.MATCHER} -asequence {input.REF} -bsequence {input.REF_majority_dels} -outfile {output.REF_matcher} 2> >(tee {log.errfile} >&2)
    else
        touch {output.REF_matcher}
        echo "pure 'nnnn...' consensus, no possible alignment" | tee {log.outfile}
    fi
    """
shell:
    """
    {params.FRAMESHIFT_DEL_CHECKS} -i {input.BAM} -c {input.REF_majority_dels} -f {input.REF_NAME} -g {input.GENES_GFF} --english=true -o {output.FRAMESHIFT_DEL_CHECK_TSV} 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    {params.ALN2BASECNT} --first "{params.ARRAYBASED}" --basecnt "{output.BASECNT}" --coverage "{output.COVERAGE}" --name "{params.NAME}" --stats "{output.STATS}" "{input.BAM}" > {log.outfile} 2> >(tee {log.errfile} >&2)
    """
run:
    with open(output[0], "w") as out:
        out.write("\n".join(input))
shell:
    """
    {params.GATHER_COVERAGE} --output {output.COVERAGE} --stats {output.COVSTATS} --threads {threads} @{input.COVLIST} > >(tee {log.outfile}) 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    {params.MINORITY_CALLER} -r {input.REF} -c {params.MIN_COVERAGE} -N {params.NAMES} -t {threads} -o {params.OUTDIR} {params.FREQUENCIES} {input.BAM} > >(tee {log.outfile}) 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    {params.GUNZIP} -c {input} > {output}
    """
shell:
    """
    echo "The length cutoff is: {params.LEN_CUTOFF}" > {log.outfile}

    {params.PRINSEQ} -fastq {input.R1} -fastq2 {input.R2} {params.EXTRA} -out_format 3 -out_good {wildcards.dataset}/preprocessed_data/R -out_bad null -min_len {params.LEN_CUTOFF} -log {log.outfile} 2> >(tee {log.errfile} >&2)

    # make sure that the lock held by prinseq has been effectively released and propagated
    # on some networked shares this could otherwise lead to confusion or corruption
    if [[ "$OSTYPE" =~ ^linux ]]; then
        echo "Waiting for unlocks" >&2
        for U in {wildcards.dataset}/preprocessed_data/R_{{1,2}}.fastq; do
            flock -x -o ${{U}} -c "echo ${{U}} unlocked >&2"
        done
    fi

    mv {wildcards.dataset}/preprocessed_data/R{{_,}}1.fastq
    mv {wildcards.dataset}/preprocessed_data/R{{_,}}2.fastq

    rm -f {wildcards.dataset}/preprocessed_data/R_?_singletons.fastq

    gzip {wildcards.dataset}/preprocessed_data/R1.fastq
    gzip {wildcards.dataset}/preprocessed_data/R2.fastq
    """
shell:
    """
    echo "The length cutoff is: {params.LEN_CUTOFF}" > {log.outfile}

    {params.PRINSEQ} -fastq {input.R1} {params.EXTRA} -out_format 3 -out_good {wildcards.dataset}/preprocessed_data/R -out_bad null -min_len {params.LEN_CUTOFF} -log {log.outfile} 2> >(tee {log.errfile} >&2)

    # make sure that the lock held by prinseq has been effectively released and propagated
    # on some network shares this could otherwise lead to confusion or corruption
    if [[ "$OSTYPE" =~ ^linux ]]; then
        echo "Waiting for unlocks" >&2
        for U in {wildcards.dataset}/preprocessed_data/R.fastq; do
            flock -x -o ${{U}} -c "echo ${{U}} unlocked >&2"
        done
    fi

    mv {wildcards.dataset}/preprocessed_data/R{{,1}}.fastq
    gzip {wildcards.dataset}/preprocessed_data/R1.fastq
    """
shell:
    """
    {params.FASTQC} -o {params.OUTDIR} -t {threads} {params.NOGROUP} {input} 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    {params.EXTRACT_COVERAGE_INTERVALS} -b {params.ARRAYBASED} -cf {input.TSV} -c {params.COVERAGE} --no-shorah -N {params.NAMES} -o {output} {input.BAM} > >(tee {log.outfile}) 2>&1
    """
shell:
    """
    SAMPLE_ID="{params.ID}"

    # Number of input reads
    LINECOUNT=$( cat {input.R1} | wc -l )
    let "INPUT=LINECOUNT / {params.FACTOR}"

    # Number of reads after QC
    # For portability reasons not using zcat
    {params.GUNZIP} -c {input.R1_QC} > {params.R1_temp}
    LINECOUNT=$( cat {params.R1_temp} | wc -l )
    let "READCOUNT=LINECOUNT / {params.FACTOR}"
    rm {params.R1_temp}

    # Number of aligned reads
    ALNCOUNT=$( {params.SAMTOOLS} view {input.BAM} | wc -l )

    echo -e "${{SAMPLE_ID}}\t${{INPUT}}\t${{READCOUNT}}\t${{ALNCOUNT}}" > {output}
    """
shell:
    """
    cat {input} > stats/temp
    echo -e "ID\tInput\tQC\tAlignments" | cat - stats/temp > {output}
    rm stats/temp
    """