Assembly and intrahost / low-frequency variant calling for viral samples
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Introduction
nf-core/vipr is a bioinformatics best-practice analysis pipeline for assembly and intrahost / low-frequency variant calling for viral samples.
The pipeline is built using Nextflow , a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker / singularity containers making installation trivial and results highly reproducible.
Pipeline Steps
Step | Main program/s |
---|---|
Trimming, combining of read-pairs per sample and QC | Skewer, FastQC |
Decontamination | decont |
Metagenomics classification / Sample purity | Kraken |
Assembly to contigs | BBtools' Tadpole |
Assembly polishing | ViPR Tools |
Mapping to assembly | BWA, LoFreq |
Low frequency variant calling | LoFreq |
Coverage and variant AF plots (two processes) | Bedtools, ViPR Tools |
Documentation
Documentation about the pipeline can be found in the
docs/
directory:
Credits
This pipeline was originally developed by Andreas Wilm ( andreas-wilm ) at Genome Institute of Singapore . It started out as an ecosystem around LoFreq and went through a couple of iterations. The current version had three predecessors ViPR 1 , ViPR 2 and ViPR 3
An incomplete list of publications using (previous versions of) ViPR:
Plenty of people provided essential feedback, including:
-
ZHU Yuan
-
Shuzhen SIM
-
CHU Wenhan Collins
Code Snippets
143 144 145 146 147 148 149 150 151 152 153 154 155 | """ # loop over readunits in pairs per sample pairno=0 echo ${reads.join(" ")} | xargs -n2 | while read fq1 fq2; do let pairno=pairno+1 # note: don't make reads smaller than assembler kmer length skewer --quiet -t ${task.cpus} -m pe -q 3 -n -l 31 -z -o pair\${pairno}-skewer-out \$fq1 \$fq2; cat *-trimmed-pair1.fastq.gz >> ${sample_id}_R1-trimmed.fastq.gz; cat *-trimmed-pair2.fastq.gz >> ${sample_id}_R2-trimmed.fastq.gz; rm *-trimmed-pair[12].fastq.gz; done fastqc -t {task.cpus} ${sample_id}_R1-trimmed.fastq.gz ${sample_id}_R2-trimmed.fastq.gz; """ |
176 177 178 179 180 | """ decont.py -i ${fq1} ${fq2} -t ${task.cpus} -c 0.5 -r ${cont_fasta} -o ${sample_id}_trimmed_decont; # since this is the last fastqc processing step, let's run fastqc here fastqc -t {task.cpus} ${sample_id}_trimmed_decont_1.fastq.gz ${sample_id}_trimmed_decont_2.fastq.gz; """ |
196 197 198 199 200 201 | """ kraken --threads ${task.cpus} --preload --db ${kraken_db} \ -paired ${fq1} ${fq2} > kraken.out; # do not gzip! otherwise kraken-report happily runs (with some warnings) and produces rubbish results kraken-report --db ${kraken_db} kraken.out > ${sample_id}_kraken.report """ |
219 220 221 | """ tadpole.sh -Xmx10g threads=${task.cpus} in=${fq1} in2=${fq2} out=${sample_id}_contigs.fa, """ |
241 242 243 244 245 246 247 248 249 250 251 252 253 254 | """ set +e; log=${sample_id}-gap-filled-assembly.log; simple_contig_joiner.py -c ${contigs_fa} -r ${input_ref_fasta} \ -s "${sample_id}-gap-filled-assembly" -o ${sample_id}-gap-filled-assembly.fa \ -b "${sample_id}-gap-filled-assembly.gaps.bed" >& \$log; rc=\$? if [ \$rc -ne 0 ]; then # nothing to join, means we cannot continue. so stop here. grep 'Nothing to join' \$log && exit 3; exit \$rc; fi """ |
270 271 272 273 274 275 276 | """ # downsample to 1M reads to increase runtime seqtk sample -s 666 ${fq1} 1000000 | gzip > R1_ds.R1.fastq.gz; seqtk sample -s 666 ${fq2} 1000000 | gzip > R2_ds.R2.fastq.gz; polish_viral_ref.sh -t ${task.cpus} -1 R1_ds.R1.fastq.gz -2 R2_ds.R2.fastq.gz \ -r ${assembly_fa} -o ${sample_id}_polished_assembly.fa """ |
294 295 296 297 298 299 300 301 302 303 304 | """ bwa index ${ref_fa}; samtools faidx ${ref_fa}; bwa mem -t ${task.cpus} ${ref_fa} ${fq1} ${fq2} | \ lofreq viterbi -f ${ref_fa} - | \ lofreq alnqual -u - ${ref_fa} | \ lofreq indelqual --dindel -f ${ref_fa} - | \ samtools sort -o ${sample_id}.bam -T ${sample_id}.final.tmp -; samtools index ${sample_id}.bam; samtools stats ${sample_id}.bam > ${sample_id}.bam.stats """ |
319 320 321 322 323 | """ samtools faidx ${ref_fa}; lofreq call-parallel --pp-threads ${task.cpus} -f ${ref_fa} \ -d 1000000 --call-indels -o ${sample_id}.vcf.gz ${bam} """ |
340 341 342 343 | """ # note: -d is one-based. -dz is zero-based but only non-zero values, so less explicit. bedtools genomecov -d -ibam ${bam} | gzip > ${sample_id}.cov.gz; """ |
358 359 360 361 | """ vipr_af_vs_cov_html.py --vcf ${vcf} --cov ${cov} --plot ${sample_id}_af-vs-cov.html; vipr_gaps_to_n.py -i ${ref_fa} -c ${cov} > ${sample_id}_0cov2N.fa; """ |
Support
- Future updates
Related Workflows





