Small collection of tools for performing quality control on coronavirus sequencing data and genomes
Tools and plots for perfoming quality control on coronavirus sequencing results.
Installation
Download the package:
git clone https://github.com/jts/ncov-tools
cd ncov-tools
To use this package, install the dependencies using conda:
conda env create -f workflow/envs/environment.yml
Alternatively, if install times are very slow using conda, we recommend using the conda wrapper: mamba .
Install mamba as follows:
conda install -c conda-forge mamba
Then create the ncov-tools environment using mamba
mamba env create -f workflow/envs/environment.yml
Either way, if you used conda directly or mamba, activate the conda package:
conda activate ncov-qc
Required Configuration
This package is implemented as a snakemake pipeline, so requires a
config.yaml
file to describe where the input files are. To generate QC plots, a bam file with reads mapped to a reference genome is required. Consensus sequences (FASTA) are needed to generate a phylogenetic tree with associated mutations.
As an example, let's say your data is laid out in the following structure:
run_200430/ sampleA.sorted.bam sampleA.consensus.fasta sampleB.sorted.bam sampleB.consensus.fasta resources/ artic_reference.fasta V3/ nCoV-2019.bed
Then your config.yaml should look like:
# path to the top-level directory containing the analysis results
data_root: run_200430
# optionally the plots can have a "run name" prefix. If this is not defined the prefix will be "default"
run_name: my_run
# path to the nCov reference genome
reference_genome: resources/artic_reference.fasta
# the sequencing platform used, can be "oxford-nanopore" or "illumina"
platform: "oxford-nanopore"
# path to the BED file containing the primers, this should follow the format downloaded from
# the ARTIC repository
primer_bed: resources/V3/nCoV-2019.bed
The pipeline is designed to work with the results of
ivar
(illumina) or the artic-ncov2019/fieldbioinformatics workflow (oxford nanopore). It will automatically detect the names of the output files (BAMs, consensus fasta, variants) from these workflows using the
platform
value. If you used a different workflow, you can set the following options to help the pipeline find your files:
# the naming convention for the bam files
# this can use the variables {data_root} (as above) and {sample}
# As per the example above, this will expand to run_200430/sampleA.sorted.bam for sampleA
bam_pattern: "{data_root}/{sample}.sorted.bam"
# the naming convention for the consensus sequences
consensus_pattern: "{data_root}/{sample}.consensus.fasta"
# the naming convention for the variants file, NF illumina runs typically use
# "{data_root}/{sample}.variants.tsv and oxford nanopore runs use "{data_root}/{sample}.pass.vcf.gz"
variants_pattern: "{data_root}/{sample}.variants.tsv
Metadata (optional)
Some plots and QC statistics can be augmented with metadata like the qPCR Ct values, or the date the sample was collected. To enable this feature, add the path to the metadata to config.yaml:
metadata: "/path/to/metadata.tsv"
The expected metadata file is a simple TSV with a
sample
field and optional
ct
and
date
fields. Other fields can be provided but will be ignored.
sample ct date
sampleA 20.8 2020-05-01
sampleB 27.1 2020-06-02
When providing the metadata, the value
NA
can be used for missing data.
Other optional configuration
Additional features can be turned on by adding to the config if desired:
#
# if a list of sample IDs for negative controls is provided, a report containing the amount
# of coverage detected in the negative controls can be generated
#
negative_control_samples: [ "NTC-1", "NTC-2" ]
#
# when building a tree of the consensus genomes you can optionally include other sequences
# in the tree by providing a fasta file here
#
tree_include_consensus: some_genomes_from_gisaid.fasta
# list the type of amplicon BED file that will be created from the "primer_bed". This can include:
# full -- amplicons including primers and overlaps listed in the primer BED file
# no_primers -- amplicons including overlaps but with primers removed
# unique_amplicons -- distinct amplicons regions with primers and overlapping regions removed
bed_type: unique_amplicons
# minimum completeness threshold for inclusion to the SNP tree plot, if no entry
# is provided the default is set to 0.75
completeness_threshold: 0.9
# the set of mutations to automatically flag in the QC reports
# this can be the name of one of the watchlists built into ncov-watch
# or the path to a local VCF file.
# Built in lists: https://github.com/jts/ncov-watch/tree/master/ncov_watch/watchlists
mutation_set: spike_mutations
# user specifiable output directory
# defaults to just current working directory but otherwise
# will write output files to the specified directory
output_directory: run1_output
# primer name prefix used in the primer scheme BED file, the default
# value is "nCoV-2019" which is used for ARTIC V3, note that
# ARTIC V4.1 uses `SARS-CoV-2`
primer_prefix: "SARS-CoV-2"
Running
After configuration, you can run the pipeline using Snakemake
# Build the sequencing QC plots (coverage, allele frequencies)
snakemake -s workflow/Snakefile all_qc_sequencing
# Build the analysis QC plots (tree with annotated mutations)
snakemake -s workflow/Snakefile all_qc_analysis
# Build the quality report tsv files (in qc_reports directory)
snakemake -s workflow/Snakefile all_qc_reports
There is also an
all
rule that executes the three rules noted above in one
snakemake
command:
# Build all the reports and plots
snakemake -s workflow/Snakefile all
You can also build a single PDF summary with the main plots and results. This requires a working installation of pdflatex, which is not provided through the environment
snakemake -s workflow/Snakefile all_final_report
Output
# A plot containing the coverage depth across the SARS-CoV-2 reference genome for each sample in the run
plots/run_name_depth_by_position.pdf
# A plot containing the coverage across all samples, plotted as a heatmap across amplicons
plots/run_name_amplicon_coverage_heatmap.pdf
# A plot with the variation found within each sample, plotted as a tree with associated SNP matrix
plots/run_name_tree_snps.pdf
# A report on per-sample quality metrics and pass/warn/fail criteria
qc_reports/run_name_summary_qc.tsv
# A report on coverage within each negative control
qc_reports/run_name_negative_control_report.tsv
# A report on positions within the genome that are consistently ambiguous across samples (an indicator of possible contamination)
qc_reports/run_name_ambiguous_report.tsv
# A report on samples that have evidence for a mixture of alleles at multiple positions (this code is experimental and still being tested)
qc_reports/run_name_mixture_report.tsv
Variant Annotation
SNVs and Indels are annotated using SNPEff. The
MN908947.3
SNPEff database
is part of the standard set of genomes.
Currently the database is not available for download and requires building. To download the NCBI gene file and build the database, run the following:
snakemake -s workflow/Snakefile --cores 1 build_snpeff_db
Once the database has been built, the workflow can be run using:
snakemake -s workflow/Snakefile --cores 2 all_qc_annotation
Variant annotation output can be found in
qc_annotation
and the recurrent
amino acid change heatmap can be found in
plots/<prefix>_aa_mutation_heatmap.pdf
.
Pangolin Version 4
Pangolin version 4 included several changes which required updates
to the
ncov-tools
environment. By default,
ncov-tools
will run pangolin
4 and will require changes to
ncov-parser
version 1.9 to parse the output
and populate the summary QC file.
Backward compability with Pangolin 3 is available and will require the following
parameter addition in the
config.yaml
file:
pangolin_version: "3"
Note that the specific version is not required, only if it is "3" or "4".
Support for option
--analysis-mode
for
pangolin 4
has been provided as of
ncov-tools
version 1.9.1. The
config.yaml
file should contain the following entry:
pango_analysis_mode: "accurate"
The available options are:
accurate (default)
and
fast
. See the
pangolin
documentation
for further details.
Credit and Acknowledgements
-
The tree-with-SNPs plot was inspired by a plot shared by Mads Albertsen.
-
The script to convert
variants.tsv
files into.vcf
files was obtained from:https://github.com/nf-core/viralrecon/blob/dev/bin/ivar_variants_to_vcf.py
Code Snippets
28 29 | shell: "python {params.rename_script} {params.completeness_opt} {input} > {output}" |
38 39 | shell: "augur align --sequences {input.consensus} --reference-sequence {input.reference} --output {output} --fill-gaps" |
47 48 | shell: "augur tree --alignment {input} --output {output}" |
57 58 | shell: "nw_reroot {input} `head -1 {input.reference} | tr -d \">\"` > {output}" |
68 69 | shell: "python {params.alleles_script} --reference-name MN908947.3 {input} > {output}" |
80 81 | shell: "pangolin --outfile {output} {params.pango_analysis_mode} {input}" |
91 92 | shell: "pangolin -v > {output}; pangolin -pv >> {output}" |
101 102 | shell: 'echo {input} | tr " " "\\n" > {output}' |
113 114 | shell: "cat {input} | ncov-watch -m {params.mutation_set} > {output}" |
128 129 | shell: "python {params.masking_script} -b {input.amplicons} -n {input.negative_control_report} -r {input.reference} -g {input.sample_consensus} -o {output}" |
138 139 | shell: "Rscript {params.plot_script} {output} {input}" |
159 160 161 162 163 164 165 166 167 168 169 | shell: "{params.py_script} --alleles {input.alleles} \ --coverage {input.samplecoverage} \ --variants {input.samplevariants} {params.metadata_opt} \ --indel \ --consensus {input.sampleconsensus} {params.platform_opt} \ --sample {wildcards.sample} \ --lineage {input.lineagereport} \ --aa_table {input.aa_table} \ --mutations {input.watch} {params.pangolin_version_opt} \ --run_name {params.run_name_opt} > {output}" |
179 180 | shell: "{params.py_script} --path qc_analysis > {output}" |
188 189 | shell: "cat {input} | awk 'NR == 1 || $0 !~ /qc_pass/' > {output}" |
75 76 | shell: "python {params.script}" |
85 86 | shell: "{params.script} {input} {output}" |
93 94 | shell: "gzip {input}" |
106 107 | shell: "{params.script} {params.no_log} {params.aa_letter} {params.db} {input} > {output}" |
118 119 | shell: "python {params.script} --file {input.vcf} --sample {params.sample} --output {output}" |
129 130 | shell: "Rscript {params.script} --path qc_annotation --output {output} --threshold {params.threshold}" |
138 139 | shell: "bedtools intersect -a {input.vcf} -b {input.primer_bed} -wb > {output}" |
149 150 | shell: "python {params.script} --sample-name {wildcards.sample} --primer-snp-bed {input.primer_snps} --amplicon-depth-tsv {input.amplicon_depth} > {output}" |
157 158 | shell: "cat {input} | awk 'NR == 1 || $0 !~ /primer_name/' > {output}" |
167 168 | shell: "python {params.script} --input-tsv {input} > {output}" |
259 260 | shell: "{params.script} --primers {input.primers} --offset {params.offset} --bed_type {params.bed_type_opt} --output {output} --primer_prefix {params.primer_prefix}" |
270 271 | shell: "{params.script} --primers {input.primers} --bed_type full --output {output} --primer_prefix {params.primer_prefix}" |
280 281 | shell: "{params.exec} {input}" |
290 291 | shell: "cat {input} | awk '{{ print $1 \"\t0\t\" $2 }}' > {output}" |
38 39 | shell: "python {params.script} --primer_prefix {params.primer_prefix} {input.bed} > {output}" |
50 51 | shell: "python {params.script} --fpileup {input.fpileups} --alleles {input.alleles} > {output}" |
61 62 | shell: "python {params.script} --alleles {input.alleles} --min-count 3 > {output}" |
75 76 | shell: "python {params.script} --run-name {params.run_name} --voc-lineages {params.voc_lineages} {params.platform_opt} > {output}" |
21 22 23 | shell: "echo -e \"reference_name\tstart\tend\tamplicon_id\tpool\tstrand\tread_count\tcovered_bases\tamplicon_length\tfraction_covered\" > {output};" "bedtools coverage -a {input.bed} -b {input.bam} >> {output}" |
36 37 38 | shell: "echo -e \"reference_name\tstart\tend\tamplicon_id\tpool\tstrand\tmean_depth\" > {output};" "bedtools coverage -mean -a {input.bed} -b {input.bam} >> {output}" |
51 52 53 | shell: "echo -e \"reference_name\tstart\tend\tamplicon_id\tpool\tstrand\tposition\tdepth\" > {output};" "bedtools coverage -d -a {input.bed} -b {input.bam} >> {output}" |
65 66 67 | shell: "echo -e \"reference_name\tstart\tend\tposition\tdepth\" > {output};" "bedtools coverage -d -a {input.bed} -b {input.bam} >> {output}" |
79 80 | shell: "ln -s \"$(readlink -f {input})\" {output}" |
88 89 | shell: "samtools index {input}" |
101 102 | shell: "python {params.pileup_script} --bam {input.bam} --reference {input.reference} > {output}" |
110 111 | shell: 'echo {input} | tr " " "\\n" > {output}' |
126 127 | shell: "Rscript {params.plot_script} --path qc_sequencing --output {output.plot} --table {output.table}" |
145 146 | shell: "Rscript {params.plot_script} -t depth_by_position -o {output} {params.metadata_opt}" |
157 158 | shell: "Rscript {params.plot_script} -t negative_control_depth_by_position -o {output} {params.metadata_opt}" |
169 170 | shell: "Rscript {params.plot_script} -t amplicon_depth_by_ct -o {output} -m {input.metadata}" |
180 181 | shell: "Rscript {params.plot_script} -t amplicon_covered_fraction -o {output}" |
192 193 | shell: "Rscript {params.plot_script} -t genome_completeness_by_ct -o {output} -m {input.metadata}" |
Support
- Future updates
Related Workflows





