Small collection of tools for performing quality control on coronavirus sequencing data and genomes

public 1yr ago Version: v1.9.1 0 bookmarks

View Workflow

Tools and plots for perfoming quality control on coronavirus sequencing results.

Installation

Download the package:

git clone https://github.com/jts/ncov-tools
cd ncov-tools

To use this package, install the dependencies using conda:

conda env create -f workflow/envs/environment.yml

Alternatively, if install times are very slow using conda, we recommend using the conda wrapper: mamba .

Install mamba as follows:

conda install -c conda-forge mamba

Then create the ncov-tools environment using mamba

mamba env create -f workflow/envs/environment.yml

Either way, if you used conda directly or mamba, activate the conda package:

conda activate ncov-qc

Required Configuration

This package is implemented as a snakemake pipeline, so requires a config.yaml file to describe where the input files are. To generate QC plots, a bam file with reads mapped to a reference genome is required. Consensus sequences (FASTA) are needed to generate a phylogenetic tree with associated mutations.

As an example, let's say your data is laid out in the following structure:

 run_200430/ sampleA.sorted.bam sampleA.consensus.fasta sampleB.sorted.bam sampleB.consensus.fasta resources/ artic_reference.fasta V3/ nCoV-2019.bed

Then your config.yaml should look like:

# path to the top-level directory containing the analysis results
data_root: run_200430
# optionally the plots can have a "run name" prefix. If this is not defined the prefix will be "default"
run_name: my_run
# path to the nCov reference genome
reference_genome: resources/artic_reference.fasta
# the sequencing platform used, can be "oxford-nanopore" or "illumina"
platform: "oxford-nanopore"
# path to the BED file containing the primers, this should follow the format downloaded from
# the ARTIC repository
primer_bed: resources/V3/nCoV-2019.bed

The pipeline is designed to work with the results of ivar (illumina) or the artic-ncov2019/fieldbioinformatics workflow (oxford nanopore). It will automatically detect the names of the output files (BAMs, consensus fasta, variants) from these workflows using the platform value. If you used a different workflow, you can set the following options to help the pipeline find your files:

# the naming convention for the bam files
# this can use the variables {data_root} (as above) and {sample}
# As per the example above, this will expand to run_200430/sampleA.sorted.bam for sampleA
bam_pattern: "{data_root}/{sample}.sorted.bam"
# the naming convention for the consensus sequences
consensus_pattern: "{data_root}/{sample}.consensus.fasta"
# the naming convention for the variants file, NF illumina runs typically use
# "{data_root}/{sample}.variants.tsv and oxford nanopore runs use "{data_root}/{sample}.pass.vcf.gz"
variants_pattern: "{data_root}/{sample}.variants.tsv

Metadata (optional)

Some plots and QC statistics can be augmented with metadata like the qPCR Ct values, or the date the sample was collected. To enable this feature, add the path to the metadata to config.yaml:

metadata: "/path/to/metadata.tsv"

The expected metadata file is a simple TSV with a sample field and optional ct and date fields. Other fields can be provided but will be ignored.

sample ct date
sampleA 20.8 2020-05-01
sampleB 27.1 2020-06-02

When providing the metadata, the value NA can be used for missing data.

Other optional configuration

Additional features can be turned on by adding to the config if desired:

#
# if a list of sample IDs for negative controls is provided, a report containing the amount
# of coverage detected in the negative controls can be generated
#
negative_control_samples: [ "NTC-1", "NTC-2" ]
#
# when building a tree of the consensus genomes you can optionally include other sequences
# in the tree by providing a fasta file here
#
tree_include_consensus: some_genomes_from_gisaid.fasta
# list the type of amplicon BED file that will be created from the "primer_bed". This can include:
# full -- amplicons including primers and overlaps listed in the primer BED file
# no_primers -- amplicons including overlaps but with primers removed
# unique_amplicons -- distinct amplicons regions with primers and overlapping regions removed
bed_type: unique_amplicons
# minimum completeness threshold for inclusion to the SNP tree plot, if no entry
# is provided the default is set to 0.75
completeness_threshold: 0.9
# the set of mutations to automatically flag in the QC reports
# this can be the name of one of the watchlists built into ncov-watch
# or the path to a local VCF file. 
# Built in lists: https://github.com/jts/ncov-watch/tree/master/ncov_watch/watchlists
mutation_set: spike_mutations
# user specifiable output directory 
# defaults to just current working directory but otherwise
# will write output files to the specified directory
output_directory: run1_output
# primer name prefix used in the primer scheme BED file, the default
# value is "nCoV-2019" which is used for ARTIC V3, note that
# ARTIC V4.1 uses `SARS-CoV-2`
primer_prefix: "SARS-CoV-2"

Running

After configuration, you can run the pipeline using Snakemake

# Build the sequencing QC plots (coverage, allele frequencies)
snakemake -s workflow/Snakefile all_qc_sequencing
# Build the analysis QC plots (tree with annotated mutations)
snakemake -s workflow/Snakefile all_qc_analysis
# Build the quality report tsv files (in qc_reports directory)
snakemake -s workflow/Snakefile all_qc_reports

There is also an all rule that executes the three rules noted above in one snakemake command:

# Build all the reports and plots
snakemake -s workflow/Snakefile all

You can also build a single PDF summary with the main plots and results. This requires a working installation of pdflatex, which is not provided through the environment

snakemake -s workflow/Snakefile all_final_report

Output

# A plot containing the coverage depth across the SARS-CoV-2 reference genome for each sample in the run
plots/run_name_depth_by_position.pdf
# A plot containing the coverage across all samples, plotted as a heatmap across amplicons
plots/run_name_amplicon_coverage_heatmap.pdf
# A plot with the variation found within each sample, plotted as a tree with associated SNP matrix
plots/run_name_tree_snps.pdf
# A report on per-sample quality metrics and pass/warn/fail criteria
qc_reports/run_name_summary_qc.tsv
# A report on coverage within each negative control
qc_reports/run_name_negative_control_report.tsv
# A report on positions within the genome that are consistently ambiguous across samples (an indicator of possible contamination)
qc_reports/run_name_ambiguous_report.tsv
# A report on samples that have evidence for a mixture of alleles at multiple positions (this code is experimental and still being tested)
qc_reports/run_name_mixture_report.tsv

Variant Annotation

SNVs and Indels are annotated using SNPEff. The MN908947.3 SNPEff database is part of the standard set of genomes.

Currently the database is not available for download and requires building. To download the NCBI gene file and build the database, run the following:

snakemake -s workflow/Snakefile --cores 1 build_snpeff_db

Once the database has been built, the workflow can be run using:

snakemake -s workflow/Snakefile --cores 2 all_qc_annotation

Variant annotation output can be found in qc_annotation and the recurrent amino acid change heatmap can be found in plots/<prefix>_aa_mutation_heatmap.pdf .

Pangolin Version 4

Pangolin version 4 included several changes which required updates to the ncov-tools environment. By default, ncov-tools will run pangolin 4 and will require changes to ncov-parser version 1.9 to parse the output and populate the summary QC file.

Backward compability with Pangolin 3 is available and will require the following parameter addition in the config.yaml file:

pangolin_version: "3"

Note that the specific version is not required, only if it is "3" or "4".

Support for option --analysis-mode for pangolin 4 has been provided as of ncov-tools version 1.9.1. The config.yaml file should contain the following entry:

pango_analysis_mode: "accurate"

The available options are: accurate (default) and fast . See the pangolin documentation for further details.

Credit and Acknowledgements

The tree-with-SNPs plot was inspired by a plot shared by Mads Albertsen.
The script to convert variants.tsv files into .vcf files was obtained from: https://github.com/nf-core/viralrecon/blob/dev/bin/ivar_variants_to_vcf.py

Code Snippets

shell:
    "python {params.rename_script} {params.completeness_opt} {input} > {output}"

SnakeMake From line 28 of rules/analysis.smk

shell:
    "augur align --sequences {input.consensus} --reference-sequence {input.reference} --output {output} --fill-gaps"

SnakeMake Augur From line 38 of rules/analysis.smk

shell:
    "augur tree --alignment {input} --output {output}"

SnakeMake Augur From line 47 of rules/analysis.smk

shell:
    "nw_reroot {input} `head -1 {input.reference} | tr -d \">\"` > {output}"

SnakeMake From line 57 of rules/analysis.smk

shell:
    "python {params.alleles_script} --reference-name MN908947.3 {input} > {output}"

SnakeMake From line 68 of rules/analysis.smk

shell:
    "pangolin --outfile {output} {params.pango_analysis_mode} {input}"

SnakeMake Pangolin From line 80 of rules/analysis.smk

shell:
    "pangolin -v > {output}; pangolin -pv >> {output}"

SnakeMake Pangolin From line 91 of rules/analysis.smk

shell:
    'echo {input} | tr " " "\\n" > {output}'

SnakeMake From line 101 of rules/analysis.smk

shell:
    "cat {input} | ncov-watch -m {params.mutation_set} > {output}"

SnakeMake From line 113 of rules/analysis.smk

shell:
    "python {params.masking_script} -b {input.amplicons} -n {input.negative_control_report} -r {input.reference} -g {input.sample_consensus} -o {output}"

SnakeMake From line 128 of rules/analysis.smk

shell:
    "Rscript {params.plot_script} {output} {input}"

SnakeMake From line 138 of rules/analysis.smk

shell:
    "{params.py_script} --alleles {input.alleles} \
                        --coverage {input.samplecoverage} \
                        --variants {input.samplevariants} {params.metadata_opt} \
                        --indel \
                        --consensus {input.sampleconsensus} {params.platform_opt} \
                        --sample {wildcards.sample} \
                        --lineage {input.lineagereport} \
                        --aa_table {input.aa_table} \
                        --mutations {input.watch} {params.pangolin_version_opt} \
                        --run_name {params.run_name_opt} > {output}"

SnakeMake From line 159 of rules/analysis.smk

shell:
    "{params.py_script} --path qc_analysis > {output}"

SnakeMake From line 179 of rules/analysis.smk

shell:
    "cat {input} | awk 'NR == 1 || $0 !~ /qc_pass/' > {output}"

SnakeMake From line 188 of rules/analysis.smk

shell:
    "python {params.script}"

SnakeMake From line 75 of rules/annotation.smk

shell:
    "{params.script} {input} {output}"

SnakeMake From line 85 of rules/annotation.smk

shell:
    "gzip {input}" 

SnakeMake From line 93 of rules/annotation.smk

shell:
    "{params.script} {params.no_log} {params.aa_letter} {params.db} {input} > {output}"

SnakeMake From line 106 of rules/annotation.smk

shell:
    "python {params.script} --file {input.vcf} --sample {params.sample} --output {output}"

SnakeMake From line 118 of rules/annotation.smk

shell:
    "Rscript {params.script} --path qc_annotation --output {output} --threshold {params.threshold}"

SnakeMake From line 129 of rules/annotation.smk

shell:
    "bedtools intersect -a {input.vcf} -b {input.primer_bed} -wb > {output}"

SnakeMake BEDTools From line 138 of rules/annotation.smk

shell:
    "python {params.script} --sample-name {wildcards.sample} --primer-snp-bed {input.primer_snps} --amplicon-depth-tsv {input.amplicon_depth} > {output}"

SnakeMake From line 149 of rules/annotation.smk

shell:
    "cat {input} | awk 'NR == 1 || $0 !~ /primer_name/' > {output}"

SnakeMake From line 157 of rules/annotation.smk

shell:
    "python {params.script} --input-tsv {input} > {output}"

SnakeMake From line 167 of rules/annotation.smk

shell:
    "{params.script} --primers {input.primers} --offset {params.offset} --bed_type {params.bed_type_opt} --output {output} --primer_prefix {params.primer_prefix}"

SnakeMake From line 259 of rules/common.smk

shell:
    "{params.script} --primers {input.primers} --bed_type full --output {output} --primer_prefix {params.primer_prefix}"

SnakeMake From line 270 of rules/common.smk

shell:
    "{params.exec} {input}"

SnakeMake From line 280 of rules/common.smk

shell:
    "cat {input} | awk '{{ print $1 \"\t0\t\" $2 }}' > {output}"

SnakeMake From line 290 of rules/common.smk

shell:
    "python {params.script} --primer_prefix {params.primer_prefix} {input.bed} > {output}"

SnakeMake From line 38 of rules/report.smk

shell:
    "python {params.script} --fpileup {input.fpileups} --alleles {input.alleles} > {output}"

SnakeMake From line 50 of rules/report.smk

shell:
    "python {params.script} --alleles {input.alleles} --min-count 3 > {output}"

SnakeMake From line 61 of rules/report.smk

shell:
    "python {params.script} --run-name {params.run_name} --voc-lineages {params.voc_lineages} {params.platform_opt} > {output}"

SnakeMake From line 75 of rules/report.smk

shell:
    "echo -e \"reference_name\tstart\tend\tamplicon_id\tpool\tstrand\tread_count\tcovered_bases\tamplicon_length\tfraction_covered\" > {output};"
    "bedtools coverage -a {input.bed} -b {input.bam} >> {output}"

SnakeMake BEDTools From line 21 of rules/sequencing.smk

shell:
    "echo -e \"reference_name\tstart\tend\tamplicon_id\tpool\tstrand\tmean_depth\" > {output};"
    "bedtools coverage -mean -a {input.bed} -b {input.bam} >> {output}"

SnakeMake BEDTools From line 36 of rules/sequencing.smk

shell:
    "echo -e \"reference_name\tstart\tend\tamplicon_id\tpool\tstrand\tposition\tdepth\" > {output};"
    "bedtools coverage -d -a {input.bed} -b {input.bam} >> {output}"

SnakeMake BEDTools From line 51 of rules/sequencing.smk

shell:
    "echo -e \"reference_name\tstart\tend\tposition\tdepth\" > {output};"
    "bedtools coverage -d -a {input.bed} -b {input.bam} >> {output}"

SnakeMake BEDTools From line 65 of rules/sequencing.smk

shell:
    "ln -s \"$(readlink -f {input})\" {output}"

SnakeMake From line 79 of rules/sequencing.smk

shell:
    "samtools index {input}"

SnakeMake SAMtools From line 88 of rules/sequencing.smk

shell:
    "python {params.pileup_script} --bam {input.bam} --reference {input.reference} > {output}"

SnakeMake From line 101 of rules/sequencing.smk

shell:
    'echo {input} | tr " " "\\n" > {output}'

SnakeMake From line 110 of rules/sequencing.smk

shell:
    "Rscript {params.plot_script} --path qc_sequencing --output {output.plot} --table {output.table}"

SnakeMake From line 126 of rules/sequencing.smk

shell:
    "Rscript {params.plot_script} -t depth_by_position -o {output} {params.metadata_opt}"

SnakeMake From line 145 of rules/sequencing.smk

shell:
    "Rscript {params.plot_script} -t negative_control_depth_by_position -o {output} {params.metadata_opt}"

SnakeMake From line 157 of rules/sequencing.smk

shell:
    "Rscript {params.plot_script} -t amplicon_depth_by_ct -o {output} -m {input.metadata}"

SnakeMake From line 169 of rules/sequencing.smk

shell:
    "Rscript {params.plot_script} -t amplicon_covered_fraction -o {output}"

SnakeMake From line 180 of rules/sequencing.smk

shell:
    "Rscript {params.plot_script} -t genome_completeness_by_ct -o {output} -m {input.metadata}"

SnakeMake From line 192 of rules/sequencing.smk

ShowHide 42 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/jts/ncov-tools

Name: ncov-tools

Version: v1.9.1

Badge:

Insert copied code into your website to add a link to this workflow.

License: MIT License

Keywords:

Sample annotation Sequencing quality control Report Augur BEDTools Pangolin SAMtools Snakemake Genomics

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free