Files and methodology pertaining to the sequencing and analysis of SARS-CoV-2, causative agent of COVID-19.
This is a complete, standardized workflow for the assembly and subsequent analysis of short-read viral sequencing data.
This core workflow is compatible with the illumina artic nf pipeline and produces consensus and variant calls using iVar (1.3) (Grubaugh, 2019) and FreeBayes (Garrison, 2012). However, it performs far more extensive quality control and visualisation of results, including an interactive HTML summary of run results.
Briefly, raw reads undergo QC using FastQC (Andrews) before removal of host-related reads by competitive mapping against a composite human and viral reference with BWA-MEM (0.7.5) (Li, 2013), samtools (1.9) (Li, 2009), and a custom script. This ensures that data as close to raw as possible can be deposited in central databases.
After this, reads undergo adapter trimming and further QC with trim-galore (0.6.5) (Martin). Residual TruSeq sequencing adapters are then removed by another custom script.
Reads are then mapped to the viral reference with BWA-MEM, and amplicon primer sequences are trimmed using iVar (1.3) (Grubaugh, 2019). FastQC is then used to perform a QC check on the reads that map to the viral reference.
After this, iVar is used to generate a consensus genome, and variants are called using both ivar variants and breseq (0.35) (Deatherage, 2014). Additionally, FreeBayes may be run alongside iVar, which will generate a second set of consensus genomes and variant calls, with comparisons made between iVar and FreeBayes to highlight differences in mutation calls.
Coverage statistics are calculated using bedtools before a final QC via quast and a kraken2 taxonomic classification of mapped reads. Finally, data from all samples are collated via a post-processing script into an interactive summary for exploration of results and quality control. Optionally, users can run ncov-tools to generate additional quality control and summary plots and statistics.
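Conceptually, the per-sample core of the workflow reduces to commands along the following lines. This is a minimal sketch with placeholder file names (the host-filtering helper script name and the reference/primer file names are illustrative, not the pipeline's actual paths); the exact parameterized rule commands are listed verbatim under "Code Snippets" at the end of this page.

# Host removal by competitive mapping against a composite human + viral reference
# (filter_non_human_reads.py is a placeholder name for the custom filtering script)
bwa mem -t 4 composite_ref sample_R1.fastq.gz sample_R2.fastq.gz | filter_non_human_reads.py -c MN908947.3 > host_removed.bam
# Adapter trimming and QC
trim_galore --paired --fastqc host_removed_R1.fastq.gz host_removed_R2.fastq.gz
# Map to the viral reference and trim amplicon primer sequences
bwa mem -t 4 MN908947.3 trimmed_R1.fq.gz trimmed_R2.fq.gz | samtools sort -o viral_ref.sorted.bam
ivar trim -e -i viral_ref.mapped.bam -b primer_scheme.bed -p ivar_trimmed
# Consensus genome and variant calls
samtools mpileup -aa -A -d 0 -Q0 ivar_trimmed.sorted.bam | ivar consensus -p sample.consensus
samtools mpileup -aa -A -d 0 -Q0 ivar_trimmed.sorted.bam | ivar variants -r MN908947.3.fasta -p sample_ivar_variants
# Per-base coverage statistics
bedtools genomecov -d -ibam ivar_trimmed.sorted.bam > sample_coverage.tsv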
If you use this software please cite:
Nasir, Jalees A., Robert A. Kozak, Patryk Aftanas, Amogelang R. Raphenya, Kendrick M. Smith, Finlay Maguire, Hassaan Maan et al. "A Comparison of Whole Genome Sequencing of SARS-CoV-2 Using Amplicon-Based Sequencing, Random Hexamers, and Bait Capture." Viruses 12, no. 8 (2020): 895.
https://doi.org/10.3390/v12080895
Setup:
0. Clone the git repository (--recursive is only needed to run ncov-tools postprocessing):
git clone --recursive https://github.com/jaleezyy/covid-19-signal
1. Install conda and snakemake (version >5), e.g.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh # follow instructions
source $(conda info --base)/etc/profile.d/conda.sh
conda create -n signal -c conda-forge -c bioconda -c defaults snakemake pandas conda mamba
conda activate signal
There are some issues with conda failing to install newer versions of snakemake, so alternatively install mamba and use that (snakemake has beta support for it within the workflow):
conda install -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n signal snakemake pandas conda mamba
conda activate signal
# mamba activate signal is equivalent
Additional software dependencies are managed directly by snakemake using conda environment files (an illustrative environment file is sketched after this list):
- trim-galore 0.6.5 (docs)
- kraken2 2.1.1 (docs)
- quast 5.0.2 (docs)
- bwa 0.7.17 (docs)
- samtools 1.7/1.9 (docs)
- bedtools 2.26.0 (docs)
- breseq 0.35.0 (docs)
- ivar 1.3 (docs)
- freebayes 1.3.2 (docs)
- pangolin (latest; version can be specified by user) (docs)
- pangolin-data (latest; version can be specified by user; required for Pangolin v4+) (docs)
- pangolearn (latest; version can be specified by user) (docs)
- constellations (latest; version can be specified by user) (docs)
- scorpio (latest; version can be specified by user) (docs)
- pango-designation (latest; version can be specified by user) (docs)
- nextclade (v1.11.0) (docs)
- ncov-tools postprocessing scripts require additional dependencies (see file)
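As a minimal sketch of what one of these per-rule conda environment files looks like (the file name and version pins below are illustrative, not the exact files shipped with SIGNAL):

# conda_envs/trim_qc.yaml -- illustrative per-rule conda environment file
name: trim_qc
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - trim-galore=0.6.5
  - fastqc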
SIGNAL Help Screen:
Using the provided signalexe.py script, the majority of SIGNAL functions can be accessed easily.
To display the help screen:
python signalexe.py -h
usage: signalexe.py [-h] [-c CONFIGFILE] [-d DIRECTORY] [--cores CORES] [--config-only] [--remove-freebayes] [--add-breseq] [-neg NEG_PREFIX] [--dependencies] [--data DATA] [-ri] [-ii] [--unlock]
[-F] [-n] [--quiet] [--verbose] [-v]
[all ...] [postprocess ...] [ncov_tools ...] [install ...]
SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL) aims to take Illumina short-read sequences and perform consensus assembly + variant calling for ongoing surveillance and research efforts towards
the emergent coronavirus: Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2).
positional arguments:
all Run SIGNAL with all associated assembly rules. Does not include postprocessing '--configfile' or '--directory' required. The latter will automatically generate a
configuration file and sample table. If both provided, then '--configfile' will take priority
postprocess Run SIGNAL postprocessing on completed SIGNAL run. '--configfile' is required but will be generated if '--directory' is provided
ncov_tools Generate configuration file and filesystem setup required and then execute ncov-tools quality control assessment. Requires 'ncov-tools' submodule! '--configfile' is required
but will be generated if '--directory' is provided
install Install individual rule environments and ensure SIGNAL is functional. The only parameter operable will be '--data'. Will override other operations!
optional arguments:
-h, --help show this help message and exit
-c CONFIGFILE, --configfile CONFIGFILE
Configuration file (i.e., config.yaml) for SIGNAL analysis
-d DIRECTORY, --directory DIRECTORY
Path to directory containing reads. Will be used to generate sample table and configuration file
--cores CORES Number of cores. Default = 1
--config-only Generate sample table and configuration file (i.e., config.yaml) and exit. '--directory' required
--remove-freebayes Configuration file generator parameter. Set flag to DISABLE freebayes variant calling (improves overall speed)
--add-breseq Configuration file generator parameter. Set flag to ENABLE optional breseq step (will take more time for analysis to complete)
-neg NEG_PREFIX, --neg-prefix NEG_PREFIX
Configuration file generator parameter. Comma-separated list of negative control sample name(s) or prefix(es). For example, 'Blank' will cover Blank1, Blank2, etc.
Recommended if running ncov-tools. Will be left empty, if not provided
--dependencies Download data dependencies (under a created 'data' directory) required for SIGNAL analysis and exit. Note: Will override other parameters! (~10 GB storage required)
--data DATA SIGNAL install and data dependencies parameter. Set location for data dependancies. If '--dependancies' is run, a folder will be created in the specified directory. If '--
config-only' or '--directory' is used, the value will be applied to the configuration file. (Upcoming feature): When used with 'SIGNAL install', any tests run will use the
dependencies located at this directory. Default = 'data'
-ri, --rerun-incomplete
Snakemake parameter. Re-run any incomplete samples from a previously failed run
-ii, --ignore-incomplete
Snakemake parameter. Do not check for incomplete output files
--unlock Snakemake parameter. Remove a lock on the working directory after a failed run
-F, --forceall Snakemake parameter. Force the re-run of all rules regardless of prior output
-n, --dry-run Snakemake parameter. Do not execute anything and only display what would be done
--quiet Snakemake parameter. Do not output any progress or rule information. If used with '--dry-run`, it will just display a summary of the DAG of jobs
--verbose Snakemake parameter. Display snakemake debugging output
-v, --version Display version number
Summary:
signalexe.py simplifies the execution of all functions of SIGNAL. At its simplest, SIGNAL can be run with one line, provided only the directory of sequencing reads.
# Download dependencies (only needs to be run once; ~10GB of storage required)
# --data flag allows you to rename and relocate dependencies directory
python signalexe.py --data data --dependencies
# Generate configuration file and sample table
# --neg-prefix can be used to note negative controls
# --data can be used to specify location of data dependencies
python signalexe.py --config-only --directory /path/to/reads
# Execute pipeline (step-by-step; --cores defaults to 1 if not provided)
# --data can be used to specify location of data dependencies
python signalexe.py --configfile config.yaml --cores NCORES all
python signalexe.py --configfile config.yaml --cores NCORES postprocess
python signalexe.py --configfile config.yaml --cores NCORES ncov_tools
# ALTERNATIVE
# Execute pipeline (one line)
# --data can be used to specify location of data dependencies
python signalexe.py --configfile config.yaml --cores NCORES all postprocess ncov_tools
# ALTERNATIVE
# Execute pipeline (one line; no prior configuration file or sample table steps)
# --directory can be used in place of --configfile to automatically generate a configuration file
# --data can be used to specify location of data dependencies
python signalexe.py --directory /path/to/reads --cores NCORES all postprocess ncov_tools
Each of the steps in SIGNAL can be run manually by accessing the individual scripts or running snakemake.
# Download dependencies (only needs to be run once; ~10GB of storage required)
bash scripts/get_data_dependencies.sh -d data -a MN908947.3
# Generate sample table
# Modify existing 'example_config.yaml' for your configuration file
bash scripts/generate_sample_table.sh -d /path/to/reads -n sample_table.csv
# Execute pipeline (step-by-step)
snakemake -kp --configfile config.yaml --cores NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda all
snakemake -kp --configfile config.yaml --cores NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda postprocess
snakemake -kp --configfile config.yaml --cores NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda ncov_tools
Detailed setup and execution:
1. Download necessary database files:
The pipeline requires:
-
Amplicon primer scheme sequences
-
SARS-CoV2 reference fasta
-
SARS-CoV2 reference gbk
-
SARS-CoV2 reference gff3
-
kraken2 viral database
-
Human GRCh38 reference fasta (for composite human-viral BWA index)
python signalexe.py --dependencies
# Defaults to a directory called `data` in the repository root
# --data can be used to rename and relocate the resultant directory

# OR

bash scripts/get_data_dependencies.sh -d data -a MN908947.3
# -d allows you to rename and relocate the resultant directory
Note: Downloading the database files requires ~10GB of storage, with up to ~35GB required for all temporary downloads!
1.5. Prepare per-rule conda environments (optional, but recommended):
SIGNAL uses controlled conda environments for individual steps in the workflow. These are generally produced upon first execution of SIGNAL with input data; however, an option to install the per-rule environments is available through the
signalexe.py
script.
python signalexe.py install
# Will install per-rule environments
# Later versions of SIGNAL will include a testing module with curated data to ensure function
2. Generate configuration file:
You can use the --config-only flag to generate both config.yaml and sample_table.csv. The directory provided will be used to auto-generate a name for the run.
python signalexe.py --config-only --directory /path/to/reads
# Outputs: 'reads_config.yaml' and 'reads_sample_table.csv'
# --data can be used to specify the location of data dependencies
You can also create the configuration file by modifying example_config.yaml to suit your system.
Note: Regardless of method, double-check your configuration file to ensure the information is correct!
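For orientation, a generated configuration file contains entries along the following lines. This is an abridged, illustrative excerpt only: result_dir, run_freebayes, and phylo_include_seqs are keys referenced elsewhere in this README, while the sample-table key and all values shown are placeholders. Consult example_config.yaml for the authoritative set of keys and defaults.

# Illustrative excerpt of a SIGNAL config.yaml (not a complete file)
samples: "reads_sample_table.csv"       # path to the generated sample table (placeholder key and value)
result_dir: "reads_results_dir"         # all results are written under this directory
run_freebayes: True                     # run FreeBayes in addition to iVar
phylo_include_seqs: "data/blank.fasta"  # extra sequences for the ncov-tools tree (placeholder path)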
3. Specify your samples in CSV format:
See the example table example_sample_table.csv for an idea of how to organise this table.
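As a hedged illustration (verify the exact column names against example_sample_table.csv), a minimal table pairs each sample name with its forward and reverse read files:

sample,r1_path,r2_path
sampleA,/path/to/reads/sampleA_L001_R1.fastq.gz,/path/to/reads/sampleA_L001_R2.fastq.gz
Blank1,/path/to/reads/Blank1_L001_R1.fastq.gz,/path/to/reads/Blank1_L001_R2.fastq.gz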
Using the --config-only flag, both the configuration file and sample table will be generated (see step 2 above) from a given directory path to reads.
Alternatively, you can attempt to use generate_sample_table.sh to circumvent manual creation of the table.
bash scripts/generate_sample_table.sh
Output:
You must specify a data directory containing fastq(.gz) reads.
ASSUMES FASTQ FILES ARE NAMED AS <sample_name>_L00#_R{1,2}*.fastq(.gz)
Flags:
-d : Path to directory containing sample fastq(.gz) files (Absolute paths preferred for consistency, but can use relative paths)
-n : Name or file path for final sample table (with extension) (default: 'sample_table.csv') - will overwrite if file exists
-e : Name or file path for an existing sample table - will append to the end of the provided table
Select one of '-n' (new sample table) or '-e' (existing sample table).
If neither provided, a new sample table called 'sample_table.csv' will be created (or overwritten) by default.
General usage:
# Create new sample table 'sample_table.csv' given path to reads directory
bash scripts/generate_sample_table.sh -d /path/to/reads -n sample_table.csv
# Append to existing sample table 'sample_table.csv' given path to a directory with additional reads
bash scripts/generate_sample_table.sh -d /path/to/more/reads -e sample_table.csv
4. Execute pipeline:
For the main signalexe.py script, positional arguments inform the rules of the pipeline to execute, with flags supplementing input parameters.
The main rules of the pipeline are as follows:
- all = Sequencing pipeline, i.e., take a set of paired reads, perform reference-based assembly to generate a consensus, run lineage assignment, etc.
- postprocess = Summarize the key results, including pangolin lineage, specific mutations, etc., after running all
- ncov_tools = Create the required conda environment, generate the necessary configuration file, and link needed result files within the ncov-tools directory. ncov-tools will then be executed, with output found within the SIGNAL directory.
The generated configuration file from the above steps can be used as input. To run the general pipeline:
python signalexe.py --configfile config.yaml --cores 4 all
is equivalent to running
snakemake -kp --configfile config.yaml --cores 4 --use-conda --conda-prefix=$PWD/.snakemake/conda all
You can run the snakemake command as written above, but note that if --conda-prefix is not set as shown (i.e., $PWD/.snakemake/conda), then all conda environments will be reinstalled each time you change the results_dir in config.yaml.
Alternatively, you can skip the above configuration and sample table generation steps by simply providing the directory of reads to the main script (see step 2):
python signalexe.py --directory /path/to/reads --cores 4 all
A configuration file and sample table will automatically be generated prior to running SIGNAL all.
FreeBayes variant calling and BreSeq mutational analysis are technically optional tools within the workflow. When using the --directory flag, FreeBayes will run by default and BreSeq will not. These defaults can be changed using the --remove-freebayes and --add-breseq flags, respectively, for example:
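# Disable FreeBayes and enable BreSeq when auto-generating the configuration file
python signalexe.py --directory /path/to/reads --cores 4 --remove-freebayes --add-breseq all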
5. Postprocessing analyses:
As with the general pipeline, the generated configuration file from the above steps can be used as input. To run postprocess, which summarizes the SIGNAL results:
python signalexe.py --configfile config.yaml --cores 1 postprocess
is equivalent to running
snakemake -kp --configfile config.yaml --cores 1 --use-conda --conda-prefix=$PWD/.snakemake/conda postprocess
After postprocessing finishes, you'll see the following summary files:
- summary.html top-level summary, with links to per-sample summaries
- {sample_name}/sample.html per-sample summaries, with links for more detailed info
- {sample_name}/sample.txt per-sample summaries, in text-file format instead of HTML
- summary.zip zip archive containing all of the above summary files.
Note that the pipeline postprocessing (snakemake postprocess) is separated from the rest of the pipeline (snakemake all). This is because in a multi-sample run, it's likely that at least one pipeline stage will fail. The postprocessing script should handle failed pipeline stages gracefully, by substituting placeholder values when expected pipeline output files are absent. However, this confuses snakemake's dependency tracking, so there seems to be no good alternative to separating pipeline processing and postprocessing into all and postprocess targets.
Related: because pipeline stages can fail, we run (and recommend running, if using the snakemake command to run SIGNAL) snakemake all with the -k flag ("Go on with independent jobs if a job fails").
Additionally, SIGNAL can prepare output and execute @jts' ncov-tools to generate phylogenies and alternative summaries.
python signalexe.py --configfile config.yaml --cores 1 ncov_tools
is equivalent to running
snakemake -kp --configfile config.yaml --cores 1 --use-conda --conda-prefix=$PWD/.snakemake/conda ncov_tools
SIGNAL manages installing the dependencies (within the conda_prefix) and will generate the necessary hard links to required input files from SIGNAL for ncov-tools, provided it has been cloned as a submodule (if not found, the script will attempt to pull the submodule) and a FASTA containing sequences to include in the tree has been specified using phylo_include_seqs: in the main SIGNAL config.yaml. If run_freebayes is set to True, then SIGNAL will attempt to link the FreeBayes consensus FASTA and variant files, if found; otherwise, the corresponding iVar files will be used instead. SIGNAL will then execute ncov-tools, and the output will be found within the SIGNAL results directory, specified in SIGNAL's configuration file, under ncov-tools-results. Refer to the ncov-tools documentation for information regarding specific output.
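For reference, the config.yaml that SIGNAL writes into the ncov-tools directory contains entries like the following. This is abridged and illustrative: the key names come from the ncov-tools setup script shown under "Code Snippets" below, while the values here are placeholders.

# Abridged, illustrative excerpt of the generated ncov-tools/config.yaml
data_root: '/path/to/covid-19-signal/ncov-tools/results_dir'
run_name: 'results_dir'
platform: illumina
bed_type: unique_amplicons
offset: 0
completeness_threshold: 0.9
bam_pattern: '{data_root}/{sample}.bam'
consensus_pattern: '{data_root}/{sample}.consensus.fasta'
variants_pattern: '{data_root}/{sample}.variants.tsv'
assign_lineages: true
negative_control_samples: ['Blank1']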
Multiple operations:
Using signalexe.py positional arguments, you can specify SIGNAL to perform multiple rules in succession.
python signalexe.py --configfile config.yaml --cores NCORES all postprocess ncov_tools
In the above command, SIGNAL all, postprocess, and ncov_tools will run using the provided configuration file as input, which links to a sample table.
Note: Regardless of the order of positional arguments, or the placement of other parameter flags, SIGNAL will always run in the set priority order: all > postprocess > ncov_tools!
Note: If install is provided as input, it will override all other positional arguments!
If no configuration file or sample table was generated for a run, you can provide --directory with the path to sequencing reads and SIGNAL will auto-generate both required inputs prior to running any rules.
python signalexe.py --directory /path/to/reads --cores NCORES all postprocess ncov_tools
Overall, this simplifies executing SIGNAL to one line!
Docker:
Alternatively, the pipeline can be deployed using Docker (see resources/Dockerfile_pipeline for the specification).
To pull from dockerhub:
docker pull finlaymaguire/signal
Download data dependencies into a data directory that already contains your reads (data in this example, but use whatever name you wish):
mkdir -p data && docker run -v $PWD/data:/data finlaymaguire/signal:latest bash scripts/get_data_dependencies.sh -d /data
Generate your config.yaml and sample_table.csv (with paths to the readsets underneath /data) and place them into the data directory:
cp config.yaml sample_table.csv $PWD/data
WARNING: result_dir in config.yaml must be within /data (e.g., /data/results) for the results to be automatically copied to your host system. Otherwise, they will be automatically deleted when the container finishes running (unless Docker is run interactively).
Then execute the pipeline:
docker run -v $PWD/data:/data finlaymaguire/signal conda run -n snakemake snakemake --configfile /data/config.yaml --use-conda --conda-prefix /covid-19-signal/.snakemake/conda --cores 8 all
Data Summaries:
- postprocess and ncov_tools, as described above, generate many summaries, including interactive HTML reports
- To generate summaries of BreSeq results across many samples, see how to summarize BreSeq results using gdtools (a sketch follows below)
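A minimal sketch of such a summary, assuming per-sample breseq output directories (all paths, the reference file name, and the output format handling are placeholders; consult the gdtools documentation bundled with breseq for the exact options):

# Annotate and merge GenomeDiff output from several breseq runs into one comparison table
gdtools ANNOTATE -o breseq_summary.html -r MN908947.3.gbk \
    sampleA/breseq_output/output/output.gd sampleB/breseq_output/output/output.gd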
Convenient extraction script:
SIGNAL produces several output files and directories on its own alongside the output for ncov-tools. Select files from the output can be copied or transferred for easier parsing using a provided convenience bash script:
bash scripts/get_signal_results.sh
Usage:
bash get_signal_results.sh -s <SIGNAL_results_dir> -d <destination_dir> [-m] [-c]
This scripts aims to copy (rsync by default, or cp) or move (mv) select output from SIGNAL 'all', 'postprocess', and 'ncov_tools'.
The following files will be transferred over to the specified destination directory (if found):
SIGNAL 'all' & 'postprocess':
-> signal-results/<sample>/<sample>_sample.txt
-> signal-results/<sample>/core/<sample>.consensus.fa
-> signal-results/<sample>/core/<sample>_ivar_variants.tsv
-> signal-results/<sample>/freebayes/<sample>.consensus.fasta
-> signal-results/<sample>/freebayes/<sample>.variants.norm.vcf
SIGNAL 'ncov_tools':
-> ncov_tools-results/qc_annotation/<sample>.ann.vcf
-> ncov-tools-results/qc_reports/<run_name>_ambiguous_position_report.tsv
-> ncov-tools-results/qc_reports/<run_name>_mixture_report.tsv
-> ncov-tools-results/qc_reports/<run_name>_ncov_watch_variants.tsv
-> ncov-tools-results/qc_reports/<run_name>_negative_control_report.tsv
-> ncov-tools-results/qc_reports/<run_name>_summary_qc.tsv
Flags:
-s : SIGNAL results directory
-d : Directory where summary will be outputted
-m : Invoke 'mv' move command instead of 'rsync' copying of results. Optional
-c : Invoke 'cp' copy command instead of 'rsync' copying of results. Optional
The script uses rsync to produce accurate copies of select output files, organized into signal-results and ncov-tools-results within a provided destination directory (which must already exist). If -c is provided, cp will be used instead of rsync to produce the copies. Similarly, if -m is provided, mv will be used instead (WARNING: any interruption during mv could result in data loss). For example:
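# Copy (rsync) select SIGNAL 'all', 'postprocess', and 'ncov_tools' output into an existing directory
bash scripts/get_signal_results.sh -s /path/to/signal_results -d /path/to/summary
# Same, but move the files instead of copying them (use with care)
bash scripts/get_signal_results.sh -s /path/to/signal_results -d /path/to/summary -m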
Pipeline details:
For a step-by-step walkthrough of the pipeline, see pipeline/README.md.
A diagram of the workflow is shown below (update pending).
Code Snippets
Lines 3-264:
import os, sys
import shutil
import subprocess
import fileinput
import glob

def link_ivar(root, neg, failed, replace=False):
    print("Linking iVar files to ncov-tools!")
    for variants in snakemake.input['variants']:
        sample = variants.split('/')[0]
        if (sample in failed) and (sample not in neg):
            continue
        ln_path = f"{root}/{sample}.variants.tsv"
        try:
            if (not os.path.exists(ln_path)) or (replace is True):
                # remove equivalent link for Freebayes VCF (namely for replace=True)
                if os.path.exists(f"{root}/{sample}.variants.vcf"):
                    os.remove(f"{root}/{sample}.variants.vcf")
                os.link(variants, ln_path)
        except FileExistsError: # equal link ~ error
            if replace:
                os.remove(ln_path)
                os.link(variants, ln_path)
            else:
                pass
    for consensus in snakemake.input['consensus']:
        sample = consensus.split('/')[0]
        if (sample in failed) and (sample not in neg):
            continue
        ln_path = f"{root}/{sample}.consensus.fasta"
        try:
            if (not os.path.exists(ln_path)) or (replace is True):
                os.link(consensus, ln_path)
        except FileExistsError: # more likely given same link path
            if replace:
                os.remove(ln_path)
                os.link(variants, ln_path)
            else:
                pass
        for line in fileinput.input(ln_path, inplace=True):
            if line.startswith(">"):
                new_header = str(">"+sample)
                new_line = line.replace(line, new_header)
                print(new_line, end='\n')
            else:
                print(line, end='\n')

# take sample name from iVar results, redirect to where corresponding FreeBayes should be
# if FreeBayes file cannot be found, break from loop, replace all with iVar
def link_freebayes(root, neg, failed):
    print("Linking FreeBayes files to ncov-tools!")
    for variants in snakemake.input['variants']:
        sample = variants.split('/')[0]
        if (sample in failed) and (sample not in neg):
            continue
        expected_path = os.path.join(sample, 'freebayes', sample+'.variants.norm.vcf')
        if not os.path.exists(expected_path):
            print("Missing FreeBayes variant file! Switching to iVar input!")
            link_ivar(root, neg, failed, replace=True)
            vcf_method = False
            break
        else:
            vcf_method = True
            ln_path = f"{root}/{sample}.variants.vcf"
            # redundant check however, it may play a roll in re-runs
            if os.path.exists(f"{root}/{sample}.variants.tsv"):
                os.remove(f"{root}/{sample}.variants.tsv")
            if not os.path.exists(ln_path):
                os.link(expected_path, ln_path)
    if vcf_method:
        for consensus in snakemake.input['consensus']:
            sample = consensus.split('/')[0]
            if (sample in failed) and (sample not in neg):
                continue
            expected_path = os.path.join(sample, 'freebayes', sample+'.consensus.fasta')
            if not os.path.exists(expected_path):
                print("Missing FreeBayes variant file! Switching to iVar input!")
                link_ivar(root, neg, failed, replace=True)
                break
            else:
                ln_path = f"{root}/{sample}.consensus.fasta"
                if not os.path.exists(ln_path):
                    os.link(expected_path, ln_path)
                for line in fileinput.input(ln_path, inplace=True):
                    if line.startswith(">"):
                        new_header = str(">"+sample)
                        new_line = line.replace(line, new_header)
                        print(new_line, end='\n')
                    else:
                        print(line, end='\n')
    return vcf_method

def set_up():
    print("Writing config.yaml for ncov-tools to ncov-tools/config.yaml")
    #vcf_variant = True # set default
    exec_dir = snakemake.params['exec_dir']
    result_dir = os.path.basename(snakemake.params['result_dir']) # basename of SIGNAL result directory

    ### Pull pangolin version number (i.e., version '3' or '4')
    pangolin = str(snakemake.params['pangolin']).split(".")[0].lower().strip("v")
    try:
        assert pangolin == "3" or pangolin == "4" # directly supported versions
    except AssertionError:
        # import urllib.request as web
        # commit_url = web.urlopen(f"https://github.com/cov-lineages/pangolin/releases/latest").geturl()
        # pangolin = commit_url.split("/")[-1].split(".")[0].lower().strip("v") # latest version (should ensure temporary compatibility)
        installed_versions = subprocess.run(["pangolin", "--all-versions"], check=True, stdout=subprocess.PIPE)
        installed_versions = installed_versions.stdout.decode('utf-8')
        installed_ver_dict = {}
        for dep_ver in map(str.strip, installed_versions.split('\n')): # skip empty line at end
            if len(dep_ver) == 0:
                continue
            try:
                dependency, version = dep_ver.split(': ')
            except ValueError:
                continue
            if dependency == 'pangolin':
                pangolin = str(version).split(".",1)[0].strip('v')
                break

    ### Create data directory within ncov-tools
    data_root = os.path.abspath(os.path.join(exec_dir, 'ncov-tools', "%s" %(result_dir)))
    if os.path.exists(data_root):
        shutil.rmtree(data_root)
    os.mkdir(data_root)

    ### Pull negative samples (based on common identifiers)
    #neg_names = ("Negative", "NEG", "PCR-NEG", "UTM", "BLANK", "Blank", "blank")
    neg_names = tuple(snakemake.params['negative_control_prefix'])
    neg_samples = set()
    with open(snakemake.params['sample_csv_filename']) as fh:
        fh.readline()
        for line in fh:
            id = line.split(",")[0]
            if id.startswith(neg_names):
                neg_samples.add(id)
    neg_list = list(neg_samples)
    print("Negative control samples found include: %s" %(neg_list))

    ### Pull failed samples (SIGNAL log file: failed_samples.log)
    if os.path.exists(snakemake.params['failed']):
        with open(snakemake.params['failed']) as fail:
            failed_list = [i.strip() for i in fail.readlines()[1:]]
    else:
        failed_list = []
    print("Failed samples found include: %s" %(failed_list))

    ### config.yaml parameters
    config = {'data_root': f"'{data_root}'",
              'run_name': f"'{result_dir}'", # name ncov-tools output files with name of SIGNAL results directory (default: "default")
              'amplicon_bed': f"'{snakemake.params['amplicon_bed']}'", #grab from signal snakemake config
              'reference_genome': f"'{snakemake.params['viral_reference_genome']}'", #grab from signal snakemake config
              'platform': 'illumina',
              'primer_bed': f"'{snakemake.params['primer_bed']}'",
              'bed_type': "unique_amplicons",
              'offset': 0,
              'completeness_threshold': 0.9,
              'bam_pattern': "'{data_root}/{sample}.bam'", # symlink files following this
              'primer_trimmed_bam_pattern': "'{data_root}/{sample}.mapped.primertrimmed.sorted.bam'",
              'consensus_pattern': "'{data_root}/{sample}.consensus.fasta'", # symlink files following this
              'variants_pattern': "'{data_root}/{sample}.variants.tsv'",
              'assign_lineages': 'true',
              'tree_include_consensus': f"'{snakemake.params['phylo_include_seqs']}'",
              'negative_control_samples': f"{neg_list}",
              'mutation_set': 'spike_mutations',
              'output_directory': f"{result_dir}_ncovresults",
              'pangolin_version': f"'{pangolin}'",
              'pango_analysis_mode': f"'{snakemake.params['mode']}'"}

    print("Linking alignment BAMs to ncov-tools!")
    for bam in snakemake.input['bams']:
        sample = bam.split('/')[0]
        # if sample failed and not a negative, skip linking
        if (sample in failed_list) and (sample not in neg_list):
            continue
        ln_path = f"{data_root}/{sample}.bam"
        if not os.path.exists(ln_path):
            os.link(bam, ln_path)
    for primer_trimmed_bam in snakemake.input['primertrimmed_bams']:
        sample = primer_trimmed_bam.split('/')[0]
        if (sample in failed_list) and (sample not in neg_list):
            continue
        ln_path = f"{data_root}/{sample}.mapped.primertrimmed.sorted.bam"
        if not os.path.exists(ln_path):
            os.link(primer_trimmed_bam, ln_path)

    if snakemake.params['freebayes_run']:
        vcf_variant = link_freebayes(data_root, neg_list, failed_list)
        if vcf_variant:
            config['variants_pattern'] = "'{data_root}/{sample}.variants.vcf'"
        else:
            pass # keep as TSV
    else:
        link_ivar(data_root, neg_list, failed_list, replace=False)

    with open(os.path.join(exec_dir, 'ncov-tools', 'config.yaml'), 'w') as fh:
        for key, value in config.items():
            fh.write(f"{key}: {value}\n")

    return exec_dir, result_dir, data_root

def run_all():
    os.system(f"snakemake -s workflow/Snakefile --cores {snakemake.threads} all")

def move(cwd, dest, prefix):
    if os.path.exists(os.path.join(cwd, "ncov-tools", "plots")):
        extensions = ["_amplicon_coverage_heatmap.pdf", "_amplicon_covered_fraction.pdf", "_depth_by_position.pdf", "_tree_snps.pdf"]
        for ext in extensions:
            for file in glob.glob(os.path.join(cwd, "ncov-tools", "plots", "default"+ext)):
                try:
                    shutil.copy(file, os.path.join(dest, "plots", prefix+ext))
                except IOError:
                    print("Missing file %s" %(prefix+ext))
    else:
        print("Missing ncov-tools 'plots' directory")
    if os.path.exists(os.path.join(cwd, "ncov-tools", "lineages")):
        for file in glob.glob(os.path.join(cwd, "ncov-tools", "lineages", "default_lineage_report.csv")):
            try:
                shutil.copy(file, os.path.join(dest, "lineages", prefix+"_lineage_report.csv"))
            except IOError:
                print("Missing file %s_lineage_report.csv" %(prefix))
    else:
        print("Missing ncov-tools 'lineages' directory")
    if os.path.exists(os.path.join(cwd, "ncov-tools", "qc_analysis")):
        extensions = ["_aligned-delim.fasta.log", "_aligned-delim.iqtree.log", "_aligned.fasta", "_aligned.fasta.insertions.csv", \
                      "_aligned.fasta.log", "_alleles.tsv", "_tree.nwk", "_tree_raw.nwk", "_consensus.fasta"]
        for ext in extensions:
            for file in glob.glob(os.path.join(cwd, "ncov-tools", "qc_analysis", "default"+ext)):
                try:
                    shutil.copy(file, os.path.join(dest, "qc_analysis", prefix+ext))
                except IOError:
                    print("Missing file %s" %(prefix+ext))
    else:
        print("Missing ncov-tools 'qc_analysis' directory")

if __name__ == '__main__':
    exec_dir, result_dir, data_root = set_up()
    run_script = os.path.join(exec_dir, 'scripts', 'run_ncov_tools.sh')
    #print("Don't forget to update the config.yaml file as needed prior to running ncov-tools.")
    print("Running ncov-tools using %s cores!" %(snakemake.threads))
    subprocess.run([run_script, '-c', str(snakemake.threads), '-s', str(result_dir)])
    # clean up
    shutil.rmtree(data_root)
    #run_all()
    #move(exec_dir, result_root, result_dir)
Lines 226-227:
shell:
    '{params.postprocess_script_path} {params.sample_csv_filename}'

Line 252:
script: "scripts/ncov-tools.py"

Lines 264-268:
shell:
    """
    cp {input.origin_config} {output.config}
    cp {input.origin_sample_table} {output.sample_table}
    """

Lines 278-279:
shell:
    'ln -s {input} {output}'

Lines 289-290:
shell:
    'if [ $(echo {input} | wc -w) -gt 1 ]; then zcat -f {input} | paste - - - - | sort -k1,1 -t " " | tr "\\t" "\\n" | gzip > {output}; else ln -s {input} {output}; fi'

Lines 307-310:
shell:
    """
    fastqc -o {params.output_prefix} {input} 2> {log}
    """

Lines 331-334:
shell:
    '(bwa mem -t {threads} {params.composite_index} '
    '{input.raw_r1} {input.raw_r2} | '
    "{params.script_path} -c {params.viral_contig_name} > {output}) 2> {log} || echo '' > {output}"

Lines 350-354:
shell:
    """
    samtools view -b {input} | samtools sort -n -@{threads} > {output.bam} 2> {log}
    samtools fastq -1 {output.r1} -2 {output.r2} -s {output.s} {output.bam} 2>> {log}
    """

Lines 379-382:
shell:
    'trim_galore --quality {params.min_qual} --length {params.min_len} '
    ' -o {params.output_prefix} --cores {threads} --fastqc '
    "--paired {input.raw_r1} {input.raw_r2} 2> {log} || (echo -e 'Total reads processed: 0\nReads written (passing filters): 0 (0.0%)\nTotal basepairs processed: 0 bp\nTotal written (filtered): 0 bp (0.0%)' >> {log}; touch {output})"

Lines 397-400:
shell:
    """
    python {params.script_path} --input_R1 {input.r1} --input_R2 {input.r2}
    """

Lines 415-416:
shell:
    'bwa index -p {params.output_prefix} {input} >{log} 2>&1'

Lines 435-438:
shell:
    '(bwa mem -t {threads} {params.ref_prefix} '
    '{input.r1} {input.r2} | '
    'samtools view -bS | samtools sort -@{threads} -o {output}) 2> {log}'

Lines 460-468:
shell:
    'samtools view -F4 -o {output.mapped_bam} {input}; '
    'samtools index {output.mapped_bam}; '
    'ivar trim -e -i {output.mapped_bam} -b {params.scheme_bed} '
    '-m {params.min_len} -q {params.min_qual} '
    '{params.primer_pairs} '
    '-p {params.ivar_output_prefix} 2> {log}; '
    'samtools sort -o {output.sorted_trimmed_mapped_bam} '
    '{output.trimmed_mapped_bam}'

Lines 485-488:
shell:
    """
    fastqc -o {params.output_prefix} {input} 2> {log}
    """

Lines 504-508:
shell:
    """
    samtools sort -n {input} -o {output.bam} 2> {log}
    samtools fastq -1 {output.r1} -2 {output.r2} -s {output.s} {output.bam} 2>> {log}
    """

Lines 526-530:
shell:
    '(samtools mpileup -aa -A -d {params.mpileup_depth} -Q0 {input} | '
    'ivar consensus -t {params.ivar_freq_threshold} '
    '-m {params.ivar_min_coverage_depth} -n N -p {params.output_prefix}) '
    '2>{log}'

Lines 543-544:
shell:
    'samtools faidx {input}'

Lines 566-572:
shell:
    '(samtools mpileup -aa -A -d 0 --reference {input.reference} -B '
    '-Q 0 {input.read_bam} | '
    'ivar variants -r {input.reference} -m {params.ivar_min_coverage_depth} '
    '-p {params.output_prefix} -q {params.ivar_min_variant_quality} '
    '-t {params.ivar_min_freq_threshold} -g {input.viral_reference_gff}) 2> {log} || '
    " (echo -e 'REGION\tPOS\tREF\tALT\tREF_DP\tREF_RV\tREF_QUAL\tALT_DP\tALT_RV\tALT_QUAL\tALT_FREQ\tTOTAL_DP\tPVAL\tPASS\tGFF_FEATURE\tREF_CODON\tREF_AA\tALT_CODON\tALT_AA' > {output}) "

Lines 592-595:
shell:
    """
    breseq --reference {params.ref} --num-processors {threads} --polymorphism-prediction --brief-html-output --output {params.outdir} {input} > {log} 2>&1 || touch {output}
    """

Lines 616-653:
shell:
    """
    mkdir -p $(dirname {params.out})
    # the sed is to fix the header until a release is made with this fix
    # https://github.com/freebayes/freebayes/pull/549
    freebayes -p 1 \
        -f {input.reference} \
        -F 0.2 \
        -C 1 \
        --pooled-continuous \
        --min-coverage {params.freebayes_min_coverage_depth} \
        --gvcf --gvcf-dont-use-chunk true {input.read_bam} | sed s/QR,Number=1,Type=Integer/QR,Number=1,Type=Float/ > {params.out}.gvcf

    # make depth mask, split variants into ambiguous/consensus
    # NB: this has to happen before bcftools norm or else the depth mask misses any bases exposed during normalization
    python {params.script_path} -d {params.freebayes_min_coverage_depth} \
        -l {params.freebayes_min_freq_threshold} \
        -u {params.freebayes_freq_threshold} \
        -m {params.out}.mask.txt \
        -v {params.out}.variants.vcf \
        -c {params.out}.consensus.vcf {params.out}.gvcf

    # normalize variant records into canonical VCF representation
    bcftools norm -f {input.reference} {params.out}.variants.vcf > {output.variants} || (echo -e '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tunknown' > {output.variants})
    bcftools norm -f {input.reference} {params.out}.consensus.vcf > {params.out}.consensus.norm.vcf

    # split the consensus sites file into a set that should be IUPAC codes and all other bases, using the ConsensusTag in the VCF
    for vt in "ambiguous" "fixed"; do
        cat {params.out}.consensus.norm.vcf | awk -v vartag=ConsensusTag=$vt '$0 ~ /^#/ || $0 ~ vartag' > {params.out}.$vt.norm.vcf
        bgzip -f {params.out}.$vt.norm.vcf
        tabix -f -p vcf {params.out}.$vt.norm.vcf.gz
    done

    # apply ambiguous variants first using IUPAC codes. this variant set cannot contain indels or the subsequent step will break
    bcftools consensus -f {input.reference} -I {params.out}.ambiguous.norm.vcf.gz > {params.out}.ambiguous.fasta

    # apply remaninng variants, including indels
    bcftools consensus -f {params.out}.ambiguous.fasta -m {params.out}.mask.txt {params.out}.fixed.norm.vcf.gz | sed s/MN908947\.3.*/{wildcards.sn}/ > {output.consensus}
    """

Lines 666-669:
shell:
    """
    python {params.script_path} -g {input.ivar} -r {input.freebayes} -o vcf > {output}
    """

Lines 681-682:
shell:
    'bedtools genomecov -d -ibam {input} > {output}'

Lines 692-693:
shell:
    "python {params.script_path} {input} {output}"

Lines 716-727:
shell:
    'cd {params.outdir} '
    '&& kraken2'
    ' --db {params.db}'
    ' --threads {threads}'
    ' --quick --unclassified-out "{params.labelled_unclassified_reads}"'
    ' --classified-out "{params.labelled_classified_reads}"'
    ' --output {params.labelled_output}'
    ' --paired --gzip-compressed'
    ' ../../{input.r1} ../../{input.r2}'
    ' --report {params.labelled_report}'
    ' 2>../../{log} && (cd ../.. && touch {output})'

Lines 752-754:
shell:
    'quast {input} -r {params.genome} -g {params.fcoords} --output-dir {params.outdir} --threads {threads} >{log} && '
    'for f in {params.unlabelled_reports}; do mv $f ${{f/report/{params.sample_name}}}; done'

Lines 773-775:
shell:
    'quast {input} -r {params.genome} -g {params.fcoords} --output-dir {params.outdir} --threads {threads} >{log} && '
    'for f in {params.unlabelled_reports}; do mv $f ${{f/report/{params.sample_name}}}; done'

Lines 782-801:
shell:
    """
    cat {input} > {output}
    sample=''
    count=''
    echo "Samples that failed to assemble:" > failed_samples.log
    while read -r line; do
        if [[ $line =~ '>' ]]; then
            sample=$(echo $line | cut -d'.' -f1 | cut -d'_' -f2)
        else
            count=$(echo $line | wc -c)
            if [[ $count -eq 1 ]]; then
                echo $sample >> failed_samples.log
            else
                continue
            fi
        fi
    done < {output}
    """

Lines 825-828:
shell:
    "echo -e 'pangolin: {params.pangolin_ver}\nconstellations: {params.constellations_ver}\nscorpio: {params.scorpio_ver}\npangolearn: {params.pangolearn_ver}\npango-designation: {params.designation_ver}\npangolin-data: {params.data_ver}' > {output.pango_ver_out} && "
    "echo -e 'nextclade: {params.nextclade_ver}\nnextclade-dataset: {params.nextclade_data}\nnextclade-include-recomb: {params.nextclade_recomb}' > {output.nextclade_ver_out} && "
    '{params.assignment_script_path} -i {input} -t {threads} -o {output.lin_out} -p {output.pango_ver_out} -n {output.nextclade_ver_out} --mode {params.analysis_mode}'

Lines 835-872:
shell:
    """
    cat {input} > {output}
    sample=''
    seq=''
    count=''
    out=''
    if [[ -f 'failed_samples.log' ]]; then
        out='.failed_freebayes_samples.tmp'
        cat failed_samples.log | sed 1,1d > $out
        echo "Samples that failed to assemble:" > failed_samples.log
    else
        out='failed_samples.log'
        echo "Samples that failed to assemble:" > $out
    fi
    while read -r line; do
        if [[ $line =~ '>' ]]; then
            if [[ $(echo $seq | wc -c) -eq 1 ]]; then # check if new seq
                count=$(echo $seq | grep -vc 'N')
                if [[ $count -eq 0 ]]; then
                    echo $sample >> $out
                fi
                sample=$(echo $line | cut -d'>' -f2) # start new seq
                seq=''
            else
                sample=$(echo $line | cut -d'>' -f2) # first seq
            fi
        else
            seq+=$line # append seq
        fi
    done < {output}
    if [[ ! $out == 'failed_samples.log' ]]; then
        sort -b -d -f $out | uniq >> failed_samples.log
        rm $out
    fi
    """

Lines 886-887:
shell:
    '{params.assignment_script_path} -i {input.consensus} -t {threads} -o {output} -p {input.p_vers} -n {input.n_vers} --mode {params.analysis_mode} --skip'
Support
- Future updates