Somatic variant calling pipeline for whole exome and whole genome sequencing (WES & WGS). The pipeline uses Mutect2 to identify variants and mostly follows GATK Best Practices. SLURM execution functionality allows the workflow to run on Stanford's Sherlock computing cluster.
Author
- Tomas Bencomo (https://tjbencomo.github.io)
Getting Help
If you encounter problems using the pipeline, find a bug, or would like to request a new feature, please file an issue.
Prerequisites
References
You'll need a reference genome. The GRCh38 (hg38) genome is available on the Broad's GATK website.

You'll also need the cache files for the Variant Effect Predictor (VEP). Follow the tutorial here to download the data files.

By default, the pipeline uses a Docker container with VEP v104 and vcf2maf for variant annotation. It is suggested you use v104 cache files. You can use your own version of VEP/vcf2maf by modifying the vep_env field in config.yaml.
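For reference, fetching the GRCh38 v104 cache might look like the sketch below. The FTP URL follows Ensembl's usual release layout but is an assumption, and the archive is large, so check the VEP tutorial linked above for the authoritative instructions.

```bash
# Sketch: download and unpack the VEP v104 GRCh38 cache (URL and paths assumed)
mkdir -p $HOME/.vep && cd $HOME/.vep
curl -O https://ftp.ensembl.org/pub/release-104/variation/indexed_vep_cache/homo_sapiens_vep_104_GRCh38.tar.gz
tar xzf homo_sapiens_vep_104_GRCh38.tar.gz
```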
Input Files
The pipeline takes paired-end (PE) FASTQ files as input. Samples can be normal-tumor pairs or tumor-only individual samples. Samples can be divided into multiple read groups (see units.csv) or supplied as one pair of FASTQ files per sample.
You'll also need a BED file with a list of target regions. For WES this is usually the capture kit regions. For WGS this can be the entire genome or a whitelist file that excludes problematic regions. A WGS calling region file is available in the GATK Resource Bundle (it will need to be converted from interval_list format to BED format).
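For the interval_list-to-BED conversion, Picard's IntervalListToBed tool (bundled with GATK4) should work; a minimal sketch with placeholder file names:

```bash
# Convert the GATK WGS calling regions interval_list to BED
# (file names here are placeholders)
gatk IntervalListToBed \
    -I wgs_calling_regions.hg38.interval_list \
    -O wgs_calling_regions.hg38.bed
```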
NOTE: If you have both WES and WGS samples to analyze, create two separate instances of the workflow and run the samples separately.
Software
Snakemake is required to run the pipeline. It is recommended that users install Singularity to take advantage of the preconfigured Docker containers for full reproducibility. If you don't want to use Singularity, you'll need to download all required software (see the workflow files) and ensure it is on your PATH.
Setup
- Clone this repository or create a new repository using this workflow as a template.
- Edit patients.csv and units.csv with the details for your analysis. See the schemas/ directory for details about each file (a hypothetical units.csv is sketched below).
- Configure config.yaml. See schemas/config.schema.yaml for info about each required field. Multiplexed samples should be differentiated with the readgroup column. Sequencing data must be paired, so both fq1 and fq2 are required.
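For illustration, a hypothetical units.csv for one patient with a normal sample and a tumor sample split across two read groups might look like this. Only the readgroup, fq1, and fq2 columns are named in this README; the other column names are assumptions based on the {patient} and {sample_type} wildcards used in the rules below, so treat schemas/ as authoritative.

```
patient,sample_type,readgroup,fq1,fq2
pt1,normal,rg1,pt1_normal_rg1_R1.fastq.gz,pt1_normal_rg1_R2.fastq.gz
pt1,tumor,rg1,pt1_tumor_rg1_R1.fastq.gz,pt1_tumor_rg1_R2.fastq.gz
pt1,tumor,rg2,pt1_tumor_rg2_R1.fastq.gz,pt1_tumor_rg2_R2.fastq.gz
```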
Usage
After finishing the setup, do a dry run from the repo's base directory (the one containing the Snakefile) to check for errors:

```bash
snakemake -n
```

Once you're ready to run the analysis, type

```bash
snakemake --use-singularity -j [cores]
```
This will run the workflow locally. It is recommended you have at least 20GB of storage available as the Singularity containers take up around 8GB and the output files can be very large depending on the number of samples and sequencing depth.
The pipeline produces two key files: mafs/variants.maf and qc/multiqc_report.html. variants.maf includes somatic variants from all samples that passed Mutect2 filtering. They have been annotated with VEP, and a single effect has been chosen by vcf2maf using the Ensembl database. Ensembl uses its canonical isoforms for effect selection. Other isoforms can be specified with the alternate_isoforms field in config.yaml. See the cBioPortal override isoforms for file formatting. multiqc_report.html includes quality metrics like coverage for the fully processed BAM files.

Individual VCF files for each sample prior to VEP annotation are found at vcfs/{patient}.vcf. VEP-annotated VCFs are found at vcfs/{patient}.vep.vcf. qc/depths.svg shows the sequencing depth distribution for normal and tumor samples.
Mutect2 Parallelism
Mutect2 can be scattered across many workers that process different regions of each sample in parallel via the num_workers setting in config.yaml. This divides the genomic_regions file into as many subregions as there are workers. Sets of intervals are split between files, and individual intervals are not broken. When using the Broad's WGS calling regions interval file, GATK does not seem to support more than 24 workers (GATK won't create more than 24 subregions). For WES data such as exome capture kits, GATK will split the regions into more subsets (I have tested up to 50 workers, but it can probably do more).
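As a sketch, the relevant config.yaml entries might look like the following; num_workers and genomic_regions are the field names referenced above, while the values and file path are purely illustrative:

```yaml
# Illustrative config.yaml fragment; values are placeholders
genomic_regions: regions/exome_targets.bed
num_workers: 24   # Broad WGS calling regions won't split past 24; WES kits can go higher
```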
Cluster Execution
If you're using a compute cluster, you can take advantage of massively parallel computation to speed up the analysis. Only SLURM clusters are currently supported, but if you work with another cluster system (SGE, etc.), Snakemake makes it relatively easy to add support for your cluster.

Follow these instructions to use SLURM execution:
- Modify cluster.wes.json if you are analyzing WES data or cluster.wgs.json for WGS data. Set the account field to your SLURM account (for Sherlock, your SUNet ID) and set the log output path; the recommended format is /path/slurm-{jobid}.out. All directories in the path must exist before launch; Snakemake will not create any directories.
- Edit run_pipeline.sh. Specify the SLURM directives and set Snakemake to use cluster.wes.json or cluster.wgs.json depending on your needs (a sketch follows these steps).
- When ready, launch the workflow:

```bash
sbatch run_pipeline.sh [path to Snakemake directory]
```
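For orientation, a minimal run_pipeline.sh might look like the sketch below. The SBATCH directives, flags, and cluster-config field names are assumptions modeled on the --immediate-submit command shown later in this README, so defer to the script shipped in the repository.

```bash
#!/bin/bash
#SBATCH --job-name=ngs-pipeline
#SBATCH --time=7-00:00:00
#SBATCH --mem=8G
# Hedged sketch, not the shipped script: adjust directives and flags as needed
cd "$1"  # path to the Snakemake directory passed to sbatch
snakemake --use-singularity \
    --cluster-config cluster.wes.json \
    --cluster 'sbatch -A {cluster.account} -t {cluster.time} --mem {cluster.mem} -p {cluster.partition} -c {cluster.ncpus} -o {cluster.out}' \
    --jobs 100
```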
If for some reason you can't leave the master snakemake process running, snakemake offers the ability to launch all jobs using --immediate-submit. This approach will submit all jobs to the queue immediately and finish the master snakemake process. The downside of this method is that temporary files are not supported, so the results directories will be very cluttered.

To use --immediate-submit, follow these steps:
- Give parseJobID.sh permission to run:

```bash
chmod +x parseJobID.sh
```

- Submit the snakemake jobs:

```bash
snakemake --cluster-config cluster.json --cluster 'sbatch $(./parseJobID.sh {dependencies}) -t {cluster.time} --mem {cluster.mem} -p {cluster.partition} -c {cluster.ncpus} -o {cluster.out}' --jobs 100 --notemp --immediate-submit
```
Deviations from Broad Pipeline
ngs-pipeline differs from the Broad's somatic variant calling pipeline in the following ways:

- By default, FilterMutectCalls is run with additional flags. This can be changed in config.yaml. The default configuration requires that all variants have at least one read from each direction to be included in the final callset.
- We provide the option for more stringent variant filtering criteria with the stringent_filtering setting in config.yaml. This is turned off by default.
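For reference, stringent_filtering is a single config.yaml flag; when enabled, the combine_mafs.py script shown in the Code Snippets section keeps only variants with t_depth >= 15 and t_alt_count >= 5:

```yaml
stringent_filtering: false  # default; set true to require t_depth >= 15 and t_alt_count >= 5
```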
Test Dataset
A small sample of chr21 reads from the Texas Cancer Research Biobank is supplied for end-to-end pipeline tests. This data is made available as open access data with minimal privacy restrictions. Please read the Conditions of Use before using the data.
Tips
FASTQ Formatting
The first line of each read (called the sequence identifier) should consist of a single string not separated by any spaces. An example of a properly formatted sequence identifier is

```
@MGILLUMINA4_74:7:1101:10000:100149
```

Sequence identifiers that consist of multiple space-separated words will cause problems because gatk captures the entire string and uses it as the read ID, but bwa only parses the first word as the read group when it writes aligned reads to BAM files. This sequence identifier would break the pipeline:

```
@MGILLUMINA4_74:7:1101:10000:100149 RG:Z:A470018/1
```

Check the sequence identifiers if you encounter an 'Aligned record iterator is behind the unmapped reads' error.
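If your FASTQs have multi-word identifiers, one workaround (a sketch, assuming gzipped input and standard 4-line FASTQ records) is to keep only the first word of each header line before running the pipeline:

```bash
# Keep only the first whitespace-separated word on header lines (every 4th line)
zcat sample_R1.fastq.gz \
    | awk '{ if (NR % 4 == 1) print $1; else print }' \
    | gzip > sample_R1.fixed.fastq.gz
```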
Citations
This pipeline is based on dna-seq-gatk-variant-calling by Johannes Köster. If you use ngs-pipeline, please use citations.md to cite the necessary software tools.
Code Snippets
```
shell:
    """
    gatk Mutect2 -R {input.ref} -I {input.tumor} \
        -tumor {params.tumor} -O {output.vcf} \
        --germline-resource {input.germ_res} \
        --f1r2-tar-gz {output.f1r2tar} \
        -L {input.interval} \
        -ip 100 \
        {params.normal_input} {params.normalname} \
        {params.pon} {params.extra}
    """
```
```
shell:
    """
    gatk LearnReadOrientationModel {params.i} -O {output}
    """
```
```
shell:
    """
    gatk GetPileupSummaries -I {input.bam} -V {input.germ_res} \
        -L {input.germ_res} -O {output}
    """
```
```
shell:
    """
    gatk CalculateContamination -I {input.tumor} \
        {params.matched} \
        -O {output}
    """
```
```
shell:
    """
    gatk MergeVcfs {params.i} -O {output.vcf}
    """
```
```
shell:
    """
    gatk MergeMutectStats {params.i} -O {output.stats}
    """
```
```
shell:
    """
    gatk FilterMutectCalls -V {input.vcf} -R {input.ref} \
        --contamination-table {input.contamination} \
        --stats {input.stats} \
        -ob-priors {input.f1r2model} \
        -O {output.vcf} \
        {params.extra}
    """
```
```
shell:
    """
    gatk SelectVariants -V {input.vcf} \
        -R {input.ref} \
        -O {output.vcf} \
        --exclude-filtered \
        -OVI
    """
```
```
shell:
    """
    # vcf2maf needs the directory containing the vep executable
    vep_fp=`which vep`
    vep_path=$(dirname "$vep_fp")
    vcf2maf.pl --input-vcf {input.vcf} --output-maf {output.maf} \
        --tumor-id {wildcards.patient}.tumor \
        --ref-fasta {input.fasta} \
        --vep-data {input.vep_dir} \
        --ncbi-build {params.assembly} \
        --vep-path $vep_path \
        --maf-center {params.center} \
        {params.normalid} \
        {params.isoforms}
    """
```
```
script:
    "../scripts/combine_mafs.py"
```
```
shell:
    """
    # Tumor-only Mutect2 calls for panel-of-normals creation;
    # --max-mnp-distance 0 is required for downstream GenomicsDBImport
    gatk Mutect2 \
        -I {input.bam} \
        -R {input.ref} \
        -O {output.vcf} \
        -L {input.intervals} \
        -ip 100 \
        --max-mnp-distance 0
    """
```
```
shell:
    """
    gatk GenomicsDBImport \
        -R {input.ref} \
        --genomicsdb-workspace-path {output} \
        -V {params.vcfs} \
        -L {input.intervals} \
        -ip 100
    """
```
```
shell:
    """
    gatk CreateSomaticPanelOfNormals -R {input.ref} -V gendb://{input.var} -O {output.vcf}
    """
```
```
shell:
    """
    gatk FastqToSam -F1 {input.r1} -F2 {input.r2} \
        -O {output} \
        -SM {wildcards.patient}.{wildcards.sample_type} \
        -RG {wildcards.patient}.{wildcards.sample_type}.{wildcards.readgroup} \
        --TMP_DIR {params.tmp} \
        -PL {params.pl}
    """
```
```
shell:
    """
    bwa index {input}
    """
```
```
shell:
    """
    # Revert the unmapped BAM to interleaved FASTQ and stream it straight into bwa mem
    gatk SamToFastq -I {input.bam} -F /dev/stdout -INTER true -NON_PF true \
    | \
    bwa mem -p -v 3 -t {threads} -T 0 \
        {input.ref} /dev/stdin - 2> >(tee {log} >&2) \
    | \
    samtools view -Shb -o {output}
    """
```
```
shell:
    """
    gatk MergeBamAlignment \
        -R {input.ref} \
        -UNMAPPED {input.unaligned} \
        -ALIGNED {input.aligned} \
        -O {output} \
        --SORT_ORDER queryname \
        --TMP_DIR {params.tmp}
    """
```
```
shell:
    """
    gatk MarkDuplicatesSpark \
        -I {params.inbams} \
        -O {output.bam} \
        --tmp-dir {params.tmp} \
        --conf 'spark.executor.cores={threads}' \
        --conf 'spark.local.dir={params.tmp}'
    """
```
```
shell:
    """
    gatk BaseRecalibrator -I {input.bam} -R {input.ref} -O {output.recal} \
        {params.ks} --tmp-dir {params.tmp}
    """
```
```
shell:
    """
    gatk ApplyBQSR \
        -I {input.bam} \
        -R {input.ref} \
        -O {output.bam} \
        -bqsr {input.recal} \
        --add-output-sam-program-record \
        --tmp-dir {params.tmp}
    """
```
```
shell:
    """
    mosdepth --by {params.by} -t {threads} qc/{wildcards.patient}.{wildcards.sample_type} \
        {input.bam}
    """
```
```
shell:
    """
    samtools flagstat {input} > {output}
    """
```
```
shell:
    """
    # FastQC writes fixed file names, so run it in a scratch directory
    # and move the report and zip to the rule's declared outputs
    tmpdir=qc/fastqc/{wildcards.patient}-{wildcards.sample_type}.tmp
    mkdir $tmpdir
    fastqc --outdir $tmpdir {input}
    mv $tmpdir/{wildcards.patient}.{wildcards.sample_type}_fastqc.html {output.html}
    mv $tmpdir/{wildcards.patient}.{wildcards.sample_type}_fastqc.zip {output.zipdata}
    rm -r $tmpdir
    """
```
```
wrapper:
    "0.50.4/bio/multiqc"
```
```
script:
    "../scripts/gather_depths.py"
```
```
script:
    "../scripts/plot_depth.R"
```
```
shell:
    """
    # Scatter the calling regions into {params.N} interval files for parallel Mutect2
    gatk SplitIntervals -R {input.ref} -L {input.intervals} \
        --scatter-count {params.N} -O {params.d} \
        --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION
    """
```
```
shell:
    """
    gatk BedToIntervalList \
        -I {input.bed} \
        -SD {input.d} \
        -O {output}
    """
```
```python
import os
import sys
import pandas as pd


def concat_mafs(files):
    mafs = []
    for f in files:
        # skiprows=1 skips the vcf2maf version header line
        df = pd.read_csv(f, sep="\t", skiprows=1)
        mafs.append(df)
    all_mafs = pd.concat(mafs)
    # Recover the patient ID from the {patient}.tumor sample barcode
    all_mafs['patient'] = all_mafs['Tumor_Sample_Barcode'].str.replace(".tumor", "", regex=False)
    if all_mafs.shape[0] == 0:
        print("WARNING: There are no variants in your MAF file!")
    return all_mafs


def main():
    files = snakemake.input
    maf = concat_mafs(files)
    # Optional stringent filtering (see the stringent_filtering setting in config.yaml)
    if snakemake.params[0] is True:
        maf = maf.query('t_depth >= 15 & t_alt_count >= 5')
    maf.to_csv(snakemake.output[0], sep="\t", index=False, header=True)


if __name__ == '__main__':
    main()
```
```python
import os
import pandas as pd

col_names = ['chr', 'start', 'end', 'depth']


def parse_summaries(files):
    depths = []
    for f in files:
        df = pd.read_csv(f, sep='\t')
        # File names follow {patient}.{sample_type}.mosdepth.summary.txt
        s = os.path.basename(f).split('.')
        patient = s[0]
        sample_type = s[1]
        # The 'total_region' row summarizes depth over all target regions;
        # .copy() avoids pandas' SettingWithCopyWarning on the assignments below
        depth = df.query("chrom == 'total_region'").copy()
        depth['patient'] = patient
        depth['sample_type'] = sample_type
        depths.append(depth)
    return pd.concat(depths, ignore_index=True)


def main():
    files = snakemake.input
    df = parse_summaries(files)
    df = df[['patient', 'sample_type', 'chrom', 'length', 'bases', 'mean', 'min', 'max']]
    df.to_csv(snakemake.output[0], index=False, header=True)


if __name__ == '__main__':
    main()
```
```r
# Point timezone lookups at the conda environment's zoneinfo database
Sys.setenv("TZDIR" = paste0(Sys.getenv("CONDA_PREFIX"), "/share/zoneinfo"))
library(readr)
library(dplyr)
library(ggplot2)

df <- read_csv(snakemake@input[[1]])
df <- df %>%
    mutate(sample_type = factor(sample_type, levels = c("normal", "tumor")))
p <- df %>%
    ggplot(aes(sample_type, mean)) +
    geom_boxplot(aes(fill = sample_type)) +
    labs(x = "", y = "Average Depth") +
    guides(fill = guide_legend(title = "")) +
    ggtitle("Sequencing Depths")
ggsave(snakemake@output[[1]], p)
```
```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path

from snakemake.shell import shell

input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
```