An adaptable Snakemake workflow which uses GATKs best practice recommendations to perform germline mutation calling starting with BAM files
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This Snakemake workflow follows the GATK best-practice recommandations to call small germline variants.
The pipeline requires as inputs aligned BAM files (e.g. with BWA ) where the duplicates are already marked (e.g. with Picard or sambamba ). It then performed Base Quality Score Recalibration and joint genotyping of multiple samples , which is automatically parallized over user defined intervals (for examples see intervals.txt ) and chromosomes.
Filtering is performed using GATKs state-of-the-art Variant Quality Score Recalibration
At the end of the worklow, the Variant Effect Predictor is used to annotate the identified germline mutations.
A high level overview of the performed steps can be seen below:
As seen by the execution graph, an arbitrary number of samples/BAM files can be processed in parallel up to the joint variant calling.
Installation
Required tools:
The majority of the listed tools can be quite easily installed with conda which is recommanded.
Usage
First, modify the config_wgs.yaml and resources.yaml files. Both files contain detailed description what is expected. The config_wgs.yaml also contains links to some reference resources. Be careful, they are all specific for the GRCh37/hg19/b37 genome assembly.
After setting up all the config files and installing all tools, you can simply run:
snakemake --latency-wait 300 -j 5 --cluster "sbatch --mem={resources.mem_mb} --time {resources.runtime_min} --cpus-per-task {threads} --job-name={rule}.%j --output snakemake_cluster_submit_log/{rule}.%j.out --mail-type=FAIL"
This assumes that the cluster you are using is running
SLURM
.
If this is not the case, you have to adjust the command after
--cluster
. The log information of each job will be safed in the
snakemake_cluster_submit_log
directory.
This directory will
not
be created automatically.
-j
specifies the number of jobs/rules should be submitted in parallel.
I recommand running this command in a detached session with
tmux
or
screen
.
Output
Below is the output of the
tree
command, after the workflow has finished for one patient H005-00ML.
Usually you would include many patients simultaneously (>50). This is just to illustrate the created output files.
.
├── cohort
│ ├── benchmark
│ │ ├── ApplyVQSR_indel.txt
│ │ ├── ApplyVQSR_snp.txt
│ │ ├── CombineGVCFs.txt
│ │ ├── GenotypeGVCFs.txt
│ │ ├── MergeCohortVCFs.txt
│ │ ├── SelectVariants.txt
│ │ ├── VEP.txt
│ │ ├── VQSR_indel.txt
│ │ └── VQSR_snp.txt
│ ├── cohort.recalibrated.pass.vep.vcf.gz
│ ├── cohort.recalibrated.pass.vep.vcf.gz_summary.html
│ ├── cohort.recalibrated.vcf.gz
│ ├── cohort.recalibrated.vcf.gz.tbi
│ └── logs
│ ├── ApplyVQSR_indel.out
│ ├── ApplyVQSR_snp.out
│ ├── CombineGVCFs
│ ├── CombineGVCFs.1.out
│ ├── CombineGVCFs.2.out
│ ├── ...
│ ├── ...
│ ├── CombineGVCFs.Y.out
│ ├── GenotypeGVCFs.1.out
│ ├── GenotypeGVCFs.2.out
│ ├── ...
│ ├── ...
│ ├── GenotypeGVCFs.Y.out
│ ├── MakeSitesOnly.out
│ ├── MergeCohortVCFs.out
│ ├── SelectVariants.err
│ ├── VEP.out
│ ├── VQSR_indel.out
│ └── VQSR_snp.out
├── config
│ ├── config_wgs.yaml
│ └── resources.yaml
├── H005-00ML
│ ├── benchmark
│ │ ├── ApplyBQSR.txt
│ │ ├── BaseRecalibrator.txt
│ │ ├── GatherBQSRReports.txt
│ │ ├── GatherRecalBamFiles.txt
│ │ ├── HaplotypeCaller.txt
│ │ ├── IndexBam.txt
│ │ ├── MergeHaplotypeCaller.txt
│ │ └── SortBam.txt
│ ├── H005-00ML.germline.merged.g.vcf.gz
│ ├── H005-00ML.germline.merged.g.vcf.gz.tbi
│ └── logs
│ ├── ApplyBQSR
│ ├── ApplyBQSR.0000-scattered.interval_list.out
│ ├── ApplyBQSR.0001-scattered.interval_list.out
│ ├── ...
│ ├── ...
│ ├── ApplyBQSR.0049-scattered.interval_list.out
│ ├── BaseRecalibrator
│ ├── BaseRecalibrator.0000-scattered.interval_list.out
│ ├── BaseRecalibrator.0001-scattered.interval_list.out
│ ├── ...
│ ├── ...
│ ├── BaseRecalibrator.0049-scattered.interval_list.out
│ ├── GatherBQSRReports.out
│ ├── GatherRecalBamFiles.out
│ ├── HaplotypeCaller
│ ├── HaplotypeCaller.0000-scattered.interval_list.out
│ ├── HaplotypeCaller.0001-scattered.interval_list.out
│ ├── ...
│ ├── ...
│ ├── HaplotypeCaller.0049-scattered.interval_list.out
│ ├── IndexBam.out
│ ├── MergeHaplotypeCaller.out
│ └── SortBam.out
├── rules
│ ├── BaseQualityScoreRecalibration.smk
│ ├── JointGenotyping.smk
│ ├── VEP.smk
│ └── VQSR.smk
├── Snakefile
├── snakemake_cluster_submit_log
│ ├── ApplyBQSR.24720887.out
│ ├── ApplyVQSR_snp.24777265.out
│ ├── BaseRecalibrator.24710227.out
│ ├── CombineGVCFs.24772984.out
│ ├── GatherBQSRReports.24715726.out
│ ├── GatherRecalBamFiles.24722478.out
│ ├── GenotypeGVCFs.24773026.out
│ ├── HaplotypeCaller.24769848.out
│ ├── IndexBam.24768728.out
│ ├── MergeCohortVCFs.24776018.out
│ ├── MergeHaplotypeCaller.24772183.out
│ ├── SelectVariants.24777733.out
│ ├── SortBam.24768066.out
│ ├── VEP.24777739.out
│ ├── VQSR_indel.24776035.out
│ └── VQSR_snp.24776036.out
For each analyzed patient, a seperate directory gets created. Along with the patient specific gvcf file, this directory contains log files for all the processing steps that
were performed for that patient (
log
directory) as well as benchmarks for each rule, e.g. how long the step took or how much CPU/RAM was used (
benchmark
directory).
The
cohort
directory contains the multi-sample VCF file, which gets created after performing the joint variant calling.
The
cohort.recalibrated.vcf.gz
is the product of GATKs
Variant Quality Score Recalibration
.
The
cohort.recalibrated.pass.vep.vcf.gz
is the filtered and
VEP
annotated version of
cohort.recalibrated.vcf.gz
(only variants with
PASS
are kept).
For most applications, the
cohort.recalibrated.pass.vep.vcf.gz
file, is the file you want to continue working with.
Code Snippets
68 69 70 71 72 73 74 75 76 77 78 79 | shell: "mem_per_job=$(expr {resources.mem_mb} / {threads}) \n" "export mem_per_job \n" "cat {input.intervals} | parallel --plus -j {threads} '{input.GATK} --java-options -Xmx${{mem_per_job}}m BaseRecalibrator " "-I {input.marked_dup_bam} " "-R {input.reference} " "--known-sites {input.dbSNP} " "-L {{}} " "-O {params.recal_data}.{{/}}.table &> {log.out_recal}.{{/}}.out' \n" "touch {output.decoy}" |
108 109 110 111 112 113 | shell: "{input.GATK} GatherBQSRReports " "{params.recal_tables} " "-O {output.recal_data} &> {log.out_GatherBQSRReports} \n" "rm {params.tmp_dir}/{wildcards.patient}/normal_recal_bqsr.0*table" |
142 143 144 145 146 147 148 149 150 151 152 | shell: "mem_per_job=$(expr {resources.mem_mb} / {threads}) \n" "export mem_per_job \n" "cat {input.intervals} | parallel --plus -j {threads} '{input.GATK} --java-options -Xmx${{mem_per_job}}m ApplyBQSR " "-R {input.reference} " "-L {{}} " "-I {input.marked_dup_bam} " "--bqsr-recal-file {input.recal_data} " "-O {params.recal_bam}.{{/}}.bam &> {log.out_ApplyBQSR}.{{/}}.out' \n" "touch {output.decoy}" |
181 182 183 184 | shell: "{input.GATK} GatherBamFiles " "{params.bams} " "-O {output.final_recal_bam} &> {log.out_GatherRecalBamFiles} \n" |
211 212 | shell: "{input.samtools} sort -@ {threads} -o {output.recal_sorted} {input.final_recal_bam} &> {log.out_SortBam}" |
236 237 | shell: "{input.sambamba} index -p -t {threads} {input.recal_sorted} {output.recal_sorted_index} &> {log.out_IndexBam}" |
131 132 133 134 | shell: "{input.GATK} MergeVcfs {params.gvcfs} -O {output.merged_gvcf} &> {log.MergeHaplotypeCaller} \n" "rm -rf {params.tmp_dir}/{wildcards.patient} \n" "touch {output.decoy}" |
165 166 167 168 169 170 171 172 173 | shell: "mem_per_job=$(expr {resources.mem_mb} / {threads}) \n" "export mem_per_job \n" "cat {input.chromosomes} | parallel -j {threads} '{input.GATK} CombineGVCFs --java-options -Xmx${{mem_per_job}}m " "{params.annotation} -R {input.reference} {params.input_files} " "-O {params.cohort_dir}/{{}}.cohort.g.vcf.gz -L {{}} &> {log.CombineGVCFs_log}.{{}}.out' \n" "touch {output.decoy}" |
200 201 202 203 204 205 206 207 208 209 | shell: "mem_per_job=$(expr {resources.mem_mb} / {threads}) \n" "export mem_per_job \n" "cat {input.chromosomes} | parallel -j {threads} '{input.GATK} --java-options -Xmx${{mem_per_job}}m GenotypeGVCFs " "{params.annotation} -R {input.reference} -V {params.cohort_dir}/{{}}.cohort.g.vcf.gz " "-O {params.cohort_dir}/{{}}.cohort.vcf.gz &> {log.GenotypeGVCFs_log}.{{}}.out' \n" "rm {params.cohort_dir}/*g.vcf* \n" "touch {output.decoy}" |
239 240 241 242 | shell: "{input.GATK} MergeVcfs {params.cohort_gvcfs} -O {output.merged_cohort_vcf} &> {log.MergeCohortVCFs_log} \n" "{input.GATK} MakeSitesOnlyVcf -I {output.merged_cohort_vcf} -O {output.sites_only_vcf} &> {log.MakeSitesOnlyVcf_log} \n" "rm {params.cohort_dir}/{{1..22}}.coh* {params.cohort_dir}/X.coh* {params.cohort_dir}/Y.coh*" |
31 32 33 34 | shell: "{input.vep} --cache --dir_cache {params.vep_cache} --offline --fasta {input.reference} --vcf --buffer_size 55000 " "--pick --fork {threads} --sift b --variant_class --regulatory --check_existing --af_1kg --compress_output bgzip " "-i {input.pass_only_vcf} -o {output.vep_vcf} &> {log.out_VEP} \n" |
38 39 40 41 42 43 44 45 46 47 48 | shell: "{input.GATK} VariantRecalibrator -V {input.sites_only_vcf} --trust-all-polymorphic " "{params.truth_sensitivity_tranches} -an FS -an ReadPosRankSum -an MQRankSum -an QD -an SOR -an DP " "-mode INDEL " "--max-gaussians 4 " "-AS " "--resource:mills,known=false,training=true,truth=true,prior=12 {input.mills_indels} " "--resource:axiomPoly,known=false,training=true,truth=false,prior=10.0 {input.axiom_poly} " "--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {input.dbSNP} " "-O {output.cohort_indel_recal} " "--tranches-file {output.tranche_files} &> {log.VQSR_indel} \n" |
81 82 83 84 85 86 87 88 89 90 | shell: "{input.GATK} VariantRecalibrator -V {input.sites_only_vcf} --trust-all-polymorphic " "{params.truth_sensitivity_tranches} -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR -an DP " "-mode SNP " "--max-gaussians 6 " "-AS " "--resource:hapmap,known=false,training=true,truth=true,prior=15.0 {input.hapmap} " "--resource:omni,known=false,training=true,truth=false,prior=12.0 {input.omni} " "--resource:1000G,known=false,training=true,truth=false,prior=10.0 {input.thousandG_phase1_snps_high_confidence} " "--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {input.dbSNP} " |
120 121 122 123 124 125 126 | shell: "{input.GATK} ApplyVQSR -V {input.merged_cohort_vcf} --recal-file {input.cohort_indel_recal} " "--tranches-file {input.tranche_files} --truth-sensitivity-filter-level 97 " "--create-output-variant-index true -mode INDEL " "-AS " "-O {output.indel_recalibrated_cohort} &> {log.ApplyVQSR_indel} \n" "rm {cohort_dir}/cohort.sites.only.vcf.gz.tbi" |
150 151 152 153 154 155 | shell: "{input.GATK} ApplyVQSR -V {input.indel_recalibrated_cohort} --recal-file {input.cohort_snp_recal} " "--tranches-file {input.tranche_files} --truth-sensitivity-filter-level 97 " "--create-output-variant-index true -mode SNP " "-AS " "-O {output.recalibrated_cohort} &> {log.ApplyVQSR_snp}" |
178 179 180 181 182 | shell: "{input.bcftools} view -f PASS {input.filtered_vcf} -o {output.pass_only_vcf} -O z &> {log.err_SelectVariants} \n" "rm -rf {params.cohort_dir}/filtration \n" "rm {params.cohort_dir}/cohort.indel.recalibrated.vcf.gz.tbi \n" "rm {params.cohort_dir}/cohort.vcf.gz.tbi \n" |
Support
- Future updates
Related Workflows





