An adaptable Snakemake workflow which uses GATKs best practice recommendations to perform germline mutation calling starting with BAM files

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

This Snakemake workflow follows the GATK best-practice recommandations to call small germline variants.

The pipeline requires as inputs aligned BAM files (e.g. with BWA ) where the duplicates are already marked (e.g. with Picard or sambamba ). It then performed Base Quality Score Recalibration and joint genotyping of multiple samples , which is automatically parallized over user defined intervals (for examples see intervals.txt ) and chromosomes.

Filtering is performed using GATKs state-of-the-art Variant Quality Score Recalibration

At the end of the worklow, the Variant Effect Predictor is used to annotate the identified germline mutations.

A high level overview of the performed steps can be seen below:

DAG

As seen by the execution graph, an arbitrary number of samples/BAM files can be processed in parallel up to the joint variant calling.

Installation

Required tools:

The majority of the listed tools can be quite easily installed with conda which is recommanded.

Usage

First, modify the config_wgs.yaml and resources.yaml files. Both files contain detailed description what is expected. The config_wgs.yaml also contains links to some reference resources. Be careful, they are all specific for the GRCh37/hg19/b37 genome assembly.

After setting up all the config files and installing all tools, you can simply run:

snakemake --latency-wait 300 -j 5 --cluster "sbatch --mem={resources.mem_mb} --time {resources.runtime_min} --cpus-per-task {threads} --job-name={rule}.%j --output snakemake_cluster_submit_log/{rule}.%j.out --mail-type=FAIL"

This assumes that the cluster you are using is running SLURM . If this is not the case, you have to adjust the command after --cluster . The log information of each job will be safed in the snakemake_cluster_submit_log directory. This directory will not be created automatically.

-j specifies the number of jobs/rules should be submitted in parallel.

I recommand running this command in a detached session with tmux or screen .

Output

Below is the output of the tree command, after the workflow has finished for one patient H005-00ML. Usually you would include many patients simultaneously (>50). This is just to illustrate the created output files.

.
├── cohort
│ ├── benchmark
│ │ ├── ApplyVQSR_indel.txt
│ │ ├── ApplyVQSR_snp.txt
│ │ ├── CombineGVCFs.txt
│ │ ├── GenotypeGVCFs.txt
│ │ ├── MergeCohortVCFs.txt
│ │ ├── SelectVariants.txt
│ │ ├── VEP.txt
│ │ ├── VQSR_indel.txt
│ │ └── VQSR_snp.txt
│ ├── cohort.recalibrated.pass.vep.vcf.gz
│ ├── cohort.recalibrated.pass.vep.vcf.gz_summary.html
│ ├── cohort.recalibrated.vcf.gz
│ ├── cohort.recalibrated.vcf.gz.tbi
│ └── logs
│ ├── ApplyVQSR_indel.out
│ ├── ApplyVQSR_snp.out
│ ├── CombineGVCFs
│ ├── CombineGVCFs.1.out
│ ├── CombineGVCFs.2.out
│ ├── ...
│ ├── ...
│ ├── CombineGVCFs.Y.out
│ ├── GenotypeGVCFs.1.out
│ ├── GenotypeGVCFs.2.out
│ ├── ...
│ ├── ...
│ ├── GenotypeGVCFs.Y.out
│ ├── MakeSitesOnly.out
│ ├── MergeCohortVCFs.out
│ ├── SelectVariants.err
│ ├── VEP.out
│ ├── VQSR_indel.out
│ └── VQSR_snp.out
├── config
│ ├── config_wgs.yaml
│ └── resources.yaml
├── H005-00ML
│ ├── benchmark
│ │ ├── ApplyBQSR.txt
│ │ ├── BaseRecalibrator.txt
│ │ ├── GatherBQSRReports.txt
│ │ ├── GatherRecalBamFiles.txt
│ │ ├── HaplotypeCaller.txt
│ │ ├── IndexBam.txt
│ │ ├── MergeHaplotypeCaller.txt
│ │ └── SortBam.txt
│ ├── H005-00ML.germline.merged.g.vcf.gz
│ ├── H005-00ML.germline.merged.g.vcf.gz.tbi
│ └── logs
│ ├── ApplyBQSR
│ ├── ApplyBQSR.0000-scattered.interval_list.out
│ ├── ApplyBQSR.0001-scattered.interval_list.out
│ ├── ...
│ ├── ...
│ ├── ApplyBQSR.0049-scattered.interval_list.out
│ ├── BaseRecalibrator
│ ├── BaseRecalibrator.0000-scattered.interval_list.out
│ ├── BaseRecalibrator.0001-scattered.interval_list.out
│ ├── ...
│ ├── ...
│ ├── BaseRecalibrator.0049-scattered.interval_list.out
│ ├── GatherBQSRReports.out
│ ├── GatherRecalBamFiles.out
│ ├── HaplotypeCaller
│ ├── HaplotypeCaller.0000-scattered.interval_list.out
│ ├── HaplotypeCaller.0001-scattered.interval_list.out
│ ├── ...
│ ├── ...
│ ├── HaplotypeCaller.0049-scattered.interval_list.out
│ ├── IndexBam.out
│ ├── MergeHaplotypeCaller.out
│ └── SortBam.out
├── rules
│ ├── BaseQualityScoreRecalibration.smk
│ ├── JointGenotyping.smk
│ ├── VEP.smk
│ └── VQSR.smk
├── Snakefile
├── snakemake_cluster_submit_log
│ ├── ApplyBQSR.24720887.out
│ ├── ApplyVQSR_snp.24777265.out
│ ├── BaseRecalibrator.24710227.out
│ ├── CombineGVCFs.24772984.out
│ ├── GatherBQSRReports.24715726.out
│ ├── GatherRecalBamFiles.24722478.out
│ ├── GenotypeGVCFs.24773026.out
│ ├── HaplotypeCaller.24769848.out
│ ├── IndexBam.24768728.out
│ ├── MergeCohortVCFs.24776018.out
│ ├── MergeHaplotypeCaller.24772183.out
│ ├── SelectVariants.24777733.out
│ ├── SortBam.24768066.out
│ ├── VEP.24777739.out
│ ├── VQSR_indel.24776035.out
│ └── VQSR_snp.24776036.out

For each analyzed patient, a seperate directory gets created. Along with the patient specific gvcf file, this directory contains log files for all the processing steps that were performed for that patient ( log directory) as well as benchmarks for each rule, e.g. how long the step took or how much CPU/RAM was used ( benchmark directory).

The cohort directory contains the multi-sample VCF file, which gets created after performing the joint variant calling. The cohort.recalibrated.vcf.gz is the product of GATKs Variant Quality Score Recalibration . The cohort.recalibrated.pass.vep.vcf.gz is the filtered and VEP annotated version of cohort.recalibrated.vcf.gz (only variants with PASS are kept).

For most applications, the cohort.recalibrated.pass.vep.vcf.gz file, is the file you want to continue working with.

Code Snippets

shell:
     "mem_per_job=$(expr {resources.mem_mb} / {threads}) \n"
     "export mem_per_job \n"

     "cat {input.intervals} | parallel --plus -j {threads} '{input.GATK} --java-options -Xmx${{mem_per_job}}m BaseRecalibrator "
     "-I {input.marked_dup_bam} "
     "-R {input.reference} "
     "--known-sites {input.dbSNP} "
     "-L {{}} "
     "-O {params.recal_data}.{{/}}.table &> {log.out_recal}.{{/}}.out' \n"

     "touch {output.decoy}"

SnakeMake From line 68 of rules/BaseQualityScoreRecalibration.smk

shell:
     "{input.GATK} GatherBQSRReports "
     "{params.recal_tables} "
     "-O {output.recal_data} &> {log.out_GatherBQSRReports} \n"

     "rm {params.tmp_dir}/{wildcards.patient}/normal_recal_bqsr.0*table"

SnakeMake From line 108 of rules/BaseQualityScoreRecalibration.smk

shell:
     "mem_per_job=$(expr {resources.mem_mb} / {threads}) \n"
     "export mem_per_job \n"
     "cat {input.intervals} | parallel --plus -j {threads} '{input.GATK} --java-options -Xmx${{mem_per_job}}m ApplyBQSR "
     "-R {input.reference} "
     "-L {{}} "
     "-I {input.marked_dup_bam} "
     "--bqsr-recal-file {input.recal_data} "
     "-O {params.recal_bam}.{{/}}.bam &> {log.out_ApplyBQSR}.{{/}}.out' \n"

     "touch {output.decoy}"

SnakeMake gatk From line 142 of rules/BaseQualityScoreRecalibration.smk

shell:
     "{input.GATK} GatherBamFiles "
     "{params.bams} "
     "-O {output.final_recal_bam} &> {log.out_GatherRecalBamFiles} \n"

SnakeMake From line 181 of rules/BaseQualityScoreRecalibration.smk

shell:
     "{input.samtools} sort -@ {threads} -o {output.recal_sorted} {input.final_recal_bam} &> {log.out_SortBam}"

SnakeMake From line 211 of rules/BaseQualityScoreRecalibration.smk

shell:
     "{input.sambamba} index -p -t {threads} {input.recal_sorted} {output.recal_sorted_index} &> {log.out_IndexBam}"

SnakeMake From line 236 of rules/BaseQualityScoreRecalibration.smk

shell:
     "{input.GATK} MergeVcfs {params.gvcfs} -O {output.merged_gvcf} &> {log.MergeHaplotypeCaller} \n"
     "rm -rf {params.tmp_dir}/{wildcards.patient} \n"
     "touch {output.decoy}"

SnakeMake From line 131 of rules/JointGenotyping.smk

shell:
     "mem_per_job=$(expr {resources.mem_mb} / {threads}) \n"
     "export mem_per_job \n"

     "cat {input.chromosomes} | parallel -j {threads} '{input.GATK} CombineGVCFs --java-options -Xmx${{mem_per_job}}m "
     "{params.annotation} -R {input.reference} {params.input_files} "
     "-O {params.cohort_dir}/{{}}.cohort.g.vcf.gz -L {{}} &> {log.CombineGVCFs_log}.{{}}.out' \n"

     "touch {output.decoy}"

SnakeMake From line 165 of rules/JointGenotyping.smk

shell:
     "mem_per_job=$(expr {resources.mem_mb} / {threads}) \n"
     "export mem_per_job \n"

     "cat {input.chromosomes} | parallel -j {threads} '{input.GATK} --java-options -Xmx${{mem_per_job}}m GenotypeGVCFs  "
     "{params.annotation} -R {input.reference} -V {params.cohort_dir}/{{}}.cohort.g.vcf.gz "
     "-O {params.cohort_dir}/{{}}.cohort.vcf.gz &> {log.GenotypeGVCFs_log}.{{}}.out' \n"

     "rm {params.cohort_dir}/*g.vcf* \n"
     "touch {output.decoy}"

SnakeMake From line 200 of rules/JointGenotyping.smk

shell:
     "{input.GATK} MergeVcfs {params.cohort_gvcfs} -O {output.merged_cohort_vcf} &> {log.MergeCohortVCFs_log} \n"
     "{input.GATK} MakeSitesOnlyVcf -I {output.merged_cohort_vcf} -O {output.sites_only_vcf} &> {log.MakeSitesOnlyVcf_log} \n"
     "rm {params.cohort_dir}/{{1..22}}.coh* {params.cohort_dir}/X.coh* {params.cohort_dir}/Y.coh*"

SnakeMake From line 239 of rules/JointGenotyping.smk

shell:
     "{input.vep} --cache --dir_cache {params.vep_cache} --offline --fasta {input.reference} --vcf --buffer_size 55000 "
     "--pick --fork {threads} --sift b --variant_class --regulatory --check_existing --af_1kg --compress_output bgzip "
     "-i {input.pass_only_vcf} -o {output.vep_vcf} &> {log.out_VEP} \n"

SnakeMake From line 31 of rules/VEP.smk

shell:
     "{input.GATK} VariantRecalibrator -V {input.sites_only_vcf} --trust-all-polymorphic "
     "{params.truth_sensitivity_tranches} -an FS -an ReadPosRankSum -an MQRankSum -an QD -an SOR -an DP "
     "-mode INDEL "
     "--max-gaussians 4 "
     "-AS "
     "--resource:mills,known=false,training=true,truth=true,prior=12 {input.mills_indels} "
     "--resource:axiomPoly,known=false,training=true,truth=false,prior=10.0 {input.axiom_poly} "
     "--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {input.dbSNP} "
     "-O {output.cohort_indel_recal} "
     "--tranches-file {output.tranche_files} &> {log.VQSR_indel} \n"

SnakeMake From line 38 of rules/VQSR.smk

shell:
     "{input.GATK} VariantRecalibrator -V {input.sites_only_vcf} --trust-all-polymorphic "
     "{params.truth_sensitivity_tranches} -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR -an DP "
     "-mode SNP "
     "--max-gaussians 6 "
     "-AS "
     "--resource:hapmap,known=false,training=true,truth=true,prior=15.0 {input.hapmap} "
     "--resource:omni,known=false,training=true,truth=false,prior=12.0 {input.omni} "
     "--resource:1000G,known=false,training=true,truth=false,prior=10.0 {input.thousandG_phase1_snps_high_confidence} "
     "--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {input.dbSNP} "

SnakeMake From line 81 of rules/VQSR.smk

shell:
     "{input.GATK} ApplyVQSR -V {input.merged_cohort_vcf} --recal-file {input.cohort_indel_recal} "
     "--tranches-file {input.tranche_files} --truth-sensitivity-filter-level 97 "
     "--create-output-variant-index true -mode INDEL "
     "-AS "
     "-O {output.indel_recalibrated_cohort} &> {log.ApplyVQSR_indel} \n"
     "rm {cohort_dir}/cohort.sites.only.vcf.gz.tbi"

SnakeMake From line 120 of rules/VQSR.smk

shell:
     "{input.GATK} ApplyVQSR -V {input.indel_recalibrated_cohort} --recal-file {input.cohort_snp_recal} "
     "--tranches-file {input.tranche_files} --truth-sensitivity-filter-level 97 "
     "--create-output-variant-index true -mode SNP "
     "-AS "
     "-O {output.recalibrated_cohort} &> {log.ApplyVQSR_snp}"

SnakeMake From line 150 of rules/VQSR.smk

shell:
     "{input.bcftools} view -f PASS {input.filtered_vcf} -o {output.pass_only_vcf} -O z &> {log.err_SelectVariants} \n"
     "rm -rf {params.cohort_dir}/filtration \n"
     "rm {params.cohort_dir}/cohort.indel.recalibrated.vcf.gz.tbi \n"
     "rm {params.cohort_dir}/cohort.vcf.gz.tbi \n"