Whole exome sequencing Snakemake workflow based on GATK best practices
THIS WORKFLOW IS STILL UNDER DEVELOPMENT AND IS NOT READY YET!
Introduction
Note: This workflow is still under development.
WES GATK is a flexible and user-friendly whole exome sequencing workflow based on GATK best practices . It is designed for processing Illumina WES short-read data and features automatic sample table and Snakemake configuration file generation, as well as simplified workflow execution.
Workflow
Description:
The workflow starts by converting the raw FASTQ files to an unmapped BAM (uBAM) file in order to include the read group information. Illumina adapters are then marked, and reads are aligned to the hg38 human reference genome with BWA, taking alternative-haplotype mapping into account. Multiple BAM files from the same sample are merged, PCR and optical duplicates are marked, and reads are sorted by genome coordinates. The sorted BAM files are analyzed to generate mapping QC metrics. The next steps perform Base Quality Score Recalibration: recalibrated quality scores are computed from known-variants files using GATK's machine learning algorithm and applied to the BAM file. Variants are then called with GATK's HaplotypeCaller in GVCF format, and all samples are combined into a single GVCF file. The merged file is used for a final joint variant calling step that produces a single VCF file containing the genotypes of all samples. For better annotation, this file is split into two files, one for indels and one for SNPs. The last steps are variant filtration and variant annotation using Nirvana and Annovar.
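For orientation, the per-sample core of this chain corresponds roughly to the following GATK/BWA calls. This is a minimal sketch with placeholder file names (sample_R1.fastq.gz, ref.fasta, known_variants.vcf.gz, exome.intervals, samples.map); the exact rules, resources, and options used by the workflow are listed under Code Snippets below:
# Per-sample sketch (placeholder paths; see the Snakemake rules in Code Snippets for the full options)
gatk FastqToSam -F1 sample_R1.fastq.gz -F2 sample_R2.fastq.gz -O sample.unmapped.bam -SM sample -PL Illumina
gatk MarkIlluminaAdapters -I sample.unmapped.bam -O sample.markadapt.bam -M sample.adapter_metrics.txt
gatk SamToFastq -I sample.markadapt.bam --FASTQ /dev/stdout --INTERLEAVE true \
    | bwa mem -p -Y -t 4 ref.fasta /dev/stdin \
    | gatk MergeBamAlignment --ALIGNED_BAM /dev/stdin --UNMAPPED_BAM sample.unmapped.bam -R ref.fasta -O sample.merged.bam
picard MarkDuplicates -I sample.merged.bam -O sample.dedup.bam -M sample.duplicate_metrics.txt
gatk SortSam --INPUT sample.dedup.bam --OUTPUT sample.sorted.bam --SORT_ORDER coordinate
gatk BaseRecalibrator -R ref.fasta -I sample.sorted.bam --known-sites known_variants.vcf.gz -O sample.recal.table
gatk ApplyBQSR -R ref.fasta -I sample.sorted.bam -bqsr sample.recal.table -O sample.bqsr.bam
gatk HaplotypeCaller -R ref.fasta -I sample.bqsr.bam -ERC GVCF -O sample.g.vcf.gz
# Cohort-level joint genotyping
gatk GenomicsDBImport --sample-name-map samples.map -L exome.intervals --genomicsdb-workspace-path genomicsdb
gatk GenotypeGVCFs -R ref.fasta -V gendb://genomicsdb -O cohort.vcf.gz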
Features:
This workflow is an implementation of the gold-standard GATK best practices, with the following additional features:
- Exome implementation (uses a user-provided intervals file for targeted calling, plus X base pairs of padding; default = 100 bp, see below)
- Merging of samples run on multiple lanes
- Qualimap and MultiQC QC reports
- Nirvana and Annovar annotation (more info)
- Joint genotyping of all samples
- Automatic sample name, group, lane, and read number recognition
- Automatic Snakemake sample table and config file generation
- Progress bar for the analysis (use the --parse-snakemake-output argument, see below)
Workflow Diagram:
Quick Start
Installation
Clone the repository:
git clone https://github.com/AbdelrahmanYahia/wes_gatk.git
If you prefer to download the dependencies manually, you can find them [here].
Step-by-Step Installation Guide
Register ANNOVAR
To obtain the download link, you need to register on the Annovar website.
Make sure you have Conda and Mamba installed. If not, follow these steps:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh # Follow the prompts
Install Mamba:
conda install mamba -n base -c conda-forge
Restart your terminal.
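As an optional quick check, you can confirm that both tools are on your PATH:
conda --version
mamba --version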
Prepare the Environment
Change the permissions for the Prep_ENV.sh file:
chmod 777 wes_gatk/scripts/Prep_ENV.sh
Create the environment, install the tools, and download the annotations database:
./wes_gatk/scripts/Prep_ENV.sh $ANNOVAR_LINK
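For example, using the personalized download link you received after registering (the value below is only a placeholder, not a real URL):
ANNOVAR_LINK="<your-personal-annovar-download-link>"   # placeholder from the ANNOVAR registration email
./wes_gatk/scripts/Prep_ENV.sh $ANNOVAR_LINK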
Follow the instructions here to download the Python environment required by GATK for the CNV workflow. This process may take some time.
You can also download all the required reference files using wes_gatk/scripts/gatk_download_data.sh:
chmod 777 wes_gatk/scripts/gatk_download_data.sh
bash wes_gatk/scripts/gatk_download_data.sh $DOWNLOAD_DIR
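For example, to place the reference bundle in a directory of your choice (the path below is just an illustration):
DOWNLOAD_DIR=$HOME/gatk_bundle
bash wes_gatk/scripts/gatk_download_data.sh $DOWNLOAD_DIR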
Running the Analysis
To start the analysis, activate the wes_gatk environment and run the wes.py file:
conda activate wes_gatk
cd wes_gatk
python3 wes.py WES --help
You can also use the Python file to generate the sample table and the config file, and then run Snakemake independently. Modify the parameters according to your needs:
conda activate wes_gatk
python3 wes.py WES \
--input PTH/to/samples \
--output PTH/to/outdir \
--reference-fasta broad_hg38/Homo_sapiens_assembly38.fasta \
--bed-file exome_bed/S07604715_Padded.bed \
--gff-file broad_hg38/Homo_sapiens.GRCh38.109.gff3.gz \
--nirvana-path ~/Nirvana \
--annovar-path ~/annovar_source/annovar \
--known-variants broad_hg38/1000G_omni2.5.hg38.vcf.gz \
--reference-index broad_hg38/Homo_sapiens_assembly38.fasta \
--generate-confs-only
As an alternative, you can edit the options in "prep_files.sh", then run it:
conda activate wes_gatk
bash prep_files.sh
To run a Snakemake dry-run:
conda activate wes_gatk
snakemake \
    --snakefile workflow/Snakefile \
    --configfile PTH/to/outdir/config.yml \
    -n -j THREADS
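Once the dry-run output looks right, the same command without -n performs the actual run (a hedged example; set THREADS to the number of cores you want Snakemake to use):
snakemake \
    --snakefile workflow/Snakefile \
    --configfile PTH/to/outdir/config.yml \
    -j THREADS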
Or you can run it by adding the --dry-run argument to the Python script:
conda activate wes_gatk
python3 wes.py WES \
--input PTH/to/samples \
--output PTH/to/outdir \
--reference-fasta broad_hg38/Homo_sapiens_assembly38.fasta \
--bed-file exome_bed/S07604715_Padded.bed \
--gff-file broad_hg38/Homo_sapiens.GRCh38.109.gff3.gz \
--nirvana-path ~/Nirvana \
--annovar-path ~/annovar_source/annovar \
--known-variants broad_hg38/1000G_omni2.5.hg38.vcf.gz \
--reference-index broad_hg38/Homo_sapiens_assembly38.fasta \
--generate-confs-only \
--dry-run
Advanced Parameters
For advanced usage, you can refer to the following command-line options:
usage: Basic Run Usage example: guap WES -i indir -o outdir --bed-file file --reference-fasta fasta.fasta --reference-index indexpath
options:
-h, --help show this help message and exit
basic config:
-i in path, --input in path
Input directory path
-o out path, --output out path
Output directory path
Workflow configure:
--threads N Number of total threads to use [default = all]
--reference-fasta path/to/file.fa
path to reference fasta file
--bed-file path bed file path
--gff-file path gff file path
--nirvana-path path Path for Nirvana
--annovar-path path Path for annovar
--generate-confs-only
Generate sample table and config file only
--contig-ploidy-priors path
Path to the contig ploidy priors file
--call-CNV Perform GATK CNV pipeline
--dont-use-gatk-bestparctice
Do not use GATK best-practice settings
QC configuration:
--trimmomatic Use trimmomatic
--trim-t N Number of threads to use during trim step
--trim-min-length N trimmomatic min length [default = 30]
--slidingwindow-size N
trimmomatic sliding window size [default = 4]
--slidingwindow-quality N
trimmomatic sliding window quality score [default = 10]
--trimmomatic-extra-args ='-args'
A string value of extra args for trimmomatic (must be used with = with no spaces (--trimmomatic-extra-args='-arg1 -arg2'))
--skip-QC Skip FastQC step
Samples pre-processing:
--gen-ubam-threads N Number of threads to use when creating the uBAM and marking adapters [default = 4]
--gen-ubam-mem N Memory (GB) to use when creating the uBAM and marking adapters [default = 5]
Aligner configuration:
--index-threads N Number of threads to use during indexing ref [default = 4]
--align-threads N Number of threads to use during sample alignment [default = 4]
--align-mem N Memory (GB) to use during sample alignment [default = 32]
--aligner-extra-args '-args'
Extra arguments for aligner, use it with no spaces and add = ( --aligner-extra-args='-arg1 -arg2' )
--reference-index path/to/ref
path to reference index
--reference-output-path path/to/ref
path where the generated reference index will be written
--reference-output-prefix path/to/ref
prefix for the generated reference index files
--index-fasta Index fasta file
Variant caller configuration:
--padding N Interval padding to include
--known-variants-indels path
path to known indels VCF file
--known-variants-indels2 path
path to a second known indels VCF file
--known-variants-snps path
path to known SNPs VCF file
--caller-extra-args '-args'
Extra arguments for caller, use it with no spaces and add = ( --caller-extra-args='-arg1 -arg2' )
--calling-threads N Number of threads to use during variant calling [default = 4]
--calling-mem N Memory in GB to use during variant calling [default = 8]
Snakemake Options:
--dry-run performs snakemake dry run
--export-dag performs snakemake dry run and exports DAG
--smk-extra-args ='-args'
A string value of extra args for snakemake (must be used with = with no spaces (--smk-extra-args='-arg1 -arg2'))
--parse-snakemake-output
prints progress bar instead of snakemake regular output
Annotation configuration:
--annovar-protocol str
Annovar Protocol defaults: refGene,avsnp150,clinvar_20221231,cosmic70,dbnsfp31a_interpro,EAS.sites.2015_08,EUR.sites.2015_08,gme,gnomad211_exome,SAS.sites.2015_08
--annovar-operation str
Annovar operation defaults: g,f,f,f,f,f,f,f,f,f
--annovar-extra-args ='-args'
A string value of extra args for annovar (must be used with = with no spaces (--annovar-extra-args='-arg1 -arg2'))
--nirvana-extra-args ='-args'
A string value of extra args for nirvana (must be used with = with no spaces (--nirvana-extra-args='-arg1 -arg2'))
--annotation-threads N
Number of threads to use during annotation [default = 4]
--annotation-mem N Memory (GB) to use during annotation [default = 5]
Other:
--continue continue analysis when re-run
--overwrite overwrite output dir if exists
-n str, --name str Name of files [ default = guap_run[date time] ]
--verbose verbose
--quit print many output
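As an illustration of these advanced options, the run below combines the Quick Start example with threading, the progress bar, the GATK CNV pipeline, and extra Snakemake arguments. This is a hedged sketch reusing the placeholder paths from the Quick Start; the contig ploidy priors path is a placeholder for whatever priors table you provide:
python3 wes.py WES \
    --input PTH/to/samples \
    --output PTH/to/outdir \
    --reference-fasta broad_hg38/Homo_sapiens_assembly38.fasta \
    --bed-file exome_bed/S07604715_Padded.bed \
    --gff-file broad_hg38/Homo_sapiens.GRCh38.109.gff3.gz \
    --nirvana-path ~/Nirvana \
    --annovar-path ~/annovar_source/annovar \
    --known-variants broad_hg38/1000G_omni2.5.hg38.vcf.gz \
    --reference-index broad_hg38/Homo_sapiens_assembly38.fasta \
    --threads 16 \
    --call-CNV \
    --contig-ploidy-priors PTH/to/contig_ploidy_priors.tsv \
    --parse-snakemake-output \
    --smk-extra-args='--keep-going'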
Code Snippets
shell: "multiqc . -o multiqc/"
shell:
    """
    ext={params.ext}
    R1={input.R1}
    SM={wildcards.sample}
    PL="Illumina"
    LB=$SM

    if [[ "$ext" == *".gz" ]]; then
        RGID=$(head -n1 <(zcat {input.R1}) | sed 's/:/_/g' |cut -d "_" -f1,2,3,4)
        #RGID=$(zcat {input.R1} | head -n1 | sed 's/:/_/g' |cut -d "_" -f1,2,3,4)
    else
        RGID=$(head {input.R1} -n1 | sed 's/:/_/g' |cut -d "_" -f1,2,3,4)
    fi

    ## TODO: confirm to use this
    PU=$RGID.$LB

    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        FastqToSam \
        -F1 {input.R1} \
        -F2 {input.R2} \
        -O {output} \
        -SM $SM \
        -LB $LB \
        -PL $PL \
        -RG $RGID \
        -PU $PU \
    """
shell:
    '''
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        MarkIlluminaAdapters \
        -I {input} \
        -O {output.bam} \
        -M {output.metrics} \
        # > {log} 2>&1
    '''
shell:
    '''
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        SamToFastq \
        -I {input.bam} \
        --FASTQ /dev/stdout \
        --CLIPPING_ATTRIBUTE XT --CLIPPING_ACTION 2 \
        --INTERLEAVE true -NON_PF true | bwa mem -K 100000000 \
        -v 3 {params.bwa_args} -t {threads} -p -Y {params.index} /dev/stdin | gatk \
        --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        MergeBamAlignment \
        --VALIDATION_STRINGENCY SILENT \
        --EXPECTED_ORIENTATIONS FR \
        --ATTRIBUTES_TO_RETAIN X0 \
        --ALIGNED_BAM /dev/stdin \
        --UNMAPPED_BAM {input.bam} \
        --OUTPUT {output.bam} \
        -R {params.fa} \
        --PAIRED_RUN true \
        --SORT_ORDER "unsorted" \
        --IS_BISULFITE_SEQUENCE false \
        --ALIGNED_READS_ONLY false \
        --CLIP_ADAPTERS false \
        --MAX_RECORDS_IN_RAM 2000000 \
        --ADD_MATE_CIGAR true \
        --MAX_INSERTIONS_OR_DELETIONS -1 \
        --PRIMARY_ALIGNMENT_STRATEGY MostDistant \
        --UNMAPPED_READ_STRATEGY COPY_TO_TAG \
        --ALIGNER_PROPER_PAIR_FLAGS true \
        --UNMAP_CONTAMINANT_READS true {params.warp_workflow_params}
    '''
shell:
    """
    samtools depth {input} | awk '{{sum+=$3}} END {{print "Average = ",sum/NR, "No of covered Nuc = ", NR}}' > {output.cov}
    samtools flagstat {input} > {output.stats}
    """
shell:
    """
    qualimap \
        bamqc \
        -bam {input.bam} \
        --java-mem-size=15G \
        --paint-chromosome-limits \
        --genome-gc-distr HUMAN \
        -nt {threads} \
        -skip-duplicated \
        --skip-dup-mode 0 \
        -outdir {output} \
        --feature-file {input.bed} \
        -outformat HTML
    """
shell:
    """
    picard MergeSamFiles \
        {params.bams} \
        -OUTPUT {output} \
        --SORT_ORDER "unsorted"
    """
shell:
    """
    picard MarkDuplicates -I {input} \
        -O {output.bam} \
        -M {output.matrix} \
        --VALIDATION_STRINGENCY SILENT \
        --OPTICAL_DUPLICATE_PIXEL_DISTANCE 2500 \
        --ASSUME_SORT_ORDER "queryname" \
        --CREATE_MD5_FILE true \
        --CLEAR_DT false \
        --ADD_PG_TAG_TO_READS false
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        SortSam \
        --INPUT {input} \
        --OUTPUT /dev/stdout \
        --SORT_ORDER "coordinate" \
        --CREATE_INDEX false \
        --CREATE_MD5_FILE false | gatk \
        --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        SetNmMdAndUqTags \
        --INPUT /dev/stdin \
        --OUTPUT {output} \
        --CREATE_INDEX true \
        --CREATE_MD5_FILE true \
        --REFERENCE_SEQUENCE {params.fa}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        BaseRecalibrator \
        -R {params.ref} \
        -I {input} \
        --use-original-qualities \
        --known-sites {params.known_sites} \
        --known-sites {params.known_sites2} \
        --known-sites {params.known_sites3} \
        -L {params.bed} \
        --interval-padding {params.padding} \
        -O {output}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ApplyBQSR -R {params.ref} \
        -I {input.bam} \
        -bqsr {input.report} -O {output} \
        -L {params.bed} \
        --interval-padding {params.padding} \
        --static-quantized-quals 10 --static-quantized-quals 20 --static-quantized-quals 30 \
        --add-output-sam-program-record \
        --create-output-bam-md5 \
        --use-original-qualities

    samtools index {output}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" BaseRecalibrator -R {params.ref} \
        -I {input} \
        --known-sites {params.known_sites} \
        --known-sites {params.known_sites2} \
        --known-sites {params.known_sites3} \
        -L {params.bed} \
        --interval-padding {params.padding} \
        -O {output}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" AnalyzeCovariates -before {input.raw} \
        -after {input.bqsr} -plots {output}
    """
shell:
    """
    bedtools slop -i {input.bed} -g {params.ref}.fai -b {params.padding} > {output}
    """
shell:
    '''
    grep -vE "^@" {input.intervalslist} | awk -v OFS='\t' '$2=$2-1' | bedtools intersect -c -a {input.BED} -b - | cut -f6 > {output.target_overlap_counts}

    function restrict_to_overlaps() {{
        # print lines from whole-genome file from loci with non-zero overlap
        # with target intervals
        WGS_FILE=$1
        EXOME_FILE=$2
        paste {output.target_overlap_counts} $WGS_FILE | grep -Ev "^0" | cut -f 2- > $EXOME_FILE
        echo "Generated $EXOME_FILE"
    }}

    restrict_to_overlaps {input.UD} {output.UD}
    restrict_to_overlaps {input.BED} {output.BED}
    restrict_to_overlaps {input.MU} {output.MU}
    '''
shell:
    '''
    python3 {params.get_con_file} \
        --prefix {params.prefix} \
        --output {output.contamination} \
        --svdprefix {params.svdprefix} \
        --get_con_file {params.get_con_file} \
        --target_bam {input.target_bam} \
        --ref {input.ref} \
        --threads {threads}
    '''
shell:
    """
    cont=$(cat {input.contamination})
    echo $cont

    gatk --java-options "-Xmx{resources.reduced}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        HaplotypeCaller -R {params.ref} \
        -G StandardAnnotation -G StandardHCAnnotation \
        -G AS_StandardAnnotation {params.extra_args} \
        -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
        -L {params.bed} \
        --interval-padding {params.padding} \
        -I {input.bam} --native-pair-hmm-threads {threads} -ERC GVCF -O {output.vcf} \
        -contamination $cont \
        -bamout {output.bam} \
    """
shell:
    """
    ## IMPORTANT!
    # Note that when uncalled alleles are dropped,
    # the original GQ may increase. Use --keep-all-alts
    # if GQ accuracy is a concern.

    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ReblockGVCF \
        -R {params.ref} \
        -G StandardAnnotation \
        -G StandardHCAnnotation \
        -G AS_StandardAnnotation \
        --floor-blocks -OVI \
        -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
        -L {params.bed} \
        --interval-padding {params.padding} \
        -do-qual-approx \
        -V {input} -O {output}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        VariantFiltration \
        -V {input} \
        -L {params.bed} \
        --interval-padding {params.padding} \
        --filter-expression "QD < 2.0 || FS > 30.0 || SOR > 3.0 || MQ < 40.0 || MQRankSum < -3.0 || ReadPosRankSum < -3.0" \
        --filter-name "HardFiltered" \
        -O {output}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CNNScoreVariants \
        -V {input.vcf} \
        -R {params.ref} \
        -O {output} \
        -I {input.bam} \
        -tensor-type read-tensor
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        FilterVariantTranches \
        -V ~{input_vcf} \
        -O ~{vcf_basename}.filtered.vcf.gz \
        --resource {params.hapmap_resource_vcf} \
        --resource {params.omni_resource_vcf} \
        --resource {params.one_thousand_genomes_resource_vcf} \
        --resource {params.dbsnp_resource_vcf} \
        --info-key CNN_1D \
        --snp-tranche 99.95 \
        --indel-tranche 99.4 \
        --create-output-variant-index true
    """
shell:
    """
    touch {output}
    for i in {params.IDs} ; do
        echo -e "${{i}}\t04-1_gvcf-processing/reblocked/${{i}}.gvcf.gz" | cat >> {output}
    done
    """
shell:
    """
    sort -k1,1 -k2,2n {input.bedfile} | awk '
    {{
        # If it is a different chromosome or if it is a non-overlapping interval
        if ($1 != chrom) {{
            # If it is not the first interval
            if (chrom) {{
                print chrom, start, end;
            }}
            chrom = $1; start = $2; end = $3;
        }} else if ($3 > end) {{
            # If it is an overlapping interval
            end = $3;
        }}
    }}
    END {{
        # Print the last interval
        print chrom, start, end;
    }}
    ' > {output.newbed}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        SplitIntervals \
        -R {input.ref} \
        -L {input.intervals} \
        --scatter-count {params.scatter_count} \
        --interval-merging-rule ALL \
        -mode INTERVAL_SUBDIVISION \
        -O {output.dir}
    """
shell:
    """
    # TODO: we can implement the update here: by checking and replacing:
    ## --genomicsdb-update-workspace-path path_to_dir instead of:
    ## --genomicsdb-workspace-path path_to_dir
    ## but when updating I don't need to add -L as it figures that out from the samples automatically
    # sources: https://eriqande.github.io/eca-bioinf-handbook/variant-calling.html#genomics-db
    # finally some notes:
    ## the documentation for GenomicsDBImport is replete with warnings
    ## about how you should backup your genomics data bases before trying
    ## to update them. Apparently there is a good chance that any little
    ## glitch during the updating process could corrupt the entire data base and render it useless. Joy!
    ## With that as a preamble, it does seem that we should try creating some data bases and then adding samples to them.
    ### use --genomicsdb-shared-posixfs-optimizations for HPC

    mkdir -p 04-2_GenomicsDB

    gatk --java-options "-Xmx{resources.reduced}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads} -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
        GenomicsDBImport \
        --sample-name-map {input.sample_map} \
        -L {input.intervals} \
        --genomicsdb-workspace-path {output} \
        --batch-size 50 \
        --reader-threads 4 \
        --merge-input-intervals \
        --consolidate \
        --genomicsdb-shared-posixfs-optimizations true \
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        GenotypeGVCFs \
        -G StandardAnnotation -G StandardHCAnnotation \
        -G AS_StandardAnnotation \
        -R {params.ref} -V gendb://{input.infile} -O {output} \
        {params.Additional_annotation} --merge-input-intervals \
        -L {input.intervals} \
        --only-output-calls-starting-in-intervals \
    """
shell:
    """
    gatk --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' \
        GatherVcfsCloud \
        --ignore-safety-checks \
        --gather-type BLOCK \
        {params.gvcfs} \
        -O {output}

    tabix {output}
    """
shell:
    """
    touch {output}
    gatk VariantEval \
        -R {params.ref} \
        --eval:{wildcards.sample} {input} \
        -O {output}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" SelectVariants \
        -R {params.ref} \
        -V {input} \
        --select-type-to-include INDEL \
        -O {output}
    """
shell:
    """
    gatk VariantFiltration \
        --variant {input} \
        --filter-expression "QD < 2.0" --filter-name "QD2" \
        --filter-expression "SOR > 10.0" --filter-name "SOR10" \
        --filter-expression "FS > 200.0" --filter-name "FS200" \
        --filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
        --filter-expression "InbreedingCoeff < -0.8" --filter-name "InbreedingCoeff-0.8" \
        --create-output-variant-index true \
        --output {output}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" SelectVariants \
        -R {params.ref} \
        -V {input} \
        --select-type-to-include SNP \
        -O {output}
    """
shell:
    """
    gatk VariantFiltration \
        --variant {input} \
        --filter-expression "QD < 2.0" --filter-name "QD2" \
        --filter-expression "SOR > 3.0" --filter-name "SOR3" \
        --filter-expression "FS > 60.0" --filter-name "FS60" \
        --filter-expression "MQ < 40.0" --filter-name "MQ40" \
        --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
        --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
        --create-output-variant-index true \
        --output {output}
    """
shell:
    """
    bcftools csq -f {params.ref} -g {params.gff} {input} -Ov -o {output}
    """
shell:
    """
    bcftools stats {input} > {output}
    """
shell:
    """
    plot-vcfstats -p {output} {input}
    """
shell:
    """
    eval "$(conda shell.bash hook)"
    set +u; conda activate dotnet; set -u

    dotnet {params.Nirvana_cmd} \
        -i {input} \
        -o {params.file_name} \
        -r {params.Nirvana_ref} \
        --sd {params.Nirvana_supplementray} \
        -c {params.Nirvana_cache} {params.extra}
    """
shell:
    """
    snpEff -Xmx{resources.mem_gb}G \
        -v -csvStats {output.stats} \
        -stats {output.html} \
        {params.genome} {input} > {output.vcf}
    """
shell:
    """
    mkdir -p {output}
    perl {params.annovar_dir}/table_annovar.pl {input} {params.annovar_dir}/humandb/ \
        -buildver hg38 \
        -out {params.output} -remove \
        -protocol {params.protocol} \
        -operation {params.operation} \
        -nastring . \
        -vcfinput \
        --thread {threads} {params.extra}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" CollectQualityYieldMetrics \
        --INPUT {input} \
        --OQ true \
        --OUTPUT {output}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectMultipleMetrics \
        --INPUT {input} \
        --REFERENCE_SEQUENCE {params.ref} \
        --OUTPUT {params.prefix} \
        --ASSUME_SORTED true \
        --PROGRAM null \
        --PROGRAM CollectAlignmentSummaryMetrics \
        --PROGRAM CollectSequencingArtifactMetrics \
        --PROGRAM QualityScoreDistribution \
        --PROGRAM CollectGcBiasMetrics \
        --METRIC_ACCUMULATION_LEVEL null \
        --METRIC_ACCUMULATION_LEVEL SAMPLE \
        --METRIC_ACCUMULATION_LEVEL LIBRARY \
        --METRIC_ACCUMULATION_LEVEL ALL_READS
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ConvertSequencingArtifactToOxoG \
        --INPUT_BASE {params.inPasename} \
        --OUTPUT_BASE {params.prefix} \
        --REFERENCE_SEQUENCE {params.ref} \
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CrosscheckFingerprints \
        --OUTPUT {output} \
        --HAPLOTYPE_MAP {params.htdbf} \
        --EXPECT_ALL_GROUPS_TO_MATCH true \
        {params.files} \
        --LOD_THRESHOLD -10.0 \
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectMultipleMetrics \
        --INPUT {input} \
        --REFERENCE_SEQUENCE {params.ref} \
        --OUTPUT {params.prefix} \
        --ASSUME_SORTED true \
        --PROGRAM null \
        --PROGRAM CollectBaseDistributionByCycle \
        --PROGRAM CollectInsertSizeMetrics \
        --PROGRAM MeanQualityByCycle \
        --PROGRAM QualityScoreDistribution \
        --METRIC_ACCUMULATION_LEVEL null \
        --METRIC_ACCUMULATION_LEVEL ALL_READS
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectMultipleMetrics \
        --INPUT {input} \
        --REFERENCE_SEQUENCE {params.ref} \
        --OUTPUT {params.prefix} \
        --ASSUME_SORTED true \
        --PROGRAM null \
        --PROGRAM CollectAlignmentSummaryMetrics \
        --PROGRAM CollectGcBiasMetrics \
        --METRIC_ACCUMULATION_LEVEL null \
        --METRIC_ACCUMULATION_LEVEL READ_GROUP
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ValidateSamFile \
        --INPUT {input} \
        --OUTPUT {output} \
        --REFERENCE_SEQUENCE {params.ref} \
        --MODE VERBOSE \
        --IS_BISULFITE_SEQUENCED false
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ValidateVariants \
        -V {input} \
        -R {params.ref} \
        --validation-type-to-exclude ALLELES
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        BedToIntervalList \
        -I {input.bed} \
        -O {output} \
        -SD {input.seq_dict}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectVariantCallingMetrics \
        -I {input.vcf} \
        --OUTPUT {params.prefix} \
        --DBSNP {params.dbsnp} \
        --SEQUENCE_DICTIONARY {params.refdict} \
        --TARGET_INTERVALS {input.intervals} \
        --GVCF_INPUT true
    """
shell:
    """
    java \
        -jar {params.path_to_tool}/DISCVRSeq-1.3.49.jar VariantQC \
        -R {params.ref} \
        -V {input} \
        -O {output}
    """
shell:
    """
    mkdir -p 000_CNV_Intervalsprep

    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        PreprocessIntervals \
        -R {input.ref} \
        -L {params.bed} \
        --padding 250 \
        --bin-length 0 \
        -imr OVERLAPPING_ONLY \
        -O {output}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectReadCounts \
        -R {params.ref} \
        -L {input.intervals} \
        -imr OVERLAPPING_ONLY \
        -I {input.sample} \
        -O {output}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        AnnotateIntervals \
        -L {input.intervals} \
        -R {params.ref} \
        -imr OVERLAPPING_ONLY \
        -O {output}
    """
shell:
    """
    echo {params.samples}

    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        FilterIntervals \
        -L {input.preprocessedIntervals} \
        --annotated-intervals {input.annotatedIntervals} \
        {params.samples} \
        -imr OVERLAPPING_ONLY \
        --minimum-gc-content 0.1 \
        --maximum-gc-content 0.9 \
        --minimum-mappability 0.9 \
        --maximum-mappability 1.0 \
        --minimum-segmental-duplication-content 0.0 \
        --maximum-segmental-duplication-content 0.5 \
        --low-count-filter-count-threshold 5 \
        --low-count-filter-percentage-of-samples 90.0 \
        --extreme-count-filter-minimum-percentile 1.0 \
        --extreme-count-filter-maximum-percentile 99.0 \
        --extreme-count-filter-percentage-of-samples 90.0 \
        -O {output}
    """
shell:
    """
    # eval "$(conda shell.bash hook)"
    # set +u; conda activate gatk ; set -u

    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        DetermineGermlineContigPloidy \
        -L {input.filteredIntervals} \
        {params.samples} \
        -imr OVERLAPPING_ONLY \
        --contig-ploidy-priors {params.contigPloidyPriors} \
        -O {output.dir} \
        --output-prefix ploidy \
        --verbosity DEBUG
    """
shell:
    """
    # eval "$(conda shell.bash hook)"
    # set +u; conda activate gatk; set -u

    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        GermlineCNVCaller \
        --run-mode COHORT \
        -L {input.interval} \
        -I {params.samples} \
        --contig-ploidy-calls {input.ploidy}/dogs-calls \
        --annotated-intervals {input.annotated} \
        --output-prefix {params.outpref} \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O {params.outdir}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        PostprocessGermlineCNVCalls \
        --model-shard-path {input.model} \
        --calls-shard-path {input.call} \
        --allosomal-contig chrX --allosomal-contig chrY \
        --contig-ploidy-calls ploidy-calls \
        --sample-index {params.index} \
        --output-genotyped-intervals {output.intervals} \
        --output-genotyped-segments {output.segments} \
        --sequence-dictionary {input.seq_dict}
    """
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        VariantsToTable \
        -V {input} \
        -F CHROM -F POS -F END -GF NP -GF CN \
        -O {output}
    """
shell:
    """
    sampleName=$(gzcat {input.vcf} | grep -v '##' | head -n1 | cut -f10)
    awk -v sampleName=$sampleName 'BEGIN {FS=OFS="\t"} {print sampleName, $0}' {input.txt} > ${i}.seg;
    head ${i}.seg
    """
Support
- Future updates