Genomic variants - SNPs and INDELs detection using GATK4.
Author: AMBARISH KUMAR er.ambarish@gmail.com & ambari73_sit@jnu.ac.in
This is a proposed standard operating procedure (SOP) for detecting genomic variants with GATK4.
It is intended to be useful for identifying SARS-CoV-2 genome variants.
It takes Illumina RNA-seq reads and a reference genome sequence as input.
Code Snippets
# bowtie2: choose the read-input flags and redirect stderr to a log file
baseCommand:
  - bowtie2
arguments:
  - valueFrom: |
      ${
        if (inputs.filelist && inputs.filelist_mates){
          return "-1";
        } else if (inputs.filelist){
          return "-U";
        } else {
          return null;
        }
      }
    position: 82
  - valueFrom: |
      ${
        if (inputs.filelist && inputs.filelist_mates){
          return "-2";
        } else if (inputs.filelist_mates){
          return "-U";
        } else {
          return null;
        }
      }
    position: 84
  - valueFrom: |
      ${
        if (inputs.output_filename == ""){
          return ' 2> ' + default_output_filename().split('.').slice(0,-1).join('.') + '.log';
        } else {
          return ' 2> ' + inputs.output_filename.split('.').slice(0,-1).join('.') + '.log';
        }
      }
    position: 100000
    shellQuote: false
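The branches in the bowtie2 snippet above can be traced outside a CWL runner. The following is a minimal JavaScript sketch, not part of the workflow: `readFlags` and `logRedirect` are hypothetical helper names, the `inputs` object only stands in for the CWL inputs, and the file names are placeholders. It mirrors the expressions as written, including the case where only `filelist_mates` is set and the second expression returns `-U`.

```javascript
// Hypothetical helpers mirroring the three valueFrom expressions in the snippet.
// Only the truthiness of inputs.filelist / inputs.filelist_mates matters here.
function readFlags(inputs) {
  let first = null; // expression at position 82
  if (inputs.filelist && inputs.filelist_mates) {
    first = "-1";
  } else if (inputs.filelist) {
    first = "-U";
  }

  let second = null; // expression at position 84
  if (inputs.filelist && inputs.filelist_mates) {
    second = "-2";
  } else if (inputs.filelist_mates) {
    second = "-U";
  }
  return [first, second];
}

// Expression at position 100000: drop the last extension and append ".log".
function logRedirect(outputFilename) {
  return " 2> " + outputFilename.split(".").slice(0, -1).join(".") + ".log";
}

// Placeholder file names, for illustration only.
console.log(readFlags({ filelist: "reads_1.fastq", filelist_mates: "reads_2.fastq" }));
console.log(readFlags({ filelist: "reads.fastq", filelist_mates: null }));
console.log(logRedirect("sample.sam"));
```

Note that a file name without any period leaves an empty stem, so the redirect target becomes `.log`.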
# bowtie2-build: redirect stderr to <index base>.log
baseCommand:
  - bowtie2-build
arguments:
  - valueFrom: $('2> ' + inputs.bt2_index_base + '.log')
    position: 100000
    shellQuote: false
# GATK4 tools invoked by the workflow (only the base commands are shown)
baseCommand:
  - gatk
  - HaplotypeCallerSpark

baseCommand:
  - gatk
  - SelectVariants

baseCommand:
  - gatk
  - SplitNCigarReads

baseCommand:
  - gatk
  - VariantFiltration
# picard AddOrReplaceReadGroups: assign all reads to a single read group
baseCommand:
  - picard
  - AddOrReplaceReadGroups

doc: |-
  Assigns all the reads in a file to a single new read-group.

  <h3>Summary</h3>
  Many tools (Picard and GATK for example) require or assume the presence of at least one <code>RG</code> tag,
  defining a "read-group" to which each read can be assigned (as specified in the <code>RG</code> tag in the SAM record).
  This tool enables the user to assign all the reads in the INPUT to a single new read-group.
  For more information about read-groups, see the <a href='https://www.broadinstitute.org/gatk/guide/article?id=6472'>
  GATK Dictionary entry.</a>
  <br />
  This tool accepts as INPUT BAM and SAM files or URLs from the
  <a href="http://ga4gh.org/#/documentation">Global Alliance for Genomics and Health (GA4GH)</a>.

  <h3>Caveats</h3>
  The value of the tags must adhere (according to the <a href="https://samtools.github.io/hts-specs/SAMv1.pdf">SAM-spec</a>)
  with the regex <pre>#READGROUP_ID_REGEX</pre> (one or more characters from the ASCII range 32 through 126).
  In particular <code><Space></code> is the only non-printing character allowed.
  <br/>
  The program enables only the wholesale assignment of all the reads in the INPUT to a single read-group. If your file
  already has reads assigned to multiple read-groups, the original <code>RG</code> value will be lost.

  Documentation: http://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups

requirements:
  ShellCommandRequirement: {}
  InlineJavascriptRequirement:
    expressionLib:
      - |
        function generateGATK4BooleanValue(){
            /**
             * Boolean types in GATK 4 are expressed on the command line as --<PREFIX> "true"/"false",
             * so patch here
             */
            if(self === true || self === false){
                return self.toString()
            }
            return self;
        }

hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/picard:2.22.2--0

inputs:
  - doc: Input file (BAM or SAM or a GA4GH url). [synonymous with -I]
    id: INPUT
    type: File
    inputBinding:
      prefix: INPUT=
      separate: false
  - doc: Read-Group library [synonymous with -LB]
    id: RGLB
    type: string
    inputBinding:
      prefix: RGLB=
      separate: false
  - doc: Read-Group platform (e.g. ILLUMINA, SOLID) [synonymous with -PL]
    id: RGPL
    type: string
    inputBinding:
      prefix: RGPL=
      separate: false
  - doc: Read-Group platform unit (eg. run barcode) [synonymous with -PU]
    id: RGPU
    type: string
    inputBinding:
      prefix: RGPU=
      separate: false
  - doc: Read-Group sample name [synonymous with -SM]
    id: RGSM
    type: string
    inputBinding:
      prefix: RGSM=
      separate: false
  - doc: Output filename (BAM or SAM)
    id: OUTPUT
    type: string
    inputBinding:
      prefix: OUTPUT=
      separate: false
  - doc: Reference sequence file. [synonymous with -R]
    id: REFERENCE_SEQUENCE
    type: File?
    inputBinding:
      prefix: REFERENCE_SEQUENCE=
      separate: false
  - doc: Optional sort order to output in. If not supplied OUTPUT is in the same order
      as INPUT. [synonymous with -SO]
    id: SORT_ORDER
    type:
      - 'null'
      - type: enum
        symbols:
          - unsorted
          - queryname
          - coordinate
          - duplicate
          - unknown
    inputBinding:
      prefix: SORT_ORDER=
      separate: false
  - doc: Read-Group sequencing center name [synonymous with -CN]
    id: RGCN
    type: string?
    inputBinding:
      prefix: RGCN=
      separate: false
  - doc: Read-Group description [synonymous with -DS]
    id: RGDS
    type: string?
    inputBinding:
      prefix: RGDS=
      separate: false
  - doc: Read-Group run date in Iso8601Date format [synonymous with -DT]
    id: RGDT
    type: string?
    inputBinding:
      prefix: RGDT=
      separate: false
  - doc: Read-Group flow order [synonymous with -FO]
    id: RGFO
    type: string?
    inputBinding:
      prefix: RGFO=
      separate: false
  - doc: Read-Group ID [synonymous with -ID]
    id: RGID
    type: string?
    inputBinding:
      prefix: RGID=
      separate: false
  - doc: Read-Group key sequence [synonymous with -KS]
    id: RGKS
    type: string?
    inputBinding:
      prefix: RGKS=
      separate: false
  - doc: Read-Group program group [synonymous with -PG]
    id: RGPG
    type: string?
    inputBinding:
      prefix: RGPG=
      separate: false
  - doc: Read-Group predicted insert size [synonymous with -PI]
    id: RGPI
    type: int?
    inputBinding:
      prefix: RGPI=
      separate: false
  - doc: Read-Group platform model [synonymous with -PM]
    id: RGPM
    type: string?
    inputBinding:
      prefix: RGPM=
      separate: false
  - doc: Control verbosity of logging.
    id: VERBOSITY
    type:
      - 'null'
      - type: enum
        symbols:
          - ERROR
          - WARNING
          - INFO
          - DEBUG
    inputBinding:
      prefix: VERBOSITY=
      separate: false
  - doc: Whether to suppress job-summary info on System.err.
    id: QUIET
    type: boolean?
    inputBinding:
      prefix: QUIET=
      valueFrom: $(generateGATK4BooleanValue())
      separate: false
  - doc: Validation stringency for all SAM files read by this program. Setting stringency
      to SILENT can improve performance when processing a BAM file in which variable-length
      data (read, qualities, tags) do not otherwise need to be decoded.
    id: VALIDATION_STRINGENCY
    type:
      - 'null'
      - type: enum
        symbols:
          - STRICT
          - LENIENT
          - SILENT
    inputBinding:
      prefix: VALIDATION_STRINGENCY=
      separate: false
  - doc: Compression level for all compressed files created (e.g. BAM and VCF).
    id: COMPRESSION_LEVEL
    type: int?
    inputBinding:
      prefix: COMPRESSION_LEVEL=
      separate: false
  - doc: When writing files that need to be sorted, this will specify the number of
      records stored in RAM before spilling to disk. Increasing this number reduces
      the number of file handles needed to sort the file, and increases the amount
      of RAM needed.
    id: MAX_RECORDS_IN_RAM
    type: int?
    inputBinding:
      prefix: MAX_RECORDS_IN_RAM=
      separate: false
  - doc: Use the JDK Deflater instead of the Intel Deflater for writing compressed
      output [synonymous with -use_jdk_deflater]
    id: USE_JDK_DEFLATER
    type: boolean?
    inputBinding:
      prefix: USE_JDK_DEFLATER=
      separate: false
      valueFrom: $(generateGATK4BooleanValue())
  - doc: Use the JDK Inflater instead of the Intel Inflater for reading compressed
      input [synonymous with -use_jdk_inflater]
    id: USE_JDK_INFLATER
    type: boolean?
    inputBinding:
      prefix: USE_JDK_INFLATER=
      separate: false
      valueFrom: $(generateGATK4BooleanValue())
  - doc: Whether to create a BAM index when writing a coordinate-sorted BAM file.
    id: CREATE_INDEX
    type: boolean?
    inputBinding:
      prefix: CREATE_INDEX=
      valueFrom: $(generateGATK4BooleanValue())
      separate: false
  - doc: 'Whether to create an MD5 digest for any BAM or FASTQ files created. '
    id: CREATE_MD5_FILE
    type: boolean?
    inputBinding:
      prefix: CREATE_MD5_FILE=
      valueFrom: $(generateGATK4BooleanValue())
      separate: false
  - doc: Google Genomics API client_secrets.json file path.
    id: GA4GH_CLIENT_SECRETS
    type: File?
    inputBinding:
      prefix: GA4GH_CLIENT_SECRETS=
      separate: false

arguments:
  - TMP_DIR=$(runtime.tmpdir)
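The `generateGATK4BooleanValue` helper in the snippet above reads the CWL `self` value and turns booleans into the strings `true` and `false`, so that a boolean input ends up as a `KEY=true` or `KEY=false` token. The sketch below is a standalone JavaScript illustration under stated assumptions: `toGatkBoolean` and `bindKeyValue` are hypothetical names, the value is passed as a parameter instead of being read from `self`, and the effect of `separate: false` is modelled as plain string concatenation. The sample name is a placeholder.

```javascript
// Standalone version of the snippet's helper: booleans become "true"/"false",
// anything else is passed through unchanged.
function toGatkBoolean(value) {
  if (value === true || value === false) {
    return value.toString();
  }
  return value;
}

// Hypothetical model of an inputBinding with `separate: false`:
// the prefix and the rendered value are joined into one token.
// An input that was not supplied (null/undefined) yields no token.
function bindKeyValue(prefix, value) {
  const rendered = toGatkBoolean(value);
  if (rendered === null || rendered === undefined) {
    return null;
  }
  return prefix + rendered;
}

console.log(bindKeyValue("CREATE_INDEX=", true));
console.log(bindKeyValue("QUIET=", false));
console.log(bindKeyValue("RGSM=", "sample1")); // "sample1" is a placeholder
```

Without the conversion, a `false` value would have no string form to append to the prefix, which is presumably why the helper is attached to every boolean input in the tool description.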
# picard CreateSequenceDictionary: write <reference nameroot>.dict for the reference FASTA
baseCommand:
  - picard
  - CreateSequenceDictionary

doc: |-
  Create a SAM/BAM file from a fasta containing reference sequence. The output SAM file contains a header but no
  SAMRecords, and the header contains only sequence records.

requirements:
  ShellCommandRequirement: {}
  InitialWorkDirRequirement:
    listing:
      - $(inputs.REFERENCE)
  InlineJavascriptRequirement:
    expressionLib:
      - |
        function generateGATK4BooleanValue(){
            /**
             * Boolean types in GATK 4 are expressed on the command line as --<PREFIX> "true"/"false",
             * so patch here
             */
            if(self === true || self === false){
                return self.toString()
            }
            return self;
        }

hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/picard:2.22.2--0

inputs:
  - doc: Input reference fasta or fasta.gz [synonymous with -R]
    id: REFERENCE
    type: File
    inputBinding:
      valueFrom: REFERENCE=$(self.basename)
  - doc: Put into AS field of sequence dictionary entry if supplied [synonymous with
      -AS]
    id: GENOME_ASSEMBLY
    type: string?
    inputBinding:
      prefix: GENOME_ASSEMBLY=
      separate: false
  - doc: Put into UR field of sequence dictionary entry. If not supplied, input reference
      file is used [synonymous with -UR]
    id: URI
    type: string?
    inputBinding:
      prefix: URI=
      separate: false
  - doc: Put into SP field of sequence dictionary entry [synonymous with -SP]
    id: SPECIES
    type: string?
    inputBinding:
      prefix: SPECIES=
      separate: false
  - doc: Make sequence name the first word from the > line in the fasta file. By default
      the entire contents of the > line is used, excluding leading and trailing whitespace.
    id: TRUNCATE_NAMES_AT_WHITESPACE
    type: boolean?
    inputBinding:
      prefix: TRUNCATE_NAMES_AT_WHITESPACE=
      valueFrom: $(generateGATK4BooleanValue())
      separate: false
  - doc: Stop after writing this many sequences. For testing.
    id: NUM_SEQUENCES
    type: int?
    inputBinding:
      prefix: NUM_SEQUENCES=
      separate: false
  - doc: "Optional file containing the alternative names for the contigs. Tools may\
      \ use this information to consider different contig notations as identical (e.g:\
      \ 'chr1' and '1'). The alternative names will be put into the appropriate @AN\
      \ annotation for each contig. No header. First column is the original name, the\
      \ second column is an alternative name. One contig may have more than one alternative\
      \ name. [synonymous with -AN]"
    id: ALT_NAMES
    type: File?
    inputBinding:
      prefix: ALT_NAMES=
      separate: false
  - doc: Control verbosity of logging.
    id: VERBOSITY
    type:
      - 'null'
      - type: enum
        symbols:
          - ERROR
          - WARNING
          - INFO
          - DEBUG
    inputBinding:
      prefix: VERBOSITY=
      separate: false
  - doc: Whether to suppress job-summary info on System.err.
    id: QUIET
    type: boolean?
    inputBinding:
      prefix: QUIET=
      valueFrom: $(generateGATK4BooleanValue())
      separate: false
  - doc: Validation stringency for all SAM files read by this program. Setting stringency
      to SILENT can improve performance when processing a BAM file in which variable-length
      data (read, qualities, tags) do not otherwise need to be decoded.
    id: VALIDATION_STRINGENCY
    type:
      - 'null'
      - type: enum
        symbols:
          - STRICT
          - LENIENT
          - SILENT
    inputBinding:
      prefix: VALIDATION_STRINGENCY=
      separate: false
  - doc: Compression level for all compressed files created (e.g. BAM and VCF).
    id: COMPRESSION_LEVEL
    type: int?
    inputBinding:
      prefix: COMPRESSION_LEVEL=
      separate: false
  - doc: When writing files that need to be sorted, this will specify the number of
      records stored in RAM before spilling to disk. Increasing this number reduces
      the number of file handles needed to sort the file, and increases the amount
      of RAM needed.
    id: MAX_RECORDS_IN_RAM
    type: int?
    inputBinding:
      prefix: MAX_RECORDS_IN_RAM=
      separate: false
  - doc: Use the JDK Deflater instead of the Intel Deflater for writing compressed
      output [synonymous with -use_jdk_deflater]
    id: USE_JDK_DEFLATER
    type: boolean?
    inputBinding:
      prefix: USE_JDK_DEFLATER=
      separate: false
      valueFrom: $(generateGATK4BooleanValue())
  - doc: Use the JDK Inflater instead of the Intel Inflater for reading compressed
      input [synonymous with -use_jdk_inflater]
    id: USE_JDK_INFLATER
    type: boolean?
    inputBinding:
      prefix: USE_JDK_INFLATER=
      separate: false
      valueFrom: $(generateGATK4BooleanValue())
  - doc: Whether to create a BAM index when writing a coordinate-sorted BAM file.
    id: CREATE_INDEX
    type: boolean?
    inputBinding:
      prefix: CREATE_INDEX=
      valueFrom: $(generateGATK4BooleanValue())
      separate: false
  - doc: 'Whether to create an MD5 digest for any BAM or FASTQ files created. '
    id: CREATE_MD5_FILE
    type: boolean?
    inputBinding:
      prefix: CREATE_MD5_FILE=
      valueFrom: $(generateGATK4BooleanValue())
      separate: false
  - doc: Google Genomics API client_secrets.json file path.
    id: GA4GH_CLIENT_SECRETS
    type: File?
    inputBinding:
      prefix: GA4GH_CLIENT_SECRETS=
      separate: false

arguments:
  - TMP_DIR=$(runtime.tmpdir)
  - OUTPUT=$(inputs.REFERENCE.nameroot).dict
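The `OUTPUT=$(inputs.REFERENCE.nameroot).dict` argument above derives the dictionary name from the reference file name. As a rough JavaScript illustration, assuming the usual CWL rule that `nameroot` is the basename with only its final extension removed (and that a leading period does not count as an extension), the resulting names can be worked out as follows. `splitBasename` and `dictName` are hypothetical helpers, and the file names are placeholders.

```javascript
// Split a basename into nameroot/nameext, following the assumed CWL rule:
// nameext is empty or starts at the last period; a leading period is ignored.
function splitBasename(basename) {
  const dot = basename.lastIndexOf(".");
  if (dot <= 0) {
    // No period at all, or only a leading one: the whole name is the root.
    return { nameroot: basename, nameext: "" };
  }
  return { nameroot: basename.slice(0, dot), nameext: basename.slice(dot) };
}

// Mirror of the snippet's argument: <nameroot>.dict
function dictName(basename) {
  return splitBasename(basename).nameroot + ".dict";
}

console.log(dictName("genome.fasta"));    // genome.dict
console.log(dictName("genome.fasta.gz")); // genome.fasta.dict
```

Under this assumption a gzipped reference gives `genome.fasta.dict` instead of `genome.dict`, which is worth checking if a later step looks for the dictionary next to the reference by name.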
# Picard MarkDuplicates, run via java -jar: remove duplicates from a sorted BAM
baseCommand: ["java", "-jar"]
arguments:
  - valueFrom: "MarkDuplicates"
    position: 2
  - valueFrom: $(inputs.bam_sorted.nameroot)_duprem.bam
    prefix: "OUTPUT="
    separate: false
    position: 13
  - valueFrom: $(inputs.bam_sorted.nameroot)_duprem.log
    prefix: "METRICS_FILE="
    separate: false
    position: 13
    # log file
  - valueFrom: "REMOVE_DUPLICATES=TRUE"
    position: 14
  - valueFrom: "ASSUME_SORTED=TRUE"
    position: 15
  - valueFrom: "VALIDATION_STRINGENCY=SILENT"
    position: 16
  - valueFrom: "VERBOSITY=INFO"
    position: 17
  - valueFrom: "QUIET=false"
    position: 17
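To see which tokens the MarkDuplicates arguments above contribute to the command line, the sketch below sorts them by `position` and joins each prefix to its value. It is a simplified JavaScript model, not a CWL runner: `markDuplicatesArgs` is a hypothetical function, `sample_sorted` is a placeholder for `inputs.bam_sorted.nameroot`, entries with equal positions are assumed to keep their listed order, and the jar path and input BAM bindings are left out because they are not part of the snippet.

```javascript
// Hypothetical model of the snippet's `arguments` list.
function markDuplicatesArgs(nameroot) {
  const args = [
    { value: "MarkDuplicates", position: 2 },
    { prefix: "OUTPUT=", value: nameroot + "_duprem.bam", position: 13 },
    { prefix: "METRICS_FILE=", value: nameroot + "_duprem.log", position: 13 },
    { value: "REMOVE_DUPLICATES=TRUE", position: 14 },
    { value: "ASSUME_SORTED=TRUE", position: 15 },
    { value: "VALIDATION_STRINGENCY=SILENT", position: 16 },
    { value: "VERBOSITY=INFO", position: 17 },
    { value: "QUIET=false", position: 17 },
  ];
  return args
    .slice() // do not mutate the list above
    .sort((a, b) => a.position - b.position) // stable sort keeps ties in listed order
    .map((a) => (a.prefix || "") + a.value); // separate: false => prefix and value are joined
}

// "sample_sorted" is a placeholder nameroot, for illustration only.
console.log(markDuplicatesArgs("sample_sorted").join(" "));
```

With `REMOVE_DUPLICATES=TRUE`, the `_duprem.bam` output no longer contains the reads flagged as duplicates, and the `_duprem.log` file holds the duplication metrics.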
# samtools faidx: index the reference FASTA
baseCommand: [ samtools, faidx ]
inputs:
  sequences:
    type: File
    doc: Input FASTA file
arguments:
  - $(inputs.sequences.basename)
# samtools index: index the sorted BAM file
baseCommand: ["samtools", "index"]
arguments:
  - valueFrom: -b  # specifies that index is created in bai format
    position: 1
inputs:
  bam_sorted:
    doc: sorted bam input file
    type: File
    inputBinding:
      position: 2
Created: 1yr ago
Updated: 1yr ago
Maintainers:
public
URL:
https://github.com/ambarishK/bio-cwl-tools/blob/release/gatk4W.cwl
Name:
genomic-variants-snps-and-indels-detection-using-g
Version:
Version 1
Downloaded:
0
Copyright:
Public Domain
License:
Boost Software License 1.0
Keywords:
- Future updates