This is a human RNA-seq variant calling workflow that follows the GATK pipeline. It also includes elimination of known ADAR editing sites.
Download
Clone the git repository with the command:
git clone https://github.com/durrantmm/rnaseq_variant_calling_workflow.git
Installation
This workflow uses a conda environment to satisfy all the necessary dependencies.
Make sure you have Anaconda installed (it can be downloaded from the Anaconda website).
You should be able to install the workflow simply by entering the following in your console:
bash install.sh
Once the installation completes, it's time to move on to configuration.
Configuration
This is a very important step in running the workflow properly.
Open the provided config.yaml file to get started.
Set GATK and Picard execution paths
The first two parameters of the config.yaml file are:
gatk_path: "java -jar /path/to/GenomeAnalysisTK.jar"
picard_path: "java -jar /path/to/Picard.jar"
These two strings tell the workflow how to execute GATK and Picard. You can download Picard from https://broadinstitute.github.io/picard/ and GATK from https://software.broadinstitute.org/gatk/download/ (GATK requires you to create an account before downloading). Once they are both installed, enter their execution paths in the config.yaml file.
You may want to add an additional flag, -Xmx50G, to make sure that Java allocates enough memory. The final config.yaml entries would look something like this:
gatk_path: "java -jar -Xmx50G /path/to/GenomeAnalysisTK.jar"
picard_path: "java -jar -Xmx50G /path/to/Picard.jar"
Set working directory
The next parameter of the config.yaml file is wd: test. This is the default working directory, and it contains some test data to try out. You can give this a try by typing:
snakemake --cores 12
It should take a few minutes to run; once it completes without errors, you know you are all set.
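Putting it all together, a minimal config.yaml would look something like the sketch below. The jar paths are placeholders, and your copy of the file may contain additional parameters not covered here:

gatk_path: "java -jar -Xmx50G /path/to/GenomeAnalysisTK.jar"
picard_path: "java -jar -Xmx50G /path/to/Picard.jar"
wd: test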
Setting up a workflow
It is essential that you set up your workflow properly before you run it. Follow these directions exactly.
IMPORTANT RULES AND TIPS TO REMEMBER:
- When naming files, periods must only be used where specified. They cannot be used as a general name delimiter.
  - myfile.R1.fq.gz - CORRECT - this is the right way to name a fastq file.
  - myfile.Run1.Replicate2.R1.fq.gz - INCORRECT - this is the WRONG way to name a fastq file.
  - myfile_Run1_Replicate2.R1.fq.gz - CORRECT - replacing the improper periods with underscores is appropriate.
- Files can be symbolic links to save disk space.
Overview
Navigate to your working directory. Make sure you have chosen a file system that has lots of free disk space available.
From within your working directory, create 5 directories as follows:
mkdir 0.fastq;
mkdir 0.reference_genome_fasta;
mkdir 0.gencode;
mkdir 0.dbsnp;
mkdir 0.adar_sites;
Or it may be easier to download a template working directory by executing
wget https://s3-us-west-1.amazonaws.com/mdurrant/biodb/bundles/rnaseq_variant_calling_workflow/wd_hg19.tar.gz;
tar -zxvf wd_hg19.tar.gz;
Then set your wd parameter in the config file to this directory.
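Once fully populated (the sections below describe how), the working directory should look something like this; the sample name is hypothetical and hg19 is used as the example genome name:

<wd>/
  0.fastq/sample1.R1.fq.gz
  0.fastq/sample1.R2.fq.gz
  0.reference_genome_fasta/hg19.fa
  0.gencode/hg19.gtf
  0.dbsnp/hg19.vcf
  0.adar_sites/hg19.txt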
Now we'll go through each of the directories and describe how to populate them.
0.fastq
This is the directory that contains all of the paired-end FASTQ files to be analyzed. Single-end analysis is possible, but hasn't been built into this workflow automatically. You can figure that one out on your own.
These fastq files must follow this precise format.
<sample1>.R1.fq.gz
<sample1>.R2.fq.gz
<sample2>.R1.fq.gz
<sample2>.R2.fq.gz
...
This folder can contain as many samples as you want. Make sure that the only periods are found in the .R#.fq.gz file extension.
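As noted in the tips above, symbolic links are an easy way to populate this directory without duplicating data. A hypothetical example, where the source paths and sample name are placeholders:

ln -s /data/run1/sampleA_1.fastq.gz 0.fastq/sampleA.R1.fq.gz   # placeholder source path
ln -s /data/run1/sampleA_2.fastq.gz 0.fastq/sampleA.R2.fq.gz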
0.reference_genome_fasta
This directory contains the reference genome fasta file(s) to be used in this analysis. You can include as many reference genomes as you want.
Each genome must also have matching files in 0.gencode, 0.dbsnp, and 0.adar_sites. The identifying <genome> name must be the same across these directories.
They must be named as follows:
<genome>.fa
They cannot be gzipped.
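For example, for hg19 you could fetch the genome from the UCSC download server and decompress it. The URL below is one common source and is given here as an assumption; any uncompressed hg19 fasta whose chromosome names match your GTF and VCF files will work:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz   # URL is an assumption
gunzip hg19.fa.gz
mv hg19.fa 0.reference_genome_fasta/hg19.fa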
0.gencode
This folder contains the GENCODE GTF file to be used by the STAR aligner.
It must follow the naming convention:
<genome>.gtf
It must have the same <genome> name as the corresponding reference genome in 0.reference_genome_fasta. A version corresponding to hg19 can be downloaded from http://www.gencodegenes.org/releases/19.html
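For hg19 (GENCODE release 19), something like the following would work. The FTP URL is an assumption based on the GENCODE release page linked above; also make sure the chromosome names in the GTF match those in your reference fasta:

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz   # URL is an assumption
gunzip gencode.v19.annotation.gtf.gz
mv gencode.v19.annotation.gtf 0.gencode/hg19.gtf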
0.dbsnp
This folder contains the dbsnp VCF file.
It must follow the naming convention
<genome>.vcf
It must have the same <genome> name as the corresponding reference genome in 0.reference_genome_fasta and 0.gencode.
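A dbSNP VCF for hg19 can be obtained, for example, from the GATK resource bundle (e.g. dbsnp_138.hg19.vcf.gz; the exact filename and download location are assumptions, so check the current bundle). Once downloaded, decompress it and rename it to match the genome name:

gunzip dbsnp_138.hg19.vcf.gz   # example filename; substitute the file you downloaded
mv dbsnp_138.hg19.vcf 0.dbsnp/hg19.vcf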
0.adar_sites
This folder contains known ADAR A-to-I RNA editing sites. They are excluded outright from the resulting RNA-seq variant calls.
It must follow the naming convention
<genome>.txt
It must have the same <genome> name as the corresponding reference genome in 0.reference_genome_fasta, 0.gencode, and 0.dbsnp.
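Judging from the code snippet that converts this file to BED (see Code Snippets below), the workflow expects a tab-delimited text file with a single header line, where the first column is the chromosome and the second is the 1-based position of the editing site; any additional columns are ignored. A hypothetical excerpt with placeholder positions (known A-to-I editing sites for hg19 are available from databases such as RADAR):

chromosome	position
chr1	12345
chr2	67890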
Run Snakemake
You should now be able to run the snakemake workflow.
You can run this with the command
snakemake
from the rnaseq_variant_calling_workflow directory.
You can set the number of cores used with
snakemake --cores 12
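Before launching a full run, you can preview the jobs Snakemake plans to execute with a dry run; -n (dry run) and -p (print shell commands) are standard Snakemake flags:

snakemake -n -p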
Cluster Job Submission
If you are using the SGE computing cluster system, you can submit a job using the provided script
qsub submission.sh
You may need to adjust the parameters in cluster.json to make sure that each step has enough memory and wall time allocated to it.
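The exact keys in cluster.json depend on how submission.sh passes them to qsub, so treat the following as a purely hypothetical sketch of the standard Snakemake cluster-config layout (a __default__ entry that individual rules can override) rather than the file's actual contents:

{
    "__default__": {
        "mem": "16G",
        "time": "24:00:00"
    }
}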
Code Snippets
# Final rule: report successful completion
run:
    print("RNAVCW FINISHED WITH NO EXCEPTIONS!")
# Create the reference sequence dictionary with Picard
run:
    command = "{picard} CreateSequenceDictionary R={input} O={output}".format(
        picard=config['picard_path'], input=input, output=output)
    print(command)
    shell(command)
# Index the reference fasta with samtools
shell:
    "samtools faidx {input}"
# Compress the input file with bgzip
shell:
    "bgzip {input}"
# Index the bgzipped VCF with tabix
shell:
    "tabix -p vcf {input}"
# Extract exon intervals from the GTF into a sorted, merged BED file
shell:
    "grep -P \"^[^\t]+\t[^\t]+\texon\" {input} | bedtools sort -i stdin | "
    "bedtools merge -i stdin > {output}"
# Build the STAR genome index
shell:
    "mkdir -p {output.dir}; "
    "STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir {output.dir} "
    "--genomeFastaFiles {input.refgen} --sjdbGTFfile {input.gencode} --sjdbOverhang {params.overhang}"
# Two-pass STAR alignment of the paired-end reads
shell:
    "mkdir -p {output.dir}; "
    "STAR --readFilesIn {input.f1} {input.f2} --outFileNamePrefix {params.out_prefix} "
    "--genomeDir {input.genome_dir} --readFilesCommand zcat --runThreadN {threads} "
    "--genomeLoad NoSharedMemory --outFilterType BySJout "
    "--outSAMunmapped Within "
    "--outSAMattributes NH HI AS NM MD NM "
    "--twopassMode Basic "
    "--sjdbOverhang {params.overhang} "
    "--sjdbGTFfile {input.gencode}"
# Add read group information with Picard AddOrReplaceReadGroups
run:
    command = "{picard} AddOrReplaceReadGroups I={input} O={output} SO=coordinate RGID=id " \
              "RGLB=library RGPL=ILLUMINA RGPU=machine RGSM=sample; ".format(
                  picard=config['picard_path'], input=input, output=output)
    print(command)
    shell(command)
# Mark PCR duplicates with Picard MarkDuplicates
run:
    command = "{picard} MarkDuplicates I={input} O={output_bam} CREATE_INDEX=true " \
              "VALIDATION_STRINGENCY=SILENT M={output_metrics}".format(
                  picard=config['picard_path'], input=input,
                  output_bam=output.bam, output_metrics=output.metrics)
    print(command)
    shell(command)
# Split reads at N CIGAR operations and reassign mapping quality 255 to 60 (GATK SplitNCigarReads)
run:
    command = "{gatk} -T SplitNCigarReads -I {input} -o {output} -R {refgen} " \
              "-rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 " \
              "-U ALLOW_N_CIGAR_READS".format(
                  gatk=config['gatk_path'], input=input.bam, output=output, refgen=input.refgen)
    print(command)
    shell(command)
# Base quality score recalibration using dbSNP known sites, restricted to exon regions
run:
    command = "{gatk} -T BaseRecalibrator -I {input} -o {output} -R {refgen} " \
              "-knownSites {dbsnp} -L {exons_bed}".format(
                  gatk=config['gatk_path'], input=input.bam, output=output,
                  refgen=input.refgen, dbsnp=input.dbsnp, exons_bed=input.exons_bed)
    print(command)
    shell(command)
# Apply the recalibration table to produce recalibrated alignments (GATK PrintReads)
run:
    command = "{gatk} -T PrintReads -I {input} -o {output} -R {refgen} " \
              "-BQSR {recal}".format(
                  gatk=config['gatk_path'], input=input.bam, output=output,
                  refgen=input.refgen, recal=input.recal)
    print(command)
    shell(command)
# Call variants with GATK HaplotypeCaller
run:
    command = "{gatk} -T HaplotypeCaller -I {input} -o {output} -R {refgen} " \
              "-dontUseSoftClippedBases -stand_call_conf 20.0".format(
                  gatk=config['gatk_path'], input=input.bam, output=output, refgen=input.refgen)
    print(command)
    shell(command)
# Flag SNP clusters (3 SNPs per 35 bp window) and apply FS/QD filters (GATK VariantFiltration)
run:
    command = "{gatk} -T VariantFiltration -V {input} -o {output} -R {refgen} " \
              "-window 35 -cluster 3 -filterName FS -filter \"FS > 30.0\" -filterName QD " \
              "-filter \"QD < 2.0\"".format(
                  gatk=config['gatk_path'], input=input.vcf, output=output, refgen=input.refgen)
    print(command)
    shell(command)
# Convert the ADAR editing-site table to a sorted BED file
run:
    tmp_bed = output.adar_bed + '.tmp'
    with open(input.adar_txt) as infile:
        infile.readline()  # skip the header line
        with open(tmp_bed, 'w') as outfile:
            for line in infile:
                line = line.strip().split()
                # convert the 1-based position to a 0-based BED interval
                chrom, start, end = line[0], int(line[1]) - 1, line[1]
                outfile.write('\t'.join([chrom, str(start), end]) + '\n')
    shell('bedtools sort -i {bed} > {output}; rm {bed}'.format(bed=tmp_bed, output=output.adar_bed))
# Remove variants that overlap known ADAR editing sites
shell:
    'bedtools intersect -header -v -a {input.vcf} -b {input.adar_bed} > {output}'
# Keep biallelic SNVs (no indels) and drop homozygous (0/0, 1/1) genotype records
run:
    command = "vcftools --vcf {invcf} --min-alleles 2 --max-alleles 2 " \
              "--remove-indels --recode --stdout | grep -Ev '1/1:|0/0:' | grep -v '1|1:' | grep -v '0|0:' " \
              "> {outvcf}; ".format(invcf=input, outvcf=output)
    print(command)
    shell(command)
Support
- Future updates