This is a human RNA-seq variant calling workflow that follows the GATK pipeline. It also includes elimination of known ADAR editing sites.
Download
Clone the git repository with the command:
git clone https://github.com/durrantmm/rnaseq_variant_calling_workflow.git
Installation
This workflow uses a conda environment to satisfy all the necessary dependencies.
Make sure you have Anaconda installed (it can be downloaded from the Anaconda website).
You should be able to install the workflow simply by entering the following in your console:
bash install.sh
Once the installation completes, it's time to move on to configuration.
Configuration
This is a very important step in running the workflow properly.
Open the provided config.yaml file to get started.
Set GATK and Picard execution paths
The first two parameters of the config.yaml file are:
gatk_path: "java -jar /path/to/GenomeAnalysisTK.jar"
picard_path: "java -jar /path/to/Picard.jar"
These two strings tell the workflow how to execute GATK and Picard. You can download Picard from https://broadinstitute.github.io/picard/ and GATK from https://software.broadinstitute.org/gatk/download/ (GATK requires you to create an account before downloading). Once they are both installed, enter their execution paths in the config.yaml file.
You may want to add an additional flag, -Xmx50G, to make sure that Java allocates enough memory. The final config.yaml entries would look something like this:
gatk_path: "java -jar -Xmx50G /path/to/GenomeAnalysisTK.jar"
picard_path: "java -jar -Xmx50G /path/to/Picard.jar"
Set working directory
The next parameter of the config.yaml file is wd: test. This is the default working directory, and it contains some test data to try out. You can give this a try by typing:
snakemake --cores 12
It should take a few minutes to run; once it completes without errors, you know you are all set.
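Putting it all together, a minimal config.yaml would look something like the sketch below. The jar paths are placeholders, and your copy of the file may contain additional parameters not covered here:

gatk_path: "java -jar -Xmx50G /path/to/GenomeAnalysisTK.jar"
picard_path: "java -jar -Xmx50G /path/to/Picard.jar"
wd: test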
Setting up a workflow
It is essential that you set up your workflow properly before you run it. Follow these directions exactly.
IMPORTANT RULES AND TIPS TO REMEMBER:
- When naming files, periods must only be used where specified. They cannot be used as a general name delimiter.
  - myfile.R1.fq.gz - CORRECT - this is the right way to name a fastq file.
  - myfile.Run1.Replicate2.R1.fq.gz - INCORRECT - this is the WRONG way to name a fastq file.
  - myfile_Run1_Replicate2.R1.fq.gz - CORRECT - replacing the improper periods with underscores is appropriate.
- Files can be symbolic links to save disk space.
Overview
Navigate to your working directory. Make sure you have chosen a file system that has lots of free disk space available.
From within your working directory, create 5 directories as follows:
mkdir 0.fastq;
mkdir 0.reference_genome_fasta;
mkdir 0.gencode;
mkdir 0.dbsnp;
mkdir 0.adar_sites;
Or it may be easier to download a template working directory by executing
wget https://s3-us-west-1.amazonaws.com/mdurrant/biodb/bundles/rnaseq_variant_calling_workflow/wd_hg19.tar.gz;
tar -zxvf wd_hg19.tar.gz;
Then set your wd parameter in the config file to this directory.
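Once fully populated (the sections below describe how), the working directory should look something like this; the sample name is hypothetical and hg19 is used as the example genome name:

<wd>/
  0.fastq/sample1.R1.fq.gz
  0.fastq/sample1.R2.fq.gz
  0.reference_genome_fasta/hg19.fa
  0.gencode/hg19.gtf
  0.dbsnp/hg19.vcf
  0.adar_sites/hg19.txt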
Now we'll go through each of the directories and describe how to populate them.
0.fastq
This is the directory that contains all of the paired-end FASTQ files to be analyzed. Single-end analysis is possible, but hasn't been built into this workflow automatically. You can figure that one out on your own.
These fastq files must follow this precise format.
<sample1>.R1.fq.gz
<sample1>.R2.fq.gz
<sample2>.R1.fq.gz
<sample2>.R2.fq.gz
...
This folder can contain as many samples as you want. Make sure that the only periods are found in the .R#.fq.gz file extension.
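As noted in the tips above, symbolic links are an easy way to populate this directory without duplicating data. A hypothetical example, where the source paths and sample name are placeholders:

ln -s /data/run1/sampleA_1.fastq.gz 0.fastq/sampleA.R1.fq.gz   # placeholder source path
ln -s /data/run1/sampleA_2.fastq.gz 0.fastq/sampleA.R2.fq.gz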
0.reference_genome_fasta
This directory contains the reference genome fasta file(s) to be used in this analysis. You can include as many reference genomes as you want.
Each genome must also have matching files in 0.gencode, 0.dbsnp, and 0.adar_sites. The identifying <genome> name must be the same across these directories.
They must be named as follows:
<genome>.fa
They cannot be gzipped.
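For example, for hg19 you could fetch the genome from the UCSC download server and decompress it. The URL below is one common source and is given here as an assumption; any uncompressed hg19 fasta whose chromosome names match your GTF and VCF files will work:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz   # URL is an assumption
gunzip hg19.fa.gz
mv hg19.fa 0.reference_genome_fasta/hg19.fa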
0.gencode
This folder contains the GENCODE GTF file to be used by the STAR aligner.
It must follow the naming convention:
<genome>.gtf
It must have the same <genome> name as the corresponding reference genome in 0.reference_genome_fasta. A version corresponding to hg19 can be downloaded from http://www.gencodegenes.org/releases/19.html
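For hg19 (GENCODE release 19), something like the following would work. The FTP URL is an assumption based on the GENCODE release page linked above; also make sure the chromosome names in the GTF match those in your reference fasta:

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz   # URL is an assumption
gunzip gencode.v19.annotation.gtf.gz
mv gencode.v19.annotation.gtf 0.gencode/hg19.gtf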
0.dbsnp
This folder contains the dbsnp VCF file.
It must follow the naming convention
<genome>.vcf
It must have the same <genome> name as the corresponding reference genome in 0.reference_genome_fasta and 0.gencode.
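A dbSNP VCF for hg19 can be obtained, for example, from the GATK resource bundle (e.g. dbsnp_138.hg19.vcf.gz; the exact filename and download location are assumptions, so check the current bundle). Once downloaded, decompress it and rename it to match the genome name:

gunzip dbsnp_138.hg19.vcf.gz   # example filename; substitute the file you downloaded
mv dbsnp_138.hg19.vcf 0.dbsnp/hg19.vcf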
0.adar_sites
This folder contains known ADAR A-to-I RNA editing sites. They are excluded outright from the resulting RNA-seq variant calls.
It must follow the naming convention
<genome>.txt
It must have the same <genome> name as the corresponding reference genome in 0.reference_genome_fasta, 0.gencode, and 0.dbsnp.
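Judging from the code snippet that converts this file to BED (see Code Snippets below), the workflow expects a tab-delimited text file with a single header line, where the first column is the chromosome and the second is the 1-based position of the editing site; any additional columns are ignored. A hypothetical excerpt with placeholder positions (known A-to-I editing sites for hg19 are available from databases such as RADAR):

chromosome	position
chr1	12345
chr2	67890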
Run Snakemake
You should now be able to run the snakemake workflow.
You can run this with the command
snakemake
from the rnaseq_variant_calling_workflow directory.
You can set the number of cores used with
snakemake --cores 12
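Before launching a full run, you can preview the jobs Snakemake plans to execute with a dry run; -n (dry run) and -p (print shell commands) are standard Snakemake flags:

snakemake -n -p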
Cluster Job Submission
If you are using the SGE computing cluster system, you can submit a job using the provided script
qsub submission.sh
You may need to adjust the parameters in cluster.json to make sure that each step has enough memory and wall time allocated to it.
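The exact keys in cluster.json depend on how submission.sh passes them to qsub, so treat the following as a purely hypothetical sketch of the standard Snakemake cluster-config layout (a __default__ entry that individual rules can override) rather than the file's actual contents:

{
    "__default__": {
        "mem": "16G",
        "time": "24:00:00"
    }
}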
Code Snippets
# Final rule: report successful completion
run:
    print("RNAVCW FINISHED WITH NO EXCEPTIONS!")
# Create the reference sequence dictionary with Picard
run:
    command = "{picard} CreateSequenceDictionary R={input} O={output}".format(
        picard=config['picard_path'], input=input, output=output)
    print(command)
    shell(command)
# Index the reference fasta with samtools
shell:
    "samtools faidx {input}"
# Compress the input file with bgzip
shell:
    "bgzip {input}"
# Index the bgzipped VCF with tabix
shell:
    "tabix -p vcf {input}"
# Extract exon intervals from the GTF into a sorted, merged BED file
shell:
    "grep -P \"^[^\t]+\t[^\t]+\texon\" {input} | bedtools sort -i stdin | "
    "bedtools merge -i stdin > {output}"
# Build the STAR genome index
shell:
    "mkdir -p {output.dir}; "
    "STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir {output.dir} "
    "--genomeFastaFiles {input.refgen} --sjdbGTFfile {input.gencode} --sjdbOverhang {params.overhang}"
# Two-pass STAR alignment of the paired-end reads
shell:
    "mkdir -p {output.dir}; "
    "STAR --readFilesIn {input.f1} {input.f2} --outFileNamePrefix {params.out_prefix} "
    "--genomeDir {input.genome_dir} --readFilesCommand zcat --runThreadN {threads} "
    "--genomeLoad NoSharedMemory --outFilterType BySJout "
    "--outSAMunmapped Within "
    "--outSAMattributes NH HI AS NM MD NM "
    "--twopassMode Basic "
    "--sjdbOverhang {params.overhang} "
    "--sjdbGTFfile {input.gencode}"
# Add read group information with Picard AddOrReplaceReadGroups
run:
    command = "{picard} AddOrReplaceReadGroups I={input} O={output} SO=coordinate RGID=id " \
              "RGLB=library RGPL=ILLUMINA RGPU=machine RGSM=sample; ".format(
                  picard=config['picard_path'], input=input, output=output)
    print(command)
    shell(command)
# Mark PCR duplicates with Picard MarkDuplicates
run:
    command = "{picard} MarkDuplicates I={input} O={output_bam} CREATE_INDEX=true " \
              "VALIDATION_STRINGENCY=SILENT M={output_metrics}".format(
                  picard=config['picard_path'], input=input,
                  output_bam=output.bam, output_metrics=output.metrics)
    print(command)
    shell(command)
# Split reads at N CIGAR operations and reassign mapping quality 255 to 60 (GATK SplitNCigarReads)
run:
    command = "{gatk} -T SplitNCigarReads -I {input} -o {output} -R {refgen} " \
              "-rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 " \
              "-U ALLOW_N_CIGAR_READS".format(
                  gatk=config['gatk_path'], input=input.bam, output=output, refgen=input.refgen)
    print(command)
    shell(command)
# Base quality score recalibration using dbSNP known sites, restricted to exon regions
run:
    command = "{gatk} -T BaseRecalibrator -I {input} -o {output} -R {refgen} " \
              "-knownSites {dbsnp} -L {exons_bed}".format(
                  gatk=config['gatk_path'], input=input.bam, output=output,
                  refgen=input.refgen, dbsnp=input.dbsnp, exons_bed=input.exons_bed)
    print(command)
    shell(command)
# Apply the recalibration table to produce recalibrated alignments (GATK PrintReads)
run:
    command = "{gatk} -T PrintReads -I {input} -o {output} -R {refgen} " \
              "-BQSR {recal}".format(
                  gatk=config['gatk_path'], input=input.bam, output=output,
                  refgen=input.refgen, recal=input.recal)
    print(command)
    shell(command)
# Call variants with GATK HaplotypeCaller
run:
    command = "{gatk} -T HaplotypeCaller -I {input} -o {output} -R {refgen} " \
              "-dontUseSoftClippedBases -stand_call_conf 20.0".format(
                  gatk=config['gatk_path'], input=input.bam, output=output, refgen=input.refgen)
    print(command)
    shell(command)
# Flag SNP clusters (3 SNPs per 35 bp window) and apply FS/QD filters (GATK VariantFiltration)
run:
    command = "{gatk} -T VariantFiltration -V {input} -o {output} -R {refgen} " \
              "-window 35 -cluster 3 -filterName FS -filter \"FS > 30.0\" -filterName QD " \
              "-filter \"QD < 2.0\"".format(
                  gatk=config['gatk_path'], input=input.vcf, output=output, refgen=input.refgen)
    print(command)
    shell(command)
# Convert the ADAR editing-site table to a sorted BED file
run:
    tmp_bed = output.adar_bed + '.tmp'
    with open(input.adar_txt) as infile:
        infile.readline()  # skip the header line
        with open(tmp_bed, 'w') as outfile:
            for line in infile:
                line = line.strip().split()
                # convert the 1-based position to a 0-based BED interval
                chrom, start, end = line[0], int(line[1]) - 1, line[1]
                outfile.write('\t'.join([chrom, str(start), end]) + '\n')
    shell('bedtools sort -i {bed} > {output}; rm {bed}'.format(bed=tmp_bed, output=output.adar_bed))
# Remove variants that overlap known ADAR editing sites
shell:
    'bedtools intersect -header -v -a {input.vcf} -b {input.adar_bed} > {output}'
# Keep biallelic SNVs (no indels) and drop homozygous (0/0, 1/1) genotype records
run:
    command = "vcftools --vcf {invcf} --min-alleles 2 --max-alleles 2 " \
              "--remove-indels --recode --stdout | grep -Ev '1/1:|0/0:' | grep -v '1|1:' | grep -v '0|0:' " \
              "> {outvcf}; ".format(invcf=input, outvcf=output)
    print(command)
    shell(command)
Support
- Future updates