RNAseq Differential Expression Analysis Workflow with STAR and DESeq2
This workflow performs a differential expression analysis using STAR and DESeq2. It performs a wide range of quality control steps and bundles their results in a QC report via MultiQC. Reported results include:
- PCA plot of all samples
- differentially up- and down-regulated genes for each analysis (i.e. treatment vs control)
- MA plot for each analysis
Usage
Step 1: Configure workflow
You will likely run these analyses on the HPCC (currently Elzar), but you will want access to the results locally. Our strategy is to have the analysis repository on the local machine and on the HPCC, in the same file path location (relative to the home directory).
We will start by creating the local repository (on your computer):
- Click on 'Use this template', which will copy the content of this repository to a new remote repository on GitHub. Find a good name for your analysis and use it as the name for the repository;
- Copy the git link from this new repository, move to the parent folder for your RNA-seq analysis (i.e. where you want to save your new analyses/results) and create a local repository on your computer using `git clone`;
- Move into the newly created directory (repository);
- Configure the workflow by editing the file `config.yaml`:
  - Choose mouse or human for the STAR index;
  - Modify the design for DESeq2;
- Configure the `samples.txt` and `units.txt` files with project-specific annotations and files;
  - Hint: check for rogue leading/trailing spaces in your file paths/sample names etc.;
- Git push changes to master (see the example session after this list).
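A minimal example session for the local setup; the repository name, user, and parent folder below are hypothetical placeholders:

```bash
# parent folder for your RNA-seq analyses (hypothetical location)
cd ~/projects

# clone the repository you created from the template (placeholder URL)
git clone git@github.com:<your-user>/<your-analysis>.git
cd <your-analysis>

# edit the configuration and the sample/unit annotations
$EDITOR config.yaml samples.txt units.txt

# commit and push the configured analysis
git add config.yaml samples.txt units.txt
git commit -m "configure analysis"
git push origin master
```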
We will now set up the analysis on the HPCC:
- Copy the git link again and create a repository on the cluster (same path, see above) using `git clone`;
- Move into the newly created directory (repository);
- Create a `reads` directory and copy over the reads (`fastq.gz` files) from the sequencing facility (see the sketch after this list).
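On the HPCC, the corresponding commands might look like the following; the path to the sequencing facility data is a hypothetical placeholder. Note that the clone ends up in the same location relative to the home directory as on your local machine:

```bash
# same path relative to the home directory as on the local machine
cd ~/projects
git clone git@github.com:<your-user>/<your-analysis>.git
cd <your-analysis>

# stage the raw reads inside the repository
mkdir reads
cp /path/to/facility/delivery/*.fastq.gz reads/
```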
To run the analysis, activate your conda snakemake environment:

```bash
conda activate snakemake
```
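If you do not have such an environment yet, here is a minimal sketch for creating one, assuming the conda-forge and bioconda channels are available:

```bash
conda create -n snakemake -c conda-forge -c bioconda snakemake
conda activate snakemake
```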
Step 2: Execute workflow
Test your configuration by performing a dry-run via

```bash
snakemake --use-conda -n
```

Execute the workflow locally via

```bash
snakemake --use-conda --cores $N
```

using `$N` cores, or run it on the UGE cluster environment via

```bash
snakemake --use-conda --profile uge
```
Step 3: Investigate results
After successful execution, you can create a self-contained interactive HTML report via:
```bash
snakemake --report report.html
```
- To look at the results, use secure copy (`scp`) to transfer the `report.html` and the quality control report `qc/multiqc.html` from the HPCC to your local computer (this is where storing them in the same location relative to the home directory comes in handy; see the example below);
- explore the results;
- forward them to your collaborators.
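A minimal sketch of the transfer, assuming a hypothetical login node and the mirrored directory layout described above (run from the analysis directory on your local machine):

```bash
mkdir -p qc
scp <user>@<hpcc-login-node>:projects/<your-analysis>/report.html .
scp <user>@<hpcc-login-node>:projects/<your-analysis>/qc/multiqc.html qc/
```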
This workflow was kindly provided by the Meyer lab.
Code Snippets
```python
shell:
    """
    STAR \
    {params.extra} \
    --runThreadN {threads} \
    --runMode alignReads \
    --genomeDir {params.index} \
    --readFilesIn {input.fq1} {input.fq2} {params.readcmd} \
    --outReadsUnmapped Fastq \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix star/{wildcards.sample}-{wildcards.unit}/
    """
```
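Note that the command above does not itself request gene-level counting; the per-sample `ReadsPerGene.out.tab` files consumed by `count-matrix.py` below presumably come from STAR's `--quantMode GeneCounts` option, passed in via `{params.extra}` from the workflow configuration.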
```python
script:
    "../scripts/count-matrix.py"
```

```python
script:
    "../scripts/deseq2-init.R"
```

```python
script:
    "../scripts/plot-pca.R"
```

```python
script:
    "../scripts/deseq2.R"
```

```python
script:
    "../scripts/gtf2bed.py"
```
```python
shell:
    """
    junction_annotation.py {params.extra} \
    -i {input.bam} \
    -r {input.bed} \
    -o {params.prefix} > {log[0]} 2>&1
    """
```

```python
shell:
    """
    junction_saturation.py {params.extra} \
    -i {input.bam} \
    -r {input.bed} \
    -o {params.prefix} > {log} 2>&1
    """
```

```python
shell:
    "bam_stat.py -i {input} > {output} 2> {log}"
```

```python
shell:
    "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
```

```python
shell:
    """
    inner_distance.py \
    -r {input.bed} \
    -i {input.bam} \
    -o {params.prefix} > {log} 2>&1
    """
```

```python
shell:
    "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
```

```python
shell:
    "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1"
```

```python
shell:
    "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1"
```
```python
shell:
    """
    multiqc \
    --force \
    --export \
    --outdir qc \
    --filename multiqc_report.html \
    logs/rseqc/rseqc_junction_annotation qc/rseqc star > {log}
    #{input} > {log}
    """
```
```python
shell:
    """
    cutadapt \
    {params.adapters} \
    {params.others} \
    -o {output.fastq1} \
    -p {output.fastq2} \
    -j {threads} \
    {input} \
    > {output.qc}
    """
```

```python
shell:
    """
    cutadapt \
    {params.adapters} \
    {params.others} \
    -o {output.fastq} \
    -j {threads} \
    {input} \
    > {output.qc}
    """
```
```python
import pandas as pd


def get_column(strandedness):
    if pd.isnull(strandedness) or strandedness == "none":
        return 1  # non stranded protocol
    elif strandedness == "yes":
        return 2  # 3rd column
    elif strandedness == "reverse":
        return 3  # 4th column, usually for Illumina truseq
    else:
        raise ValueError(("'strandedness' column should be empty or have the "
                          "value 'none', 'yes' or 'reverse', instead has the "
                          "value {}").format(repr(strandedness)))


counts = [pd.read_table(f, index_col=0, usecols=[0, get_column(strandedness)],
                        header=None, skiprows=4)
          for f, strandedness in zip(snakemake.input, snakemake.params.strand)]

for t, sample in zip(counts, snakemake.params.samples):
    t.columns = [sample]

matrix = pd.concat(counts, axis=1)
matrix.index.name = "gene"
# collapse technical replicates
if len(set(matrix.columns)) != len(matrix.columns):
    matrix = matrix.groupby(matrix.columns, axis=1).sum()
matrix.to_csv(snakemake.output[0], sep="\t")
```
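For context, the column logic above matches STAR's per-gene count files: column 2 holds unstranded counts, column 3 forward-stranded counts, and column 4 reverse-stranded counts, while the first four rows are summary counts (hence `skiprows=4`). A sketch of what the head of such a file looks like; the path, gene IDs, and numbers are hypothetical:

```bash
head -6 star/sampleA-1/ReadsPerGene.out.tab
# N_unmapped            12345   12345   12345
# N_multimapping        2345    2345    2345
# N_noFeature           345     1234    456
# N_ambiguous           45      45      45
# ENSMUSG00000000001    100     52      48
# ENSMUSG00000000003    0       0       0
```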
```r
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("tidyverse")
library("AnnotationHub")
library("DESeq2")

parallel <- FALSE
if (snakemake@threads > 1) {
    library("BiocParallel")
    # setup parallelization
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

message("Reading counts")
cts <- read.table(snakemake@input[["counts"]], header=TRUE, row.names="gene",
                  check.names=FALSE)

message("Reading sample file")
coldata <- read.table(snakemake@params[["samples"]], header=TRUE,
                      row.names="sample", check.names=FALSE, sep="\t")

message("Getting experimental design")
design <- as.formula(snakemake@params[["design"]])

# colData and countData must have the same sample order
if (nrow(coldata) != ncol(cts)) {
    stop("Number of samples in sample sheet and number of samples in counts ",
         "matrix is not the same")
}
cts <- cts[, match(rownames(coldata), colnames(cts))]

if (any(c("control", "Control", "CONTROL") %in% levels(coldata$condition))) {
    if ("control" %in% levels(coldata$condition)) {
        coldata$condition <- relevel(coldata$condition, "control")
    } else if ("Control" %in% levels(coldata$condition)) {
        coldata$condition <- relevel(coldata$condition, "Control")
    } else {
        coldata$condition <- relevel(coldata$condition, "CONTROL")
    }
}

dds <- DESeqDataSetFromMatrix(countData=cts, colData=coldata, design=design)

# remove uninformative columns
dds <- dds[rowSums(counts(dds)) > 1, ]

# normalization and preprocessing
dds <- DESeq(dds, parallel=parallel)

# Remove build number on ENS gene id
rownames(dds) <- gsub("\\.\\d*", "", rownames(dds))

# Annotate by gene names
hub <- AnnotationHub()
# query(hub, c("GTF", "Ensembl", "Mus musculus")) "AH7799"
# query(hub, c("GTF", "Ensembl", "Homo sapiens")) "AH69461"
if (snakemake@params[["annotationhub"]] == "mouse") {
    hubid <- "AH7799"
} else if (snakemake@params[["annotationhub"]] == "human") {
    hubid <- "AH69461"
} else {
    stop("No annotation hub specified for organism: ",
         snakemake@params[["annotationhub"]])
}
anno <- hub[[hubid]]
genemap <- tibble(gene_id=anno$gene_id, symbol=anno$gene_name) %>%
    distinct()
featureData <- tibble(gene_id=rownames(dds)) %>%
    left_join(genemap, by="gene_id") %>%
    mutate(symbol=case_when(is.na(symbol) ~ gene_id, TRUE ~ symbol)) %>%
    select(symbol)
mcols(dds) <- DataFrame(mcols(dds), featureData)

saveRDS(dds, file=snakemake@output[[1]])
```
```r
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

#################
## libraries ####
#################
library("DESeq2")
library("tidyverse")

#################
## functions ####
#################
genes_de <- function(deset, thrP=0.05, thrLog2FC=log2(1.5),
                     direction=c('up', 'down', 'any')) {
    tmp <- deset %>%
        as.data.frame %>%
        rownames_to_column(var="gene_id")
    if (direction == "up") {
        tmp <- tmp %>%
            dplyr::filter(padj < thrP, log2FoldChange > thrLog2FC)
    } else if (direction == "down") {
        # down-regulated genes have log2FC below the negated threshold
        tmp <- tmp %>%
            dplyr::filter(padj < thrP, log2FoldChange < -thrLog2FC)
    } else if (direction == "any") {
        tmp <- tmp %>%
            dplyr::filter(padj < thrP)
    } else {
        stop(direction, " is not a valid option for direction")
    }
    tmp
}

save_up_down <- function(res, setup) {
    up <- genes_de(res, direction="up")
    down <- genes_de(res, direction="down")
    write_csv(up, snakemake@output[['up']])
    write_csv(down, snakemake@output[['down']])
    return(list(up=up$gene_id, down=down$gene_id))
}

############
## data ####
############
parallel <- FALSE
if (snakemake@threads > 1) {
    library("BiocParallel")
    # setup parallelization
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

dds <- readRDS(snakemake@input[[1]])
directory <- "deseq2"

################
## analysis ####
################

## 1. Model fit ####
# Generate named coefficients needed for apeglm lfcShrink
elements <- snakemake@params[["contrast"]]
comparison <- paste(elements[1], "vs", elements[2], sep="_")
coef <- paste("condition", elements[1], "vs", elements[2], sep="_")

# Relevel for reference to second element in contrasts
dds$condition <- relevel(dds$condition, elements[2])
dds <- nbinomWaldTest(dds)

## 2. Process results ####
# Extract coefficient-specific results
res <- results(dds, name=coef, parallel=parallel)

# shrink fold changes for lowly expressed genes
res <- lfcShrink(dds, coef=coef, res=res, type='apeglm')

# add gene names to results object
res$gene_name <- mcols(dds)$symbol

# sort by p-value
res <- res[order(res$padj), ]

## 3. Summarise results ####
## a) All genes for all groups ####
res_format <- res %>%
    as.data.frame() %>%
    rownames_to_column(var="gene_id") %>%
    as_tibble() %>%
    rename_at(vars(-gene_id, -gene_name), ~ paste0(., "_", comparison))

# normalised expression values
rld <- rlog(dds, blind=FALSE)
deg_genes <- assay(rld)

combined <- deg_genes %>%
    as.data.frame() %>%
    rownames_to_column(var="gene_id") %>%
    as_tibble() %>%
    right_join(res_format, by="gene_id") %>%
    dplyr::select(gene_id, gene_name, everything())

write_csv(combined, snakemake@output[["table"]])

## b) Up/Down genes ####
genes_up_down <- save_up_down(res=res, setup=coef)

## 4. Visualise results ####
svg(snakemake@output[["ma_plot"]])
plotMA(res, ylim=c(-2, 2))
dev.off()

pdf(snakemake@output[["ma_pdf"]])
plotMA(res, ylim=c(-2, 2))
dev.off()
```
```python
import gffutils

db = gffutils.create_db(snakemake.input[0],
                        dbfn=snakemake.output.db,
                        force=True,
                        keep_order=True,
                        merge_strategy='merge',
                        sort_attribute_values=True,
                        disable_infer_genes=True,
                        disable_infer_transcripts=True)

with open(snakemake.output.bed, 'w') as outfileobj:
    for tx in db.features_of_type('transcript', order_by='start'):
        bed = [s.strip() for s in db.bed12(tx).split('\t')]
        bed[3] = tx.id
        outfileobj.write('{}\n'.format('\t'.join(bed)))
```
```r
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")
library("ggplot2")
library("cowplot")
library("limma")
library("ggrepel")

dds <- readRDS(snakemake@input[[1]])

pca_color <- snakemake@params[['color']]
pca_fill <- snakemake@params[['fill']]

if (all(c(pca_color, pca_fill) != "")) {
    intgroup <- c(pca_color, pca_fill)
} else if (pca_color != "") {
    intgroup <- pca_color
} else if (pca_fill != "") {
    intgroup <- pca_fill
} else {
    stop("At least one of fill or color have to be specified")
}

# obtain normalized counts
counts <- rlog(dds, blind=FALSE)
#assay(counts) <- limma::removeBatchEffect(assay(counts), counts$individual)

pcaData <- plotPCA(counts, intgroup = intgroup, returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))

if (all(c(pca_color, pca_fill) != "")) {
    color_sym = sym(pca_color)
    fill_sym = sym(pca_fill)
    # unquote the symbols (not the character strings) inside aes()
    p <- ggplot(pcaData, aes(x = PC1, y = PC2, color = !!color_sym,
                             fill = !!fill_sym))
    p <- p + geom_point(pch=22, size=3, stroke=2)
    p <- p + geom_text(aes(label=name), check_overlap=TRUE)
} else {
    intgroup_sym = sym(intgroup)
    p <- ggplot(pcaData, aes(x = PC1, y = PC2, color = !!intgroup_sym))
    p <- p + geom_point() + scale_color_brewer(type="qual", palette="Dark2")
    p <- p + geom_text(aes(label=name), check_overlap=TRUE)
}
p <- p + labs(x=paste0("PC1: ", percentVar[1], "% variance"),
              y=paste0("PC2: ", percentVar[2], "% variance")) +
    coord_fixed() +
    theme_cowplot() +
    theme(legend.position = "bottom", legend.justification = 0)

ggsave(plot=p, height=4.5, width=7.5, filename = "results/pca.pdf")
ggsave(plot=p, height=4.5, width=7.5, filename = "results/pca.svg")
```