RASflow: RNA-Seq Analysis Snakemake Workflow
RASflow is a modular, flexible and user-friendly RNA-Seq analysis workflow.
RASflow can be applied to both model and non-model organisms. It supports mapping RNA-Seq raw reads to either a genome or a transcriptome (downloaded from a public database or prepared by the user), and it can perform both transcript- and gene-level Differential Expression Analysis (DEA) when a transcriptome is used as the mapping reference. Basic use requires little programming skill, and if you're good at programming you can do even more with RASflow!
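For orientation, the scripts shown below read their settings from configs/config_main.yaml. The following is a hedged, minimal sketch of that file covering only the keys those scripts and rules access; the values are illustrative placeholders, not the defaults shipped with RASflow.

PROJECT: my_project            # project name, used to build output paths
FINALOUTPUT: /data/output      # root directory of all result folders
METAFILE: configs/metadata.tsv # tab-separated table with columns: sample, group, subject
NCORE: 8                       # number of cores passed to mappers and counters
END: pair                      # paired-end or single-end reads (value used by the counting rule)
COUNTER: featureCounts         # or htseq-count
ATTRIBUTE: gene_id             # GTF attribute used for counting
ANNOTATION: /data/annotation.gtf
CONTROL: ["untreated"]         # control group(s), one entry per comparison
TREAT: ["treated"]             # treat group(s), matched to CONTROL by position
DEATOOL: edgeR                 # or DESeq2
PAIR: False                    # paired (subject-matched) design or not
FILTER:
  yesOrNo: True                # filter lowly expressed genes before DEA
FDR: 0.05                      # adjusted p-value threshold for DEGs
GENE_LEVEL: True               # also run gene-level DEA when mapping to a transcriptome
ENSEMBL: True                  # fetch tx2gene from Ensembl via biomaRt
EnsemblDataSet: hsapiens_gene_ensembl
TX2GENE: configs/tx2gene.tsv   # used instead of biomaRt when ENSEMBL is False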
Code Snippets
import yaml
import pandas as pd
import numpy as np
# import ipdb

def load_globals():
    global samples; global groups; global config; global input_path; global output_path
    with open('configs/config_main.yaml') as yamlfile:
        config = yaml.load(yamlfile)
    samples = np.array(pd.read_table(config["METAFILE"], header = 0)['sample'])
    groups = np.array(pd.read_table(config["METAFILE"], header = 0)['group'])
    input_path = config["FINALOUTPUT"] + "/" + config["PROJECT"] + "/genome/countFile"
    output_path = config["FINALOUTPUT"] + "/" + config["PROJECT"] + "/genome/dea"

def main():
    load_globals()
    groups_unique = np.unique(groups)
    num_samples = len(samples)
    num_groups_unique = len(groups_unique)
    id_list = get_id_list()
    for name_group in groups_unique:
        combine_group(name_group, id_list)

def combine_group(name_group, id_list):
    index_group = [index for index, element in enumerate(groups) if element == name_group]  # all the indices of this group
    samples_group = samples[index_group]  # the samples belonging to this group
    group_count = pd.DataFrame()
    group_count['ID'] = id_list
    for sample in samples_group:
        group_count["sample_" + str(sample)] = np.array(pd.read_table(input_path + "/" + str(sample) + "_count.tsv", header = None))[:, 1]
    # write to file
    group_count.to_csv(output_path + "/countGroup/" + str(name_group) + "_gene_count.tsv", sep = "\t", header = True, index = False)

def get_id_list():
    # extract the id list from the count file
    id_list = np.array(pd.read_table(input_path + '/' + str(samples[0]) + "_count.tsv", header = None))[:, 0]
    return id_list

if __name__ == "__main__":
    main()
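The Python script above (apparently the scripts/combine2group_genome.py helper invoked by a rule further down) merges per-sample count files into one table per group. It and the R scripts below all read the metadata table named by METAFILE, which is expected to be a tab-separated file with one row per sample and the columns sample, group and subject (subject is only used for paired designs). A hypothetical example, with made-up sample names:

sample	group	subject
A1	untreated	1
A2	untreated	2
B1	treated	1
B2	treated	2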
library(biomaRt)
library(yaml)
library(tximport)

# ====================== define some functions ======================
# remove the version in the transcript ID
remove_version <- function(files) {
  # input files (file names with directory) are output from Salmon
  for (i in c(1:length(files))) {
    quant.file <- files[i]
    quant.table <- read.table(quant.file, header = TRUE, stringsAsFactors = FALSE)
    trans.id.version <- quant.table$Name
    trans.id <- rep('ID', length(trans.id.version))
    for (j in c(1:length(trans.id.version))) {
      trans.id[j] <- strsplit(trans.id.version[j], ".", fixed = TRUE)[[1]][1]
    }
    quant.table$Name <- trans.id
    write.table(quant.table, file.path(input.path, samples[i], "quant_noVersion.sf"), sep = "\t", quote = FALSE, row.names = FALSE)
  }
  return(trans.id)
}

# ====================== load parameters in config file ======================
# load the config file
yaml.file <- yaml.load_file('configs/config_main.yaml')

# extract the information from the yaml file
project <- yaml.file$PROJECT  # project name
input.path <- file.path(yaml.file$FINALOUTPUT, project, "trans/quant")
gene.level <- yaml.file$GENE_LEVEL
controls <- yaml.file$CONTROL  # all groups used as control
treats <- yaml.file$TREAT  # all groups used as treat, should correspond to control
meta.file <- yaml.file$METAFILE
ENSEMBL <- yaml.file$ENSEMBL
dataset <- yaml.file$EnsemblDataSet
tx2gene.file <- yaml.file$TX2GENE
output.path <- file.path(yaml.file$FINALOUTPUT, project, "trans/dea")

num.control <- length(controls)  # number of comparisons that the user wants to do
num.treat <- length(treats)  # should equal num.control

if (num.control != num.treat) {
  message("Error: Control groups don't match with treat groups!")
  message("Please check config_dea.yaml")
  quit(save = 'no')
}

# extract the metadata
meta.data <- read.table(meta.file, header = TRUE, sep = '\t', stringsAsFactors = FALSE)
samples <- meta.data$sample
group.all <- meta.data$group
subject.all <- meta.data$subject

# ====================== prepare the quant files ======================
# the original quant files from Salmon
files <- file.path(input.path, samples, "quant.sf")
names(files) <- samples

# get the normalized abundance tables for all samples (scaled using the average transcript length over samples and then the library size (lengthScaledTPM))

# ====================== prepare the tx2gene table ======================
if (gene.level) {
  if (ENSEMBL) {
    ensembl <- useEnsembl(biomart = "ensembl", dataset = dataset)
    datasets <- listDatasets(ensembl)
    attributes <- listAttributes(mart = ensembl)
    # remove the version to get more matches
    trans.id <- remove_version(files)
    files.noVersion <- file.path(input.path, samples, "quant_noVersion.sf")
    names(files.noVersion) <- samples
    tx2gene <- getBM(attributes = c('ensembl_transcript_id', 'ensembl_gene_id'), filters = 'ensembl_transcript_id', values = trans.id, mart = ensembl)
  } else {
    tx2gene <- read.csv(tx2gene.file, header = FALSE, sep = "\t")
  }
  # save tx2gene
  output.file.tx2gene <- file.path(output.path, "countGroup", 'tx2gene.RData')
  save(tx2gene, file = output.file.tx2gene)
}

# ====================== get raw and normalized abundance tables ======================
trans.matrix <- tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "no")
trans.count <- trans.matrix$counts
trans.matrix.tpm <- tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "lengthScaledTPM")
trans.count.tpm <- trans.matrix.tpm$counts

if (gene.level) {
  if (ENSEMBL) {
    gene.matrix <- tximport(files.noVersion, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "no")
    gene.count <- gene.matrix$counts
    gene.matrix.tpm <- tximport(files.noVersion, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
    gene.count.tpm <- gene.matrix.tpm$counts
  } else {
    gene.matrix <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "no")
    gene.count <- gene.matrix$counts
    gene.matrix.tpm <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
    gene.count.tpm <- gene.matrix.tpm$counts
  }
}

# ====================== combine the samples to groups ======================
for (group in unique(group.all)) {
  index.group = which(group.all == group)  # all the indices of this group
  samples.group = samples[index.group]  # the samples belonging to this group
  group.count.trans <- trans.count[, index.group]
  group.count.trans.tpm <- trans.count.tpm[, index.group]
  if (gene.level) {
    group.count.gene <- gene.count[, index.group]
    group.count.gene.tpm <- gene.count.tpm[, index.group]
  }
  # write to files
  output.file.trans <- file.path(output.path, "countGroup", paste(group, "_trans_abundance.tsv", sep = ""))
  write.table(group.count.trans, output.file.trans, sep = '\t', quote = FALSE, row.names = TRUE, col.names = TRUE)
  output.file.trans.norm <- file.path(output.path, "countGroup", paste(group, "_trans_norm.tsv", sep = ""))
  write.table(group.count.trans.tpm, output.file.trans.norm, sep = '\t', quote = FALSE, row.names = TRUE, col.names = TRUE)
  if (gene.level) {
    output.file.gene <- file.path(output.path, "countGroup", paste(group, "_gene_abundance.tsv", sep = ""))
    write.table(group.count.gene, output.file.gene, sep = '\t', quote = FALSE, row.names = TRUE, col.names = TRUE)
    output.file.gene.norm <- file.path(output.path, "countGroup", paste(group, "_gene_norm.tsv", sep = ""))
    write.table(group.count.gene.tpm, output.file.gene.norm, sep = '\t', quote = FALSE, row.names = TRUE, col.names = TRUE)
  }
}
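In the script above, the tx2gene table built by getBM has the two columns ensembl_transcript_id and ensembl_gene_id. When a custom TX2GENE file is supplied instead (ENSEMBL set to False), it is read as a headerless two-column tab-separated file with the transcript ID first and the gene ID second, which is the order tximport expects. An illustrative sketch of such a file:

ENST00000456328	ENSG00000223972
ENST00000450305	ENSG00000223972
ENST00000488147	ENSG00000227232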
library(yaml)
library(edgeR)
library(DESeq2)

# ====================== define the function of DEA ======================
DEA <- function(control, treat) {
  count.control <- read.table(paste(output.path, '/countGroup/', control, '_gene_count.tsv', sep = ''), header = TRUE, row.names = 1)
  count.treat <- read.table(paste(output.path, '/countGroup/', treat, '_gene_count.tsv', sep = ''), header = TRUE, row.names = 1)
  count.table <- cbind(count.control, count.treat)  # merge the control and treat tables together

  # number of samples in control and treat groups (should be the same if it's a pair test)
  num.sample <- ncol(count.table)
  num.sample.control <- ncol(count.control)
  num.sample.treat <- ncol(count.treat)

  # samples of the two groups
  sample.control <- colnames(count.control)
  sample.treat <- colnames(count.treat)

  # save gene list in gene.list for extracting gene names later
  gene.list <- rownames(count.table)

  # get the sample id
  samples <- colnames(count.table)

  # define the subject and group; define the reference group (control)
  subject <- factor(subject.all[c(which(group.all == control), which(group.all == treat))])
  group <- relevel(factor(group.all[c(which(group.all == control), which(group.all == treat))]), ref = control)

  # The design matrix
  if (pair.test) {
    design <- model.matrix(~subject+group)
  } else {
    design <- model.matrix(~group)
  }

  if (dea.tool == 'edgeR') {  # use edgeR for DEA
    # normalize the two groups and save the normalized count tables
    y.control <- DGEList(counts = count.control, genes = gene.list)
    y.treat <- DGEList(counts = count.treat, genes = gene.list)
    y.control <- calcNormFactors(y.control, method="TMM")
    count.table.control.norm <- cpm(y.control)
    write.table(count.table.control.norm, paste(output.path, '/countGroup/', control, '_gene_norm.tsv', sep = ''), quote = FALSE, sep = "\t")
    y.treat <- calcNormFactors(y.treat, method="TMM")
    count.table.treat.norm <- cpm(y.treat)
    write.table(count.table.treat.norm, paste(output.path, '/countGroup/', treat, '_gene_norm.tsv', sep = ''), quote = FALSE, sep = "\t")

    # Put the data into a DGEList object
    y <- DGEList(counts = count.table, genes = gene.list)

    # do DEA
    # Filtering
    if (filter.need) {
      countsPerMillion <- cpm(y)
      countCheck <- countsPerMillion > 1
      keep <- which(rowSums(countCheck) > 1)
      y <- y[keep, ]
    }

    # Normalization
    y <- calcNormFactors(y, method="TMM")
    y$samples$subject <- subject
    y$samples$group <- group
    rownames(design) <- colnames(y)

    # Estimating the dispersion
    # estimate the NB dispersion for the dataset
    y <- estimateDisp(y, design, robust = TRUE)

    # Differential expression
    # determine differentially expressed genes
    # fit genewise glms
    fit <- glmFit(y, design)
    # conduct likelihood ratio tests for treat vs control differences and show the top genes
    lrt <- glmLRT(fit)

    # the DEA result for all the genes
    # dea <- lrt$table
    toptag <- topTags(lrt, n = nrow(y$genes), p.value = 1)
    dea <- toptag$table  # just to add one more column of FDR
    dea <- dea[order(dea$FDR, -abs(dea$logFC), decreasing = FALSE), ]  # sort the table: ascending FDR then descending absolute value of logFC

    # differentially expressed genes
    toptag <- topTags(lrt, n = nrow(y$genes), p.value = fdr.thr)
    deg <- toptag$table
    if (!is.null(deg)) {
      deg <- deg[order(deg$FDR, -abs(deg$logFC), decreasing = FALSE), ]  # sort the table: ascending FDR then descending absolute value of logFC
    }

    # save the DEA result and DEGs to files
    write.table(dea, paste(output.path, '/DEA/dea_', control, '_', treat, '.tsv', sep = ''), row.names = F, quote = FALSE, sep = '\t')
    write.table(deg, paste(output.path, '/DEA/deg_', control, '_', treat, '.tsv', sep = ''), row.names = F, quote = FALSE, sep = '\t')
  } else if (dea.tool == "DESeq2") {  # use DESeq2 for DEA
    ## create the DESeqDataSet
    colData = data.frame(samples, subject, group)
    dds <- DESeqDataSetFromMatrix(count.table, colData = colData, design = design)

    # generate normalized counts
    dds <- estimateSizeFactors(dds)
    normalized_counts <- counts(dds, normalized=TRUE)
    normalized_counts.control <- normalized_counts[, which(colnames(normalized_counts) %in% sample.control)]
    write.table(normalized_counts.control, paste(output.path, '/countGroup/', control, '_gene_norm.tsv', sep = ''), quote = FALSE, sep = "\t")
    normalized_counts.treat <- normalized_counts[, which(colnames(normalized_counts) %in% sample.treat)]
    write.table(normalized_counts.treat, paste(output.path, '/countGroup/', treat, '_gene_norm.tsv', sep = ''), quote = FALSE, sep = "\t")

    ## Filtering
    if (filter.need) {
      keep <- rowSums(counts(dds)) >= 10
      dds <- dds[keep,]
    }

    ## perform DEA
    dds <- DESeq(dds)

    ## export the results
    res.dea <- results(dds)
    res.dea <- res.dea[complete.cases(res.dea), ]  # remove any rows with NA
    dea <- as.data.frame(res.dea)
    dea <- dea[order(dea$padj, -abs(dea$log2FoldChange), decreasing = FALSE), ]  # sort the table: ascending padj then descending absolute value of log2FoldChange
    deg <- dea[dea$padj < fdr.thr, ]
    if (nrow(deg) > 1) {
      deg <- deg[order(deg$padj, -abs(deg$log2FoldChange), decreasing = FALSE), ]  # sort the table: ascending padj then descending absolute value of log2FoldChange
    }

    # save the DEA result and DEGs to files
    write.table(dea, paste(output.path, '/DEA/dea_', control, '_', treat, '.tsv', sep = ''), row.names = T, quote = FALSE, sep = '\t')
    write.table(deg, paste(output.path, '/DEA/deg_', control, '_', treat, '.tsv', sep = ''), row.names = T, quote = FALSE, sep = '\t')
  }
}

# load the config file
yaml.file <- yaml.load_file('configs/config_main.yaml')

# extract the information from the yaml file
project <- yaml.file$PROJECT  # project name of this analysis
dea.tool <- yaml.file$DEATOOL  # tool used for DEA
controls <- yaml.file$CONTROL  # all groups used as control
treats <- yaml.file$TREAT  # all groups used as treat, should correspond to control
filter.need <- yaml.file$FILTER$yesOrNo
pair.test <- yaml.file$PAIR
fdr.thr <- yaml.file$FDR  # threshold of FDR/adjusted P-value for significantly differentially expressed genes
meta.file <- yaml.file$METAFILE
output.path <- file.path(yaml.file$FINALOUTPUT, project, "genome/dea")

# extract the metadata
meta.data <- read.csv(meta.file, header = TRUE, sep = '\t')
group.all <- meta.data$group
subject.all <- meta.data$subject

num.control <- length(controls)  # number of comparisons that the user wants to do
num.treat <- length(treats)  # should equal num.control

if (num.control != num.treat) {
  message("Error: Control groups don't match with treat groups!")
  message("Please check config_dea.yaml")
  quit(save = 'no')
}

num.comparison <- num.control

# Do DEA
# the main loop
for (ith.comparison in c(1:num.comparison)) {
  control <- controls[ith.comparison]
  treat <- treats[ith.comparison]
  DEA(control, treat)
}
library(biomaRt)
library(yaml)
library(edgeR)
library(DESeq2)
library(tximport)

# ====================== define the function of DEA ======================
DEA <- function(control, treat, file.control, file.treat, output.path.dea) {
  count.control <- read.table(file.control, header = TRUE, row.names = 1, check.names = FALSE)
  count.treat <- read.table(file.treat, header = TRUE, row.names = 1, check.names = FALSE)
  count.table <- cbind(count.control, count.treat)  # merge the control and treat tables together

  # number of samples in control and treat groups (should be the same if it's a pair test)
  num.sample.control <- ncol(count.control)
  num.sample.treat <- ncol(count.treat)

  # save gene list in gene.list for extracting gene names later
  gene.list <- rownames(count.table)

  # get the sample id
  samples <- colnames(count.table)

  # define the subject and group; define the reference group (control)
  subject <- factor(subject.all[c(which(group.all == control), which(group.all == treat))])
  group <- relevel(factor(group.all[c(which(group.all == control), which(group.all == treat))]), ref = control)

  # The design matrix
  if (pair.test) {
    design <- model.matrix(~subject+group)
  } else {
    design <- model.matrix(~group)
  }

  if (dea.tool == 'edgeR') {  # use edgeR for DEA
    # Put the data into a DGEList object
    y <- DGEList(counts = count.table, genes = gene.list)

    # Filtering
    if (filter.need) {
      countsPerMillion <- cpm(y)
      countCheck <- countsPerMillion > 1
      keep <- which(rowSums(countCheck) > 1)
      y <- y[keep, ]
    }

    # Normalization
    y <- calcNormFactors(y, method="TMM")
    y$samples$subject <- subject
    y$samples$group <- group
    rownames(design) <- colnames(y)

    # Estimating the dispersion
    # estimate the NB dispersion for the dataset
    y <- estimateDisp(y, design, robust = TRUE)

    # Differential expression
    # determine differentially expressed genes
    # fit genewise glms
    fit <- glmFit(y, design)
    # conduct likelihood ratio tests for treat vs control differences and show the top genes
    lrt <- glmLRT(fit)

    # the DEA result for all the genes
    # dea <- lrt$table
    toptag <- topTags(lrt, n = nrow(y$genes), p.value = 1)
    dea <- toptag$table  # just to add one more column of FDR
    dea <- dea[order(dea$FDR, -abs(dea$logFC), decreasing = FALSE), ]  # sort the table: ascending FDR then descending absolute value of logFC

    # differentially expressed genes
    toptag <- topTags(lrt, n = nrow(y$genes), p.value = fdr.thr)
    deg <- toptag$table
    if (!is.null(deg)) {
      deg <- deg[order(deg$FDR, -abs(deg$logFC), decreasing = FALSE), ]  # sort the table: ascending FDR then descending absolute value of logFC
    }

    # save the DEA result and DEGs to files
    write.table(dea, paste(output.path.dea, '/dea_', control, '_', treat, '.tsv', sep = ''), row.names = F, quote = FALSE, sep = '\t')
    write.table(deg, paste(output.path.dea, '/deg_', control, '_', treat, '.tsv', sep = ''), row.names = F, quote = FALSE, sep = '\t')
  } else if (dea.tool == "DESeq2") {  # use DESeq2 for DEA
    ## prepare txi
    if (gene.level & gene.level.flag) {
      ### load tx2gene
      load(file.path(output.path, "countGroup", 'tx2gene.RData'))
      if (ENSEMBL) {
        files.noVersion <- file.path(quant.path, samples, "quant_noVersion.sf")
        names(files.noVersion) <- samples
        ### import quantification as txi
        txi <- tximport(files.noVersion, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "no")
      } else {
        ### the original quant files from Salmon
        files <- file.path(quant.path, samples, "quant.sf")
        names(files) <- samples
        ### import quantification as txi
        txi <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "no")
      }
    } else {
      ### the original quant files from Salmon
      files <- file.path(quant.path, samples, "quant.sf")
      names(files) <- samples
      ### import quantification as txi
      txi <- tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "no")
    }

    ## create the DESeqDataSet
    colData = data.frame(samples, subject, group)
    dds <- DESeqDataSetFromTximport(txi, colData = colData, design = design)

    # Filtering
    if (filter.need) {
      keep <- rowSums(counts(dds)) >= 10
      dds <- dds[keep,]
    }

    ## specify the control group
    dds$group <- relevel(dds$group, ref = control)

    ## perform DEA
    dds <- DESeq(dds)

    ## export the results
    res.dea <- results(dds)
    res.dea <- res.dea[complete.cases(res.dea), ]  # remove any rows with NA
    dea <- as.data.frame(res.dea)
    dea <- dea[order(dea$padj, -abs(dea$log2FoldChange), decreasing = FALSE), ]  # sort the table: ascending padj then descending absolute value of log2FoldChange
    deg <- dea[dea$padj < fdr.thr, ]
    if (nrow(deg) > 1) {
      deg <- deg[order(deg$padj, -abs(deg$log2FoldChange), decreasing = FALSE), ]  # sort the table: ascending padj then descending absolute value of log2FoldChange
    }

    # save the DEA result and DEGs to files
    write.table(dea, paste(output.path.dea, '/dea_', control, '_', treat, '.tsv', sep = ''), row.names = T, quote = FALSE, sep = '\t')
    write.table(deg, paste(output.path.dea, '/deg_', control, '_', treat, '.tsv', sep = ''), row.names = T, quote = FALSE, sep = '\t')
  }
}

# ====================== load parameters in config file ======================
# load the config file
yaml.file <- yaml.load_file('configs/config_main.yaml')

# extract the information from the yaml file
project <- yaml.file$PROJECT  # project name
dea.tool <- yaml.file$DEATOOL  # tool used for DEA
quant.path <- file.path(yaml.file$FINALOUTPUT, project, "trans/quant")
gene.level <- yaml.file$GENE_LEVEL  # whether to do gene-level analysis
controls <- yaml.file$CONTROL  # all groups used as control
treats <- yaml.file$TREAT  # all groups used as treat, should correspond to control
filter.need <- yaml.file$FILTER$yesOrNo
pair.test <- yaml.file$PAIR
meta.file <- yaml.file$METAFILE
ENSEMBL <- yaml.file$ENSEMBL
dataset <- yaml.file$EnsemblDataSet
output.path <- file.path(yaml.file$FINALOUTPUT, project, "trans/dea")
fdr.thr <- yaml.file$FDR  # threshold of FDR/adjusted P-value for significantly differentially expressed genes

num.control <- length(controls)  # number of comparisons that the user wants to do
num.treat <- length(treats)  # should equal num.control

if (num.control != num.treat) {
  message("Error: Control groups don't match with treat groups!")
  message("Please check config_dea.yaml")
  quit(save = 'no')
}

num.comparison <- num.control

# extract the metadata
meta.data <- read.csv(meta.file, header = TRUE, sep = '\t', stringsAsFactors = FALSE)
samples <- meta.data$sample
group.all <- meta.data$group
subject.all <- meta.data$subject

# ====================== Do DEA ======================
for (ith.comparison in c(1:num.comparison)) {
  control <- controls[ith.comparison]
  treat <- treats[ith.comparison]

  # --------------------- On transcript level ---------------------
  file.control <- paste(output.path, '/countGroup/', control, '_trans_abundance.tsv', sep = '')
  file.treat <- paste(output.path, '/countGroup/', treat, '_trans_abundance.tsv', sep = '')
  output.path.dea <- paste(output.path, '/DEA/transcript-level', sep = '')
  gene.level.flag <- FALSE
  DEA(control, treat, file.control, file.treat, output.path.dea)

  # --------------------- On gene level ---------------------
  if (gene.level) {
    file.control <- paste(output.path, '/countGroup/', control, '_gene_abundance.tsv', sep = '')
    file.treat <- paste(output.path, '/countGroup/', treat, '_gene_abundance.tsv', sep = '')
    output.path.dea <- paste(output.path, '/DEA/gene-level', sep = '')
    gene.level.flag <- TRUE
    DEA(control, treat, file.control, file.treat, output.path.dea)
  }
}
awk -F'"' '{print $2"\t"$6}' $1 > $2
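A brief note on this one-liner: it splits each input line on double-quote characters and prints the second and sixth fields (i.e. the first and third quoted values) separated by a tab, reading from the first script argument and writing to the second. Judging from the rule above that runs scripts/formatCount.sh on the raw gene count file, this is presumably that helper. Assuming the upstream Java counter writes quoted CSV (an assumption, since its output format is not shown here), an input line such as "gene1","x","42" would be reduced to gene1 and 42 separated by a tab; the middle value here is a placeholder.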
if (!require("plotscale")) install.packages('scripts/plotscale_0.1.6.tar.gz', repos = NULL, type="source")
library(yaml)
library(hash)
library(mygene)
library(EnhancedVolcano)
library(plotscale)

# ====================== load parameters in config file ======================
# passing the params from command line
args <- commandArgs(TRUE)
norm.path <- args[1]
dea.path <- args[2]
out.path <- args[3]

# load the config file
if (length(args) > 3) {  # this script is used in visualize_test.rules
  yaml.file <- yaml.load_file(args[4])
} else {  # this script is used in visualize.rules
  yaml.file <- yaml.load_file('configs/config_main.yaml')
}

# extract the information from the yaml file
controls <- yaml.file$CONTROL
treats <- yaml.file$TREAT
dea.tool <- yaml.file$DEATOOL

# check the number of comparisons
num.control <- length(controls)  # number of comparisons that the user wants to do
num.treat <- length(treats)  # should equal num.control

if (num.control != num.treat) {
  message("Error: Control groups don't match with treat groups!")
  message("Please check config_dea.yaml")
  quit(save = 'no')
}

num.comparison <- num.control

convert.id2symbol <- function(gene.id) {
  gene.symbol <- gene.id  # initialize the gene symbol with the gene id
  # it may happen that no symbol can be found for any id. In that case, "queryMany" will throw an error,
  # so "try" is used here to take care of that error
  try({
    gene.symbol.all <- queryMany(gene.id, scopes = 'ensembl.gene', fields = 'symbol')
    h <- hash()
    for (i in 1:nrow(gene.symbol.all)) {
      query <- gene.symbol.all$query[i]
      symbol <- gene.symbol.all$symbol[i]
      if (has.key(query, h)) {  # if there's a duplicate for the same query
        h[[query]] <- paste(hash::values(h, keys = query), symbol, sep = ', ')
      } else {
        if (is.na(symbol)) {  # if there's no hit for the query, keep the original id
          h[[query]] <- query
        } else {
          h[[query]] <- symbol
        }
      }
    }
    for (i in c(1:length(gene.symbol))) {
      gene.symbol[i] <- h[[gene.id[i]]]
    }
  })
  return(gene.symbol)
}

# function to plot volcano plot and heatmap
plot.volcano.heatmap <- function(name.control, name.treat) {
  file.dea.table <- paste(dea.path, "/dea_", name.control, "_", name.treat, ".tsv", sep = "")
  norm.control <- paste(norm.path, "/", name.control, "_gene_norm.tsv", sep = "")  # normalized table of control
  norm.treat <- paste(norm.path, "/", name.treat, "_gene_norm.tsv", sep = "")  # normalized table of treat

  dea.table <- read.table(file.dea.table, header = TRUE, row.names = 1)
  # sort the dea table: ascending FDR then descending absolute value of logFC
  if (dea.tool == 'edgeR') {
    dea.table <- dea.table[order(dea.table$FDR, -abs(dea.table$logFC), decreasing = FALSE), ]
  } else if (dea.tool == 'DESeq2') {
    dea.table <- dea.table[order(dea.table$padj, -abs(dea.table$log2FoldChange), decreasing = FALSE), ]
  }
  gene.id.dea <- row.names(dea.table)
  # convert the gene id to gene symbol, and keep the ID if no symbol can be found for that ID
  gene.dea <- convert.id2symbol(gene.id.dea)

  # volcano plot
  dea.table.volcano <- dea.table  # for a better volcano plot, 0 FDRs/padj will be changed to a very low value
  if (dea.tool == 'edgeR') {
    # change the 0 FDR to a low value (100 times smaller than the minimum non-zero value)
    FDR <- dea.table.volcano$FDR
    FDR.min.non.zero <- min(FDR[FDR>0])
    dea.table.volcano$FDR[FDR==0] <- FDR.min.non.zero/100
    ## define the range of x-axis and y-axis
    log2FC_lim <- max(abs(dea.table.volcano$logFC))
    FDR_lim <- -log10(min(dea.table.volcano$FDR))  # NAs already removed from dea.table
    fig.volcano <- EnhancedVolcano(dea.table.volcano, lab = gene.dea, xlab = bquote(~Log[2]~ "fold change"), x = 'logFC', y = 'FDR', pCutoff = 10e-5, col = c("grey30", "orange2", "royalblue", "red2"),
                                   FCcutoff = 1, xlim = c(-log2FC_lim-1, log2FC_lim+1), ylim = c(0, FDR_lim), title = NULL, subtitle = NULL)
  } else if (dea.tool == 'DESeq2') {
    # change the 0 padj to a low value (100 times smaller than the minimum non-zero value)
    padj <- dea.table.volcano$padj
    padj.min.non.zero <- min(padj[padj>0])
    dea.table.volcano$padj[padj==0] <- padj.min.non.zero/100
    ## define the range of x-axis and y-axis
    log2FC_lim <- max(abs(dea.table.volcano$log2FoldChange))
    padj_lim <- -log10(min(dea.table.volcano$padj))  # NAs already removed from dea.table
    fig.volcano <- EnhancedVolcano(dea.table.volcano, lab = gene.dea, xlab = bquote(~Log[2]~ "fold change"), x = 'log2FoldChange', y = 'padj', pCutoff = 10e-5, col = c("grey30", "orange2", "royalblue", "red2"),
                                   FCcutoff = 1, xlim = c(-log2FC_lim-1, log2FC_lim+1), ylim = c(0, padj_lim), title = NULL, subtitle = NULL)
  }

  # ## if the range of x-axis or y-axis gets crazy, you may also manually define it to show a subset of the genes (but note that the genes out of range will not be shown)
  # if (dea.tool == 'edgeR') {
  #   fig.volcano <- EnhancedVolcano(dea.table, lab = gene.dea, xlab = bquote(~Log[2]~ "fold change"), x = 'logFC', y = 'FDR', pCutoff = 10e-5, col = c("grey30", "orange2", "royalblue", "red2"),
  #                                  FCcutoff = 1, xlim = c(-5, 5), ylim = c(0, 10), transcriptPointSize = 1.5, title = NULL, subtitle = NULL)
  # } else if (dea.tool == 'DESeq2') {
  #   fig.volcano <- EnhancedVolcano(dea.table, lab = gene.dea, xlab = bquote(~Log[2]~ "fold change"), x = 'log2FoldChange', y = 'padj', pCutoff = 10e-5, col = c("grey30", "orange2", "royalblue", "red2"),
  #                                  FCcutoff = 1, xlim = c(-5, 5), ylim = c(0, 10), transcriptPointSize = 1.5, title = NULL, subtitle = NULL)
  # }

  as.pdf(fig.volcano, width = 9, height = 6, scaled = TRUE, file = file.path(out.path, paste('volcano_plot_', name.control, '_', name.treat, '.pdf', sep = '')))

  # heatmap
  norm.table.control <- read.table(norm.control, header = TRUE, row.names = 1)
  norm.table.treat <- read.table(norm.treat, header = TRUE, row.names = 1)
  num.control <- dim(norm.table.control)[2]
  num.treat <- dim(norm.table.treat)[2]
  norm.table <- cbind(norm.table.control, norm.table.treat)
  groups <- c(name.control, name.treat)

  # instead of using all genes, only use the top 20 genes in dea.table
  index.deg <- which(row.names(norm.table) %in% gene.id.dea[1:20])
  norm.table.deg <- norm.table[index.deg,]
  gene.id.norm.table <- rownames(norm.table.deg)
  # convert the gene id to gene symbol, and keep the ID if no symbol can be found for that ID
  gene.norm.table <- convert.id2symbol(gene.id.norm.table)

  palette <- c("#999999", "#377EB8")
  palette.group <- c(rep(palette[1], num.control), rep(palette[2], num.treat))

  ## draw heatmap
  pdf(file = file.path(out.path, paste('heatmap_', name.control, '_', name.treat, '.pdf', sep = '')), width = 15, height = 12, title = 'Heatmap using the top features')
  heatmap(as.matrix(norm.table.deg), ColSideColors = palette.group, margins = c(9,5.5), labRow = gene.norm.table, cexRow = 1.9, cexCol = 1.9)
  legend("topleft", title = 'Group', legend=groups, text.font = 15, col = palette, fill = palette, cex=1.8)
  dev.off()
}

# the main loop
for (ith.comparison in c(1:num.comparison)) {
  name.control <- controls[ith.comparison]
  name.treat <- treats[ith.comparison]
  plot.volcano.heatmap(name.control, name.treat)
}
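As the argument parsing above shows, this visualization script takes the normalized-count directory, the DEA directory and an output directory on the command line, with an optional fourth argument pointing to a config file; in the workflow it is wired up by the visualize rules shown further down. A hypothetical standalone invocation (the paths are placeholders):

Rscript scripts/visualize.R results/my_project/genome/dea/countGroup results/my_project/genome/dea/DEA results/my_project/genome/dea/visualization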
shell:
    "hisat2-build -p {config[NCORE]} {input.trans} {params.index}"
run:
    shell("scp -i {input.key} {config[NELSIN]}/{wildcards.sample}_*_R1_001.fastq.gz {output.forward}")
    shell("scp -i {input.key} {config[NELSIN]}/{wildcards.sample}_*_R2_001.fastq.gz {output.reverse}")
shell:
    "hisat2 -p {config[NCORE]} -x {params.index} -1 {input.forward} -2 {input.reverse} -S {output.sam}"
    " && samtools view -b -S {output.sam} > {output.bam}"
shell:
    "samtools sort {input.bam} -o {output.sort} && samtools index {output.sort}"
shell:
    "samtools idxstats {input.sort} > {output.count}"
shell:
    "cd ../scripts && javac -cp opencsv-1.8.jar:. sumgenescod.java && java -cp opencsv-1.8.jar:. sumgenescod codgenelist.csv {input}"
shell:
    "sh ../scripts/formatCount.sh {input.geneCount} {output.formatCount}"
run:
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward}")
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse}")
shell:
    """
    shopt -s extglob
    scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
    """
shell:
    "mkdir {output.indexes} && hisat2-build -p {config[NCORE]} {input.genome} {params.index}"
    "&& hisat2_extract_splice_sites.py {config[ANNOTATION]} > {output.splicesites}"
run:
    shell("hisat2 -p {config[NCORE]} --known-splicesite-infile {input.splicesites} -x {params.index} -1 {input.forward} -2 {input.reverse} -S {output.sam}")
    shell("samtools view -@ {config[NCORE]} -b -S {output.sam} > {output.bam}")
run:
    shell("hisat2 -p {config[NCORE]} --known-splicesite-infile {input.splicesites} -x {params.index} -U {input.forward} -S {output.sam}")
    shell("samtools view -@ {config[NCORE]} -b -S {output.sam} > {output.bam}")
shell:
    "samtools sort -@ {config[NCORE]} {input.bam} -o {output.sort}"
run:
    if config["COUNTER"]=="featureCounts":
        if config["END"]=="pair":
            shell("featureCounts -p -T {config[NCORE]} -t exon -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
        else:
            shell("featureCounts -T {config[NCORE]} -t exon -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
    elif config["COUNTER"]=="htseq-count":
        shell("htseq-count -f bam -i {config[ATTRIBUTE]} -s no -t exon {input.sort} {input.annotation} | sed '/^__/ d' > {output.count}")
shell:
    "qualimap bamqc -bam {input.sort} -nt {config[NCORE]} --java-mem-size=6G -outdir {output.bamqc}"
shell:
    "multiqc {input.bamqc} {input.count_summary} --filename {output.report}"
shell:
    "multiqc {input.count_summary} --filename {output.report}"
shell:
    "python scripts/combine2group_genome.py"
shell:
    "Rscript scripts/dea_genome.R"
shell:
    "Rscript scripts/combine2group_trans.R"
shell:
    "Rscript scripts/dea_trans.R"
shell:
    "Rscript scripts/combine2group_trans.R"
shell:
    "Rscript scripts/dea_trans.R"
run:
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward}")
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse}")
shell:
    "fastqc -t $(({config[NCORE]}+0)) -o {params.outputpath} {input.forward} && "
    "fastqc -t $(({config[NCORE]}+0)) -o {params.outputpath} {input.reverse}"
shell:
    "multiqc {params.path} --filename {output.report}"
shell:
    """
    shopt -s extglob
    scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
    """
shell:
    "fastqc -t $(({config[NCORE]}+0)) -o {params.outputpath} {input.read}"
shell:
    "multiqc {params.path} --filename {output.report}"
run:
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward_read}")
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse_read}")
shell:
    """
    shopt -s extglob
    scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
    """
shell:
    "salmon index -t {input} -i {output} --type quasi -k 31 -p {config[NCORE]}"
shell:
    """
    salmon quant -i {input.index} -l A -1 {input.forward_read} -2 {input.reverse_read} -o {output.quant_dir} -p {config[NCORE]} --seqBias --useVBOpt --validateMappings
    awk 'NR==1{{next}}{{print $1"\\t"$4}}' {output.quant_dir}/quant.sf > {output.tpm}
    """
shell:
    """
    salmon quant -i {input.index} -l A -r {input.read} -o {output.quant_dir} -p {config[NCORE]} --seqBias --useVBOpt --validateMappings
    awk 'NR==1{{next}}{{print $1"\\t"$4}}' {output.quant_dir}/quant.sf > {output.tpm}
    """
shell:
    "multiqc {input} --filename {output}"
run:
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward}")
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse}")
shell:
    "trim_galore --fastqc -j 4 --paired --basename {wildcards.sample} -o {params.outputpath} {input.forward} {input.reverse}"
shell:
    """
    shopt -s extglob
    scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
    """
shell:
    "trim_galore --fastqc -j 4 --basename {wildcards.sample} -o {params.outputpath} {input.read}"
shell:
    "multiqc {params.path} --filename {output.report}"
shell:
    "Rscript scripts/visualize.R {input.norm_path} {input.dea_path} {params.output_path}"
shell:
    "Rscript scripts/visualize.R {input.norm_path} {input.dea_path} {params.output_path}"
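The snippets above show only the shell and run bodies of RASflow's Snakemake rules; the surrounding rule declarations (names, inputs, outputs, params) live in the workflow's .rules files and are not reproduced on this page. As a rough, hypothetical sketch of how one such body sits inside a rule, assuming sample and group lists derived from the metadata table (the rule name, variable names, and input/output patterns below are illustrative, not RASflow's actual definitions):

# Hypothetical rule wrapping the "combine to group" step shown above.
import pandas as pd

configfile: "configs/config_main.yaml"

meta = pd.read_table(config["METAFILE"], header=0)
SAMPLES = meta["sample"].tolist()           # one count file per sample goes in
GROUPS = meta["group"].unique().tolist()    # one merged table per group comes out

rule combine2group_genome:
    input:
        expand(config["FINALOUTPUT"] + "/" + config["PROJECT"]
               + "/genome/countFile/{sample}_count.tsv", sample=SAMPLES)
    output:
        expand(config["FINALOUTPUT"] + "/" + config["PROJECT"]
               + "/genome/dea/countGroup/{group}_gene_count.tsv", group=GROUPS)
    shell:
        "python scripts/combine2group_genome.py"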
Support
Created: 1yr ago
Updated: 1yr ago
Maintainers:
public
URL:
https://github.com/zhxiaokang/RASflow.git
Name:
rasflow-rna-seq-analysis-snakemake-workflow
Version:
Version 1
Other Versions:
Downloaded:
0
Copyright:
Public Domain
License:
None
Keywords:
- Future updates
Related Workflows

ENCODE pipeline for histone marks developed for the psychENCODE project
psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project.
The o...

Near-real time tracking of SARS-CoV-2 in Connecticut
Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

ATLAS - Three commands to start analyzing your metagenome data
Metagenome-atlas is an easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...
raw sequence reads
Genome assembly
Annotation track
checkm2
gunc
prodigal
snakemake-wrapper-utils
MEGAHIT
Atlas
BBMap
Biopython
BioRuby
Bwa-mem2
cd-hit
CheckM
DAS
Diamond
eggNOG-mapper v2
MetaBAT 2
Minimap2
MMseqs
MultiQC
Pandas
Picard
pyfastx
SAMtools
SemiBin
Snakemake
SPAdes
SqueezeMeta
TADpole
VAMB
CONCOCT
ete3
gtdbtk
h5py
networkx
numpy
plotly
psutil
utils
metagenomics

RNA-seq workflow using STAR and DESeq2
This workflow performs a differential gene expression analysis with STAR and DESeq2. The usage of this workflow is described ...

This Snakemake pipeline implements the GATK best-practices workflow
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...