RASflow: RNA-Seq Analysis Snakemake Workflow
RASflow is a modular, flexible and user-friendly RNA-Seq analysis workflow.
RASflow can be applied to both model and non-model organisms. It supports mapping RNA-Seq raw reads to either a genome or a transcriptome (downloaded from a public database or prepared by the user), and it can perform both transcript- and gene-level Differential Expression Analysis (DEA) when a transcriptome is used as the mapping reference. Basic use requires little programming skill, and if you're good at programming you can do even more with RASflow!
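For orientation, the scripts shown below read their settings from configs/config_main.yaml. The following is a hedged, minimal sketch of that file covering only the keys those scripts and rules access; the values are illustrative placeholders, not the defaults shipped with RASflow.

PROJECT: my_project            # project name, used to build output paths
FINALOUTPUT: /data/output      # root directory of all result folders
METAFILE: configs/metadata.tsv # tab-separated table with columns: sample, group, subject
NCORE: 8                       # number of cores passed to mappers and counters
END: pair                      # paired-end or single-end reads (value used by the counting rule)
COUNTER: featureCounts         # or htseq-count
ATTRIBUTE: gene_id             # GTF attribute used for counting
ANNOTATION: /data/annotation.gtf
CONTROL: ["untreated"]         # control group(s), one entry per comparison
TREAT: ["treated"]             # treat group(s), matched to CONTROL by position
DEATOOL: edgeR                 # or DESeq2
PAIR: False                    # paired (subject-matched) design or not
FILTER:
  yesOrNo: True                # filter lowly expressed genes before DEA
FDR: 0.05                      # adjusted p-value threshold for DEGs
GENE_LEVEL: True               # also run gene-level DEA when mapping to a transcriptome
ENSEMBL: True                  # fetch tx2gene from Ensembl via biomaRt
EnsemblDataSet: hsapiens_gene_ensembl
TX2GENE: configs/tx2gene.tsv   # used instead of biomaRt when ENSEMBL is False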
Code Snippets
import yaml
import pandas as pd
import numpy as np
# import ipdb

def load_globals():
    global samples; global groups; global config; global input_path; global output_path
    with open('configs/config_main.yaml') as yamlfile:
        config = yaml.load(yamlfile)
    samples = np.array(pd.read_table(config["METAFILE"], header = 0)['sample'])
    groups = np.array(pd.read_table(config["METAFILE"], header = 0)['group'])
    input_path = config["FINALOUTPUT"] + "/" + config["PROJECT"] + "/genome/countFile"
    output_path = config["FINALOUTPUT"] + "/" + config["PROJECT"] + "/genome/dea"

def main():
    load_globals()
    groups_unique = np.unique(groups)
    num_samples = len(samples)
    num_groups_unique = len(groups_unique)
    id_list = get_id_list()
    for name_group in groups_unique:
        combine_group(name_group, id_list)

def combine_group(name_group, id_list):
    index_group = [index for index, element in enumerate(groups) if element == name_group]  # all the indices of this group
    samples_group = samples[index_group]  # the samples belonging to this group
    group_count = pd.DataFrame()
    group_count['ID'] = id_list
    for sample in samples_group:
        group_count["sample_" + str(sample)] = np.array(pd.read_table(input_path + "/" + str(sample) + "_count.tsv", header = None))[:, 1]
    # write to file
    group_count.to_csv(output_path + "/countGroup/" + str(name_group) + "_gene_count.tsv", sep = "\t", header = True, index = False)

def get_id_list():
    # extract the id list from the count file
    id_list = np.array(pd.read_table(input_path + '/' + str(samples[0]) + "_count.tsv", header = None))[:, 0]
    return id_list

if __name__ == "__main__":
    main()
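The Python script above (apparently the scripts/combine2group_genome.py helper invoked by a rule further down) merges per-sample count files into one table per group. It and the R scripts below all read the metadata table named by METAFILE, which is expected to be a tab-separated file with one row per sample and the columns sample, group and subject (subject is only used for paired designs). A hypothetical example, with made-up sample names:

sample	group	subject
A1	untreated	1
A2	untreated	2
B1	treated	1
B2	treated	2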
library(biomaRt)
library(yaml)
library(tximport)

# ====================== define some functions ======================
# remove the version in the transcript ID
remove_version <- function(files) {
  # input files (file names with directory) are output from Salmon
  for (i in c(1:length(files))) {
    quant.file <- files[i]
    quant.table <- read.table(quant.file, header = TRUE, stringsAsFactors = FALSE)
    trans.id.version <- quant.table$Name
    trans.id <- rep('ID', length(trans.id.version))
    for (j in c(1:length(trans.id.version))) {
      trans.id[j] <- strsplit(trans.id.version[j], ".", fixed = TRUE)[[1]][1]
    }
    quant.table$Name <- trans.id
    write.table(quant.table, file.path(input.path, samples[i], "quant_noVersion.sf"), sep = "\t", quote = FALSE, row.names = FALSE)
  }
  return(trans.id)
}

# ====================== load parameters in config file ======================
# load the config file
yaml.file <- yaml.load_file('configs/config_main.yaml')

# extract the information from the yaml file
project <- yaml.file$PROJECT  # project name
input.path <- file.path(yaml.file$FINALOUTPUT, project, "trans/quant")
gene.level <- yaml.file$GENE_LEVEL
controls <- yaml.file$CONTROL  # all groups used as control
treats <- yaml.file$TREAT  # all groups used as treat, should correspond to control
meta.file <- yaml.file$METAFILE
ENSEMBL <- yaml.file$ENSEMBL
dataset <- yaml.file$EnsemblDataSet
tx2gene.file <- yaml.file$TX2GENE
output.path <- file.path(yaml.file$FINALOUTPUT, project, "trans/dea")

num.control <- length(controls)  # number of comparisons that the user wants to do
num.treat <- length(treats)  # should equal num.control

if (num.control != num.treat) {
  message("Error: Control groups don't match with treat groups!")
  message("Please check config_dea.yaml")
  quit(save = 'no')
}

# extract the metadata
meta.data <- read.table(meta.file, header = TRUE, sep = '\t', stringsAsFactors = FALSE)
samples <- meta.data$sample
group.all <- meta.data$group
subject.all <- meta.data$subject

# ====================== prepare the quant files ======================
# the original quant files from Salmon
files <- file.path(input.path, samples, "quant.sf")
names(files) <- samples

# get the normalized abundance tables for all samples (scaled using the average transcript length over samples and then the library size (lengthScaledTPM))

# ====================== prepare the tx2gene table ======================
if (gene.level) {
  if (ENSEMBL) {
    ensembl <- useEnsembl(biomart = "ensembl", dataset = dataset)
    datasets <- listDatasets(ensembl)
    attributes <- listAttributes(mart = ensembl)
    # remove the version to get more matches
    trans.id <- remove_version(files)
    files.noVersion <- file.path(input.path, samples, "quant_noVersion.sf")
    names(files.noVersion) <- samples
    tx2gene <- getBM(attributes = c('ensembl_transcript_id', 'ensembl_gene_id'), filters = 'ensembl_transcript_id', values = trans.id, mart = ensembl)
  } else {
    tx2gene <- read.csv(tx2gene.file, header = FALSE, sep = "\t")
  }
  # save tx2gene
  output.file.tx2gene <- file.path(output.path, "countGroup", 'tx2gene.RData')
  save(tx2gene, file = output.file.tx2gene)
}

# ====================== get raw and normalized abundance tables ======================
trans.matrix <- tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "no")
trans.count <- trans.matrix$counts
trans.matrix.tpm <- tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "lengthScaledTPM")
trans.count.tpm <- trans.matrix.tpm$counts

if (gene.level) {
  if (ENSEMBL) {
    gene.matrix <- tximport(files.noVersion, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "no")
    gene.count <- gene.matrix$counts
    gene.matrix.tpm <- tximport(files.noVersion, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
    gene.count.tpm <- gene.matrix.tpm$counts
  } else {
    gene.matrix <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "no")
    gene.count <- gene.matrix$counts
    gene.matrix.tpm <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
    gene.count.tpm <- gene.matrix.tpm$counts
  }
}

# ====================== combine the samples to groups ======================
for (group in unique(group.all)) {
  index.group = which(group.all == group)  # all the indices of this group
  samples.group = samples[index.group]  # the samples belonging to this group
  group.count.trans <- trans.count[, index.group]
  group.count.trans.tpm <- trans.count.tpm[, index.group]
  if (gene.level) {
    group.count.gene <- gene.count[, index.group]
    group.count.gene.tpm <- gene.count.tpm[, index.group]
  }
  # write to files
  output.file.trans <- file.path(output.path, "countGroup", paste(group, "_trans_abundance.tsv", sep = ""))
  write.table(group.count.trans, output.file.trans, sep = '\t', quote = FALSE, row.names = TRUE, col.names = TRUE)
  output.file.trans.norm <- file.path(output.path, "countGroup", paste(group, "_trans_norm.tsv", sep = ""))
  write.table(group.count.trans.tpm, output.file.trans.norm, sep = '\t', quote = FALSE, row.names = TRUE, col.names = TRUE)
  if (gene.level) {
    output.file.gene <- file.path(output.path, "countGroup", paste(group, "_gene_abundance.tsv", sep = ""))
    write.table(group.count.gene, output.file.gene, sep = '\t', quote = FALSE, row.names = TRUE, col.names = TRUE)
    output.file.gene.norm <- file.path(output.path, "countGroup", paste(group, "_gene_norm.tsv", sep = ""))
    write.table(group.count.gene.tpm, output.file.gene.norm, sep = '\t', quote = FALSE, row.names = TRUE, col.names = TRUE)
  }
}
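In the script above, the tx2gene table built by getBM has the two columns ensembl_transcript_id and ensembl_gene_id. When a custom TX2GENE file is supplied instead (ENSEMBL set to False), it is read as a headerless two-column tab-separated file with the transcript ID first and the gene ID second, which is the order tximport expects. An illustrative sketch of such a file:

ENST00000456328	ENSG00000223972
ENST00000450305	ENSG00000223972
ENST00000488147	ENSG00000227232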
library(yaml)
library(edgeR)
library(DESeq2)

# ====================== define the function of DEA ======================
DEA <- function(control, treat) {
  count.control <- read.table(paste(output.path, '/countGroup/', control, '_gene_count.tsv', sep = ''), header = TRUE, row.names = 1)
  count.treat <- read.table(paste(output.path, '/countGroup/', treat, '_gene_count.tsv', sep = ''), header = TRUE, row.names = 1)
  count.table <- cbind(count.control, count.treat)  # merge the control and treat tables together

  # number of samples in control and treat groups (should be the same if it's a pair test)
  num.sample <- ncol(count.table)
  num.sample.control <- ncol(count.control)
  num.sample.treat <- ncol(count.treat)

  # samples of the two groups
  sample.control <- colnames(count.control)
  sample.treat <- colnames(count.treat)

  # save gene list in gene.list for extracting gene names later
  gene.list <- rownames(count.table)

  # get the sample id
  samples <- colnames(count.table)

  # define the subject and group; define the reference group (control)
  subject <- factor(subject.all[c(which(group.all == control), which(group.all == treat))])
  group <- relevel(factor(group.all[c(which(group.all == control), which(group.all == treat))]), ref = control)

  # The design matrix
  if (pair.test) {
    design <- model.matrix(~subject+group)
  } else {
    design <- model.matrix(~group)
  }

  if (dea.tool == 'edgeR') {  # use edgeR for DEA
    # normalize the two groups and save the normalized count tables
    y.control <- DGEList(counts = count.control, genes = gene.list)
    y.treat <- DGEList(counts = count.treat, genes = gene.list)
    y.control <- calcNormFactors(y.control, method="TMM")
    count.table.control.norm <- cpm(y.control)
    write.table(count.table.control.norm, paste(output.path, '/countGroup/', control, '_gene_norm.tsv', sep = ''), quote = FALSE, sep = "\t")
    y.treat <- calcNormFactors(y.treat, method="TMM")
    count.table.treat.norm <- cpm(y.treat)
    write.table(count.table.treat.norm, paste(output.path, '/countGroup/', treat, '_gene_norm.tsv', sep = ''), quote = FALSE, sep = "\t")

    # Put the data into a DGEList object
    y <- DGEList(counts = count.table, genes = gene.list)

    # do DEA
    # Filtering
    if (filter.need) {
      countsPerMillion <- cpm(y)
      countCheck <- countsPerMillion > 1
      keep <- which(rowSums(countCheck) > 1)
      y <- y[keep, ]
    }

    # Normalization
    y <- calcNormFactors(y, method="TMM")
    y$samples$subject <- subject
    y$samples$group <- group
    rownames(design) <- colnames(y)

    # Estimating the dispersion
    # estimate the NB dispersion for the dataset
    y <- estimateDisp(y, design, robust = TRUE)

    # Differential expression
    # determine differentially expressed genes
    # fit genewise glms
    fit <- glmFit(y, design)
    # conduct likelihood ratio tests for treat vs control differences and show the top genes
    lrt <- glmLRT(fit)

    # the DEA result for all the genes
    # dea <- lrt$table
    toptag <- topTags(lrt, n = nrow(y$genes), p.value = 1)
    dea <- toptag$table  # just to add one more column of FDR
    dea <- dea[order(dea$FDR, -abs(dea$logFC), decreasing = FALSE), ]  # sort the table: ascending FDR then descending absolute value of logFC

    # differentially expressed genes
    toptag <- topTags(lrt, n = nrow(y$genes), p.value = fdr.thr)
    deg <- toptag$table
    if (!is.null(deg)) {
      deg <- deg[order(deg$FDR, -abs(deg$logFC), decreasing = FALSE), ]  # sort the table: ascending FDR then descending absolute value of logFC
    }

    # save the DEA result and DEGs to files
    write.table(dea, paste(output.path, '/DEA/dea_', control, '_', treat, '.tsv', sep = ''), row.names = F, quote = FALSE, sep = '\t')
    write.table(deg, paste(output.path, '/DEA/deg_', control, '_', treat, '.tsv', sep = ''), row.names = F, quote = FALSE, sep = '\t')
  } else if (dea.tool == "DESeq2") {  # use DESeq2 for DEA
    ## create the DESeqDataSet
    colData = data.frame(samples, subject, group)
    dds <- DESeqDataSetFromMatrix(count.table, colData = colData, design = design)

    # generate normalized counts
    dds <- estimateSizeFactors(dds)
    normalized_counts <- counts(dds, normalized=TRUE)
    normalized_counts.control <- normalized_counts[, which(colnames(normalized_counts) %in% sample.control)]
    write.table(normalized_counts.control, paste(output.path, '/countGroup/', control, '_gene_norm.tsv', sep = ''), quote = FALSE, sep = "\t")
    normalized_counts.treat <- normalized_counts[, which(colnames(normalized_counts) %in% sample.treat)]
    write.table(normalized_counts.treat, paste(output.path, '/countGroup/', treat, '_gene_norm.tsv', sep = ''), quote = FALSE, sep = "\t")

    ## Filtering
    if (filter.need) {
      keep <- rowSums(counts(dds)) >= 10
      dds <- dds[keep,]
    }

    ## perform DEA
    dds <- DESeq(dds)

    ## export the results
    res.dea <- results(dds)
    res.dea <- res.dea[complete.cases(res.dea), ]  # remove any rows with NA
    dea <- as.data.frame(res.dea)
    dea <- dea[order(dea$padj, -abs(dea$log2FoldChange), decreasing = FALSE), ]  # sort the table: ascending padj then descending absolute value of log2FoldChange
    deg <- dea[dea$padj < fdr.thr, ]
    if (nrow(deg) > 1) {
      deg <- deg[order(deg$padj, -abs(deg$log2FoldChange), decreasing = FALSE), ]  # sort the table: ascending padj then descending absolute value of log2FoldChange
    }

    # save the DEA result and DEGs to files
    write.table(dea, paste(output.path, '/DEA/dea_', control, '_', treat, '.tsv', sep = ''), row.names = T, quote = FALSE, sep = '\t')
    write.table(deg, paste(output.path, '/DEA/deg_', control, '_', treat, '.tsv', sep = ''), row.names = T, quote = FALSE, sep = '\t')
  }
}

# load the config file
yaml.file <- yaml.load_file('configs/config_main.yaml')

# extract the information from the yaml file
project <- yaml.file$PROJECT  # project name of this analysis
dea.tool <- yaml.file$DEATOOL  # tool used for DEA
controls <- yaml.file$CONTROL  # all groups used as control
treats <- yaml.file$TREAT  # all groups used as treat, should correspond to control
filter.need <- yaml.file$FILTER$yesOrNo
pair.test <- yaml.file$PAIR
fdr.thr <- yaml.file$FDR  # threshold of FDR/adjusted P-value for significantly differentially expressed genes
meta.file <- yaml.file$METAFILE
output.path <- file.path(yaml.file$FINALOUTPUT, project, "genome/dea")

# extract the metadata
meta.data <- read.csv(meta.file, header = TRUE, sep = '\t')
group.all <- meta.data$group
subject.all <- meta.data$subject

num.control <- length(controls)  # number of comparisons that the user wants to do
num.treat <- length(treats)  # should equal num.control

if (num.control != num.treat) {
  message("Error: Control groups don't match with treat groups!")
  message("Please check config_dea.yaml")
  quit(save = 'no')
}

num.comparison <- num.control

# Do DEA
# the main loop
for (ith.comparison in c(1:num.comparison)) {
  control <- controls[ith.comparison]
  treat <- treats[ith.comparison]
  DEA(control, treat)
}
library(biomaRt)
library(yaml)
library(edgeR)
library(DESeq2)
library(tximport)

# ====================== define the function of DEA ======================
DEA <- function(control, treat, file.control, file.treat, output.path.dea) {
  count.control <- read.table(file.control, header = TRUE, row.names = 1, check.names = FALSE)
  count.treat <- read.table(file.treat, header = TRUE, row.names = 1, check.names = FALSE)
  count.table <- cbind(count.control, count.treat)  # merge the control and treat tables together

  # number of samples in control and treat groups (should be the same if it's a pair test)
  num.sample.control <- ncol(count.control)
  num.sample.treat <- ncol(count.treat)

  # save gene list in gene.list for extracting gene names later
  gene.list <- rownames(count.table)

  # get the sample id
  samples <- colnames(count.table)

  # define the subject and group; define the reference group (control)
  subject <- factor(subject.all[c(which(group.all == control), which(group.all == treat))])
  group <- relevel(factor(group.all[c(which(group.all == control), which(group.all == treat))]), ref = control)

  # The design matrix
  if (pair.test) {
    design <- model.matrix(~subject+group)
  } else {
    design <- model.matrix(~group)
  }

  if (dea.tool == 'edgeR') {  # use edgeR for DEA
    # Put the data into a DGEList object
    y <- DGEList(counts = count.table, genes = gene.list)

    # Filtering
    if (filter.need) {
      countsPerMillion <- cpm(y)
      countCheck <- countsPerMillion > 1
      keep <- which(rowSums(countCheck) > 1)
      y <- y[keep, ]
    }

    # Normalization
    y <- calcNormFactors(y, method="TMM")
    y$samples$subject <- subject
    y$samples$group <- group
    rownames(design) <- colnames(y)

    # Estimating the dispersion
    # estimate the NB dispersion for the dataset
    y <- estimateDisp(y, design, robust = TRUE)

    # Differential expression
    # determine differentially expressed genes
    # fit genewise glms
    fit <- glmFit(y, design)
    # conduct likelihood ratio tests for treat vs control differences and show the top genes
    lrt <- glmLRT(fit)

    # the DEA result for all the genes
    # dea <- lrt$table
    toptag <- topTags(lrt, n = nrow(y$genes), p.value = 1)
    dea <- toptag$table  # just to add one more column of FDR
    dea <- dea[order(dea$FDR, -abs(dea$logFC), decreasing = FALSE), ]  # sort the table: ascending FDR then descending absolute value of logFC

    # differentially expressed genes
    toptag <- topTags(lrt, n = nrow(y$genes), p.value = fdr.thr)
    deg <- toptag$table
    if (!is.null(deg)) {
      deg <- deg[order(deg$FDR, -abs(deg$logFC), decreasing = FALSE), ]  # sort the table: ascending FDR then descending absolute value of logFC
    }

    # save the DEA result and DEGs to files
    write.table(dea, paste(output.path.dea, '/dea_', control, '_', treat, '.tsv', sep = ''), row.names = F, quote = FALSE, sep = '\t')
    write.table(deg, paste(output.path.dea, '/deg_', control, '_', treat, '.tsv', sep = ''), row.names = F, quote = FALSE, sep = '\t')
  } else if (dea.tool == "DESeq2") {  # use DESeq2 for DEA
    ## prepare txi
    if (gene.level & gene.level.flag) {
      ### load tx2gene
      load(file.path(output.path, "countGroup", 'tx2gene.RData'))
      if (ENSEMBL) {
        files.noVersion <- file.path(quant.path, samples, "quant_noVersion.sf")
        names(files.noVersion) <- samples
        ### import quantification as txi
        txi <- tximport(files.noVersion, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "no")
      } else {
        ### the original quant files from Salmon
        files <- file.path(quant.path, samples, "quant.sf")
        names(files) <- samples
        ### import quantification as txi
        txi <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "no")
      }
    } else {
      ### the original quant files from Salmon
      files <- file.path(quant.path, samples, "quant.sf")
      names(files) <- samples
      ### import quantification as txi
      txi <- tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "no")
    }

    ## create the DESeqDataSet
    colData = data.frame(samples, subject, group)
    dds <- DESeqDataSetFromTximport(txi, colData = colData, design = design)

    # Filtering
    if (filter.need) {
      keep <- rowSums(counts(dds)) >= 10
      dds <- dds[keep,]
    }

    ## specify the control group
    dds$group <- relevel(dds$group, ref = control)

    ## perform DEA
    dds <- DESeq(dds)

    ## export the results
    res.dea <- results(dds)
    res.dea <- res.dea[complete.cases(res.dea), ]  # remove any rows with NA
    dea <- as.data.frame(res.dea)
    dea <- dea[order(dea$padj, -abs(dea$log2FoldChange), decreasing = FALSE), ]  # sort the table: ascending padj then descending absolute value of log2FoldChange
    deg <- dea[dea$padj < fdr.thr, ]
    if (nrow(deg) > 1) {
      deg <- deg[order(deg$padj, -abs(deg$log2FoldChange), decreasing = FALSE), ]  # sort the table: ascending padj then descending absolute value of log2FoldChange
    }

    # save the DEA result and DEGs to files
    write.table(dea, paste(output.path.dea, '/dea_', control, '_', treat, '.tsv', sep = ''), row.names = T, quote = FALSE, sep = '\t')
    write.table(deg, paste(output.path.dea, '/deg_', control, '_', treat, '.tsv', sep = ''), row.names = T, quote = FALSE, sep = '\t')
  }
}

# ====================== load parameters in config file ======================
# load the config file
yaml.file <- yaml.load_file('configs/config_main.yaml')

# extract the information from the yaml file
project <- yaml.file$PROJECT  # project name
dea.tool <- yaml.file$DEATOOL  # tool used for DEA
quant.path <- file.path(yaml.file$FINALOUTPUT, project, "trans/quant")
gene.level <- yaml.file$GENE_LEVEL  # whether to do gene-level analysis
controls <- yaml.file$CONTROL  # all groups used as control
treats <- yaml.file$TREAT  # all groups used as treat, should correspond to control
filter.need <- yaml.file$FILTER$yesOrNo
pair.test <- yaml.file$PAIR
meta.file <- yaml.file$METAFILE
ENSEMBL <- yaml.file$ENSEMBL
dataset <- yaml.file$EnsemblDataSet
output.path <- file.path(yaml.file$FINALOUTPUT, project, "trans/dea")
fdr.thr <- yaml.file$FDR  # threshold of FDR/adjusted P-value for significantly differentially expressed genes

num.control <- length(controls)  # number of comparisons that the user wants to do
num.treat <- length(treats)  # should equal num.control

if (num.control != num.treat) {
  message("Error: Control groups don't match with treat groups!")
  message("Please check config_dea.yaml")
  quit(save = 'no')
}

num.comparison <- num.control

# extract the metadata
meta.data <- read.csv(meta.file, header = TRUE, sep = '\t', stringsAsFactors = FALSE)
samples <- meta.data$sample
group.all <- meta.data$group
subject.all <- meta.data$subject

# ====================== Do DEA ======================
for (ith.comparison in c(1:num.comparison)) {
  control <- controls[ith.comparison]
  treat <- treats[ith.comparison]

  # --------------------- On transcript level ---------------------
  file.control <- paste(output.path, '/countGroup/', control, '_trans_abundance.tsv', sep = '')
  file.treat <- paste(output.path, '/countGroup/', treat, '_trans_abundance.tsv', sep = '')
  output.path.dea <- paste(output.path, '/DEA/transcript-level', sep = '')
  gene.level.flag <- FALSE
  DEA(control, treat, file.control, file.treat, output.path.dea)

  # --------------------- On gene level ---------------------
  if (gene.level) {
    file.control <- paste(output.path, '/countGroup/', control, '_gene_abundance.tsv', sep = '')
    file.treat <- paste(output.path, '/countGroup/', treat, '_gene_abundance.tsv', sep = '')
    output.path.dea <- paste(output.path, '/DEA/gene-level', sep = '')
    gene.level.flag <- TRUE
    DEA(control, treat, file.control, file.treat, output.path.dea)
  }
}
awk -F'"' '{print $2"\t"$6}' $1 > $2
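A brief note on this one-liner: it splits each input line on double-quote characters and prints the second and sixth fields (i.e. the first and third quoted values) separated by a tab, reading from the first script argument and writing to the second. Judging from the rule above that runs scripts/formatCount.sh on the raw gene count file, this is presumably that helper. Assuming the upstream Java counter writes quoted CSV (an assumption, since its output format is not shown here), an input line such as "gene1","x","42" would be reduced to gene1 and 42 separated by a tab; the middle value here is a placeholder.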
if (!require("plotscale")) install.packages('scripts/plotscale_0.1.6.tar.gz', repos = NULL, type="source")
library(yaml)
library(hash)
library(mygene)
library(EnhancedVolcano)
library(plotscale)

# ====================== load parameters in config file ======================
# passing the params from command line
args <- commandArgs(TRUE)
norm.path <- args[1]
dea.path <- args[2]
out.path <- args[3]

# load the config file
if (length(args) > 3) {  # this script is used in visualize_test.rules
  yaml.file <- yaml.load_file(args[4])
} else {  # this script is used in visualize.rules
  yaml.file <- yaml.load_file('configs/config_main.yaml')
}

# extract the information from the yaml file
controls <- yaml.file$CONTROL
treats <- yaml.file$TREAT
dea.tool <- yaml.file$DEATOOL

# check the number of comparisons
num.control <- length(controls)  # number of comparisons that the user wants to do
num.treat <- length(treats)  # should equal num.control

if (num.control != num.treat) {
  message("Error: Control groups don't match with treat groups!")
  message("Please check config_dea.yaml")
  quit(save = 'no')
}

num.comparison <- num.control

convert.id2symbol <- function(gene.id) {
  gene.symbol <- gene.id  # initialize the gene symbol with the gene id
  # it may happen that no symbol can be found for any id. In that case, "queryMany" will throw an error,
  # so "try" is used here to take care of that error
  try({
    gene.symbol.all <- queryMany(gene.id, scopes = 'ensembl.gene', fields = 'symbol')
    h <- hash()
    for (i in 1:nrow(gene.symbol.all)) {
      query <- gene.symbol.all$query[i]
      symbol <- gene.symbol.all$symbol[i]
      if (has.key(query, h)) {  # if there's a duplicate for the same query
        h[[query]] <- paste(hash::values(h, keys = query), symbol, sep = ', ')
      } else {
        if (is.na(symbol)) {  # if there's no hit for the query, keep the original id
          h[[query]] <- query
        } else {
          h[[query]] <- symbol
        }
      }
    }
    for (i in c(1:length(gene.symbol))) {
      gene.symbol[i] <- h[[gene.id[i]]]
    }
  })
  return(gene.symbol)
}

# function to plot volcano plot and heatmap
plot.volcano.heatmap <- function(name.control, name.treat) {
  file.dea.table <- paste(dea.path, "/dea_", name.control, "_", name.treat, ".tsv", sep = "")
  norm.control <- paste(norm.path, "/", name.control, "_gene_norm.tsv", sep = "")  # normalized table of control
  norm.treat <- paste(norm.path, "/", name.treat, "_gene_norm.tsv", sep = "")  # normalized table of treat

  dea.table <- read.table(file.dea.table, header = TRUE, row.names = 1)
  # sort the dea table: ascending FDR then descending absolute value of logFC
  if (dea.tool == 'edgeR') {
    dea.table <- dea.table[order(dea.table$FDR, -abs(dea.table$logFC), decreasing = FALSE), ]
  } else if (dea.tool == 'DESeq2') {
    dea.table <- dea.table[order(dea.table$padj, -abs(dea.table$log2FoldChange), decreasing = FALSE), ]
  }
  gene.id.dea <- row.names(dea.table)
  # convert the gene id to gene symbol, and keep the ID if no symbol can be found for that ID
  gene.dea <- convert.id2symbol(gene.id.dea)

  # volcano plot
  dea.table.volcano <- dea.table  # for a better volcano plot, 0 FDRs/padj will be changed to a very low value
  if (dea.tool == 'edgeR') {
    # change the 0 FDR to a low value (100 times smaller than the minimum non-zero value)
    FDR <- dea.table.volcano$FDR
    FDR.min.non.zero <- min(FDR[FDR>0])
    dea.table.volcano$FDR[FDR==0] <- FDR.min.non.zero/100
    ## define the range of x-axis and y-axis
    log2FC_lim <- max(abs(dea.table.volcano$logFC))
    FDR_lim <- -log10(min(dea.table.volcano$FDR))  # NAs already removed from dea.table
    fig.volcano <- EnhancedVolcano(dea.table.volcano, lab = gene.dea, xlab = bquote(~Log[2]~ "fold change"), x = 'logFC', y = 'FDR', pCutoff = 10e-5, col = c("grey30", "orange2", "royalblue", "red2"),
                                   FCcutoff = 1, xlim = c(-log2FC_lim-1, log2FC_lim+1), ylim = c(0, FDR_lim), title = NULL, subtitle = NULL)
  } else if (dea.tool == 'DESeq2') {
    # change the 0 padj to a low value (100 times smaller than the minimum non-zero value)
    padj <- dea.table.volcano$padj
    padj.min.non.zero <- min(padj[padj>0])
    dea.table.volcano$padj[padj==0] <- padj.min.non.zero/100
    ## define the range of x-axis and y-axis
    log2FC_lim <- max(abs(dea.table.volcano$log2FoldChange))
    padj_lim <- -log10(min(dea.table.volcano$padj))  # NAs already removed from dea.table
    fig.volcano <- EnhancedVolcano(dea.table.volcano, lab = gene.dea, xlab = bquote(~Log[2]~ "fold change"), x = 'log2FoldChange', y = 'padj', pCutoff = 10e-5, col = c("grey30", "orange2", "royalblue", "red2"),
                                   FCcutoff = 1, xlim = c(-log2FC_lim-1, log2FC_lim+1), ylim = c(0, padj_lim), title = NULL, subtitle = NULL)
  }

  # ## if the range of x-axis or y-axis gets crazy, you may also manually define it to show a subset of the genes (but note that the genes out of range will not be shown)
  # if (dea.tool == 'edgeR') {
  #   fig.volcano <- EnhancedVolcano(dea.table, lab = gene.dea, xlab = bquote(~Log[2]~ "fold change"), x = 'logFC', y = 'FDR', pCutoff = 10e-5, col = c("grey30", "orange2", "royalblue", "red2"),
  #                                  FCcutoff = 1, xlim = c(-5, 5), ylim = c(0, 10), transcriptPointSize = 1.5, title = NULL, subtitle = NULL)
  # } else if (dea.tool == 'DESeq2') {
  #   fig.volcano <- EnhancedVolcano(dea.table, lab = gene.dea, xlab = bquote(~Log[2]~ "fold change"), x = 'log2FoldChange', y = 'padj', pCutoff = 10e-5, col = c("grey30", "orange2", "royalblue", "red2"),
  #                                  FCcutoff = 1, xlim = c(-5, 5), ylim = c(0, 10), transcriptPointSize = 1.5, title = NULL, subtitle = NULL)
  # }

  as.pdf(fig.volcano, width = 9, height = 6, scaled = TRUE, file = file.path(out.path, paste('volcano_plot_', name.control, '_', name.treat, '.pdf', sep = '')))

  # heatmap
  norm.table.control <- read.table(norm.control, header = TRUE, row.names = 1)
  norm.table.treat <- read.table(norm.treat, header = TRUE, row.names = 1)
  num.control <- dim(norm.table.control)[2]
  num.treat <- dim(norm.table.treat)[2]
  norm.table <- cbind(norm.table.control, norm.table.treat)
  groups <- c(name.control, name.treat)

  # instead of using all genes, only use the top 20 genes in dea.table
  index.deg <- which(row.names(norm.table) %in% gene.id.dea[1:20])
  norm.table.deg <- norm.table[index.deg,]
  gene.id.norm.table <- rownames(norm.table.deg)
  # convert the gene id to gene symbol, and keep the ID if no symbol can be found for that ID
  gene.norm.table <- convert.id2symbol(gene.id.norm.table)

  palette <- c("#999999", "#377EB8")
  palette.group <- c(rep(palette[1], num.control), rep(palette[2], num.treat))

  ## draw heatmap
  pdf(file = file.path(out.path, paste('heatmap_', name.control, '_', name.treat, '.pdf', sep = '')), width = 15, height = 12, title = 'Heatmap using the top features')
  heatmap(as.matrix(norm.table.deg), ColSideColors = palette.group, margins = c(9,5.5), labRow = gene.norm.table, cexRow = 1.9, cexCol = 1.9)
  legend("topleft", title = 'Group', legend=groups, text.font = 15, col = palette, fill = palette, cex=1.8)
  dev.off()
}

# the main loop
for (ith.comparison in c(1:num.comparison)) {
  name.control <- controls[ith.comparison]
  name.treat <- treats[ith.comparison]
  plot.volcano.heatmap(name.control, name.treat)
}
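As the argument parsing above shows, this visualization script takes the normalized-count directory, the DEA directory and an output directory on the command line, with an optional fourth argument pointing to a config file; in the workflow it is wired up by the visualize rules shown further down. A hypothetical standalone invocation (the paths are placeholders):

Rscript scripts/visualize.R results/my_project/genome/dea/countGroup results/my_project/genome/dea/DEA results/my_project/genome/dea/visualization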
shell:
    "hisat2-build -p {config[NCORE]} {input.trans} {params.index}"
run:
    shell("scp -i {input.key} {config[NELSIN]}/{wildcards.sample}_*_R1_001.fastq.gz {output.forward}")
    shell("scp -i {input.key} {config[NELSIN]}/{wildcards.sample}_*_R2_001.fastq.gz {output.reverse}")
shell:
    "hisat2 -p {config[NCORE]} -x {params.index} -1 {input.forward} -2 {input.reverse} -S {output.sam}"
    " && samtools view -b -S {output.sam} > {output.bam}"
shell:
    "samtools sort {input.bam} -o {output.sort} && samtools index {output.sort}"
shell:
    "samtools idxstats {input.sort} > {output.count}"
shell:
    "cd ../scripts && javac -cp opencsv-1.8.jar:. sumgenescod.java && java -cp opencsv-1.8.jar:. sumgenescod codgenelist.csv {input}"
shell:
    "sh ../scripts/formatCount.sh {input.geneCount} {output.formatCount}"
run:
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward}")
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse}")
shell:
    """
    shopt -s extglob
    scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
    """
shell:
    "mkdir {output.indexes} && hisat2-build -p {config[NCORE]} {input.genome} {params.index}"
    "&& hisat2_extract_splice_sites.py {config[ANNOTATION]} > {output.splicesites}"
run:
    shell("hisat2 -p {config[NCORE]} --known-splicesite-infile {input.splicesites} -x {params.index} -1 {input.forward} -2 {input.reverse} -S {output.sam}")
    shell("samtools view -@ {config[NCORE]} -b -S {output.sam} > {output.bam}")
run:
    shell("hisat2 -p {config[NCORE]} --known-splicesite-infile {input.splicesites} -x {params.index} -U {input.forward} -S {output.sam}")
    shell("samtools view -@ {config[NCORE]} -b -S {output.sam} > {output.bam}")
shell:
    "samtools sort -@ {config[NCORE]} {input.bam} -o {output.sort}"
run:
    if config["COUNTER"]=="featureCounts":
        if config["END"]=="pair":
            shell("featureCounts -p -T {config[NCORE]} -t exon -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
        else:
            shell("featureCounts -T {config[NCORE]} -t exon -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
    elif config["COUNTER"]=="htseq-count":
        shell("htseq-count -f bam -i {config[ATTRIBUTE]} -s no -t exon {input.sort} {input.annotation} | sed '/^__/ d' > {output.count}")
shell:
    "qualimap bamqc -bam {input.sort} -nt {config[NCORE]} --java-mem-size=6G -outdir {output.bamqc}"
shell:
    "multiqc {input.bamqc} {input.count_summary} --filename {output.report}"
shell:
    "multiqc {input.count_summary} --filename {output.report}"
shell:
    "python scripts/combine2group_genome.py"
shell:
    "Rscript scripts/dea_genome.R"
shell:
    "Rscript scripts/combine2group_trans.R"
shell:
    "Rscript scripts/dea_trans.R"
shell:
    "Rscript scripts/combine2group_trans.R"
shell:
    "Rscript scripts/dea_trans.R"
run:
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward}")
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse}")
shell:
    "fastqc -t $(({config[NCORE]}+0)) -o {params.outputpath} {input.forward} && "
    "fastqc -t $(({config[NCORE]}+0)) -o {params.outputpath} {input.reverse}"
shell:
    "multiqc {params.path} --filename {output.report}"
shell:
    """
    shopt -s extglob
    scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
    """
shell:
    "fastqc -t $(({config[NCORE]}+0)) -o {params.outputpath} {input.read}"
shell:
    "multiqc {params.path} --filename {output.report}"
run:
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward_read}")
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse_read}")
shell:
    """
    shopt -s extglob
    scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
    """
shell:
    "salmon index -t {input} -i {output} --type quasi -k 31 -p {config[NCORE]}"
shell:
    """
    salmon quant -i {input.index} -l A -1 {input.forward_read} -2 {input.reverse_read} -o {output.quant_dir} -p {config[NCORE]} --seqBias --useVBOpt --validateMappings
    awk 'NR==1{{next}}{{print $1"\\t"$4}}' {output.quant_dir}/quant.sf > {output.tpm}
    """
shell:
    """
    salmon quant -i {input.index} -l A -r {input.read} -o {output.quant_dir} -p {config[NCORE]} --seqBias --useVBOpt --validateMappings
    awk 'NR==1{{next}}{{print $1"\\t"$4}}' {output.quant_dir}/quant.sf > {output.tpm}
    """
shell:
    "multiqc {input} --filename {output}"
run:
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward}")
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse}")
shell:
    "trim_galore --fastqc -j 4 --paired --basename {wildcards.sample} -o {params.outputpath} {input.forward} {input.reverse}"
shell:
    """
    shopt -s extglob
    scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
    """
shell:
    "trim_galore --fastqc -j 4 --basename {wildcards.sample} -o {params.outputpath} {input.read}"
shell:
    "multiqc {params.path} --filename {output.report}"
shell:
    "Rscript scripts/visualize.R {input.norm_path} {input.dea_path} {params.output_path}"
shell:
    "Rscript scripts/visualize.R {input.norm_path} {input.dea_path} {params.output_path}"
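The snippets above show only the shell and run bodies of RASflow's Snakemake rules; the surrounding rule declarations (names, inputs, outputs, params) live in the workflow's .rules files and are not reproduced on this page. As a rough, hypothetical sketch of how one such body sits inside a rule, assuming sample and group lists derived from the metadata table (the rule name, variable names, and input/output patterns below are illustrative, not RASflow's actual definitions):

# Hypothetical rule wrapping the "combine to group" step shown above.
import pandas as pd

configfile: "configs/config_main.yaml"

meta = pd.read_table(config["METAFILE"], header=0)
SAMPLES = meta["sample"].tolist()           # one count file per sample goes in
GROUPS = meta["group"].unique().tolist()    # one merged table per group comes out

rule combine2group_genome:
    input:
        expand(config["FINALOUTPUT"] + "/" + config["PROJECT"]
               + "/genome/countFile/{sample}_count.tsv", sample=SAMPLES)
    output:
        expand(config["FINALOUTPUT"] + "/" + config["PROJECT"]
               + "/genome/dea/countGroup/{group}_gene_count.tsv", group=GROUPS)
    shell:
        "python scripts/combine2group_genome.py"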
Support
Created: 1yr ago
Updated: 1yr ago
Maintainers:
public
URL:
https://github.com/zhxiaokang/RASflow.git
Name:
rasflow-rna-seq-analysis-snakemake-workflow
Version:
Version 1
Other Versions:
Downloaded:
0
Copyright:
Public Domain
License:
None
Keywords:
- Future updates
Related Workflows

ENCODE pipeline for histone marks developed for the psychENCODE project
psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project.
The o...

Near-real time tracking of SARS-CoV-2 in Connecticut
Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

ATLAS - Three commands to start analyzing your metagenome data
Metagenome-atlas is an easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...
raw sequence reads
Genome assembly
Annotation track
checkm2
gunc
prodigal
snakemake-wrapper-utils
MEGAHIT
Atlas
BBMap
Biopython
BioRuby
Bwa-mem2
cd-hit
CheckM
DAS
Diamond
eggNOG-mapper v2
MetaBAT 2
Minimap2
MMseqs
MultiQC
Pandas
Picard
pyfastx
SAMtools
SemiBin
Snakemake
SPAdes
SqueezeMeta
TADpole
VAMB
CONCOCT
ete3
gtdbtk
h5py
networkx
numpy
plotly
psutil
utils
metagenomics

RNA-seq workflow using STAR and DESeq2
This workflow performs a differential gene expression analysis with STAR and DESeq2. The usage of this workflow is described ...

This Snakemake pipeline implements the GATK best-practices workflow
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...