A Snakemake pipeline to analyse RNA-Seq reads
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
- Lack of a description for a new keyword DESeq2 .
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
RNA_seq_Snakemake
A snakemake pipeline for the analysis of RNA-seq data that makes use of hisat2 and DESeq2 .
Aim
To align, count, normalize counts and compute gene differential expressions between conditions using paired-end Illumina RNA-seq data.
Description
This pipeline analyses the raw RNA-seq data and produces a file containing normalized counts, differential expression, numbers of the clusters the genes have been assigned to and functions of transcripts. The raw fastq files will be trimmed for adaptors and quality checked with trimmomatic. Next, the necessary genome sequence fastas will be downloaded to be used for the mapping of the trimmed reads using hisat2. With stringtie and a downloaded reference annotation a new annotation will be created. This new annotation will be used to obtain the raw counts and do a local blast to a reference transcriptome fasta containing predicted functions. The counts are normalized and differential expressions are calculated using DESeq2. significantly expressed genes are used to create heatmaps and plots both in total and clustered. finally the DESeq2, blast and clustering data are combined to get the final results table.
Prerequisites: what you should be able to do before using this Snakemake pipeline
-
Some command of the Unix Shell to connect to a remote server where you will execute the pipeline (e.g. SURF Lisa Cluster). You can find a good tutorial from the Software Carpentry Foundation here and another one from Berlin Bioinformatics here .
-
Some command of the Unix Shell to transfer datasets to and from a remote server (to transfer sequencing files and retrieve the results/). The Berlin Bioinformatics Unix begginer guide available [here] should be sufficient for that (check the
wget
andscp
commands). -
An understanding of the steps of a canonical RNA-Seq analysis (trimming, alignment, etc.). You can find some info here .
Content
-
Snakefile
: a master file that contains the desired outputs and the rules to generate them from the input files. -
config.yaml
: the configuration files making the Snakefile adaptable to any input files, genome and parameter for the rules. -
data/
: a folder containing samples.txt (sample descriptions) and subsetted paired-end fastq files used to test locally the pipeline. Generated using Seqtk :seqtk sample -s100 {inputfile(can be gzipped)} 250000 > {output(always gunzipped)}
This folder should contain thefastq
of the paired-end RNA-seq data, you want to run. -
envs/
: a folder containing the environments needed for the conda package manager. If run with the--use-conda
command, Snakemake will install the necessary softwares and packages using the conda environment files. -
samples.tsv
: a file containing information about the names, the paths and the conditions of the samples used as input for the pipeline. This file has to be adapted to your sample names before running the pipeline .
Usage
Download or clone the Github repository
You will need a local copy of the
Snakemake_hisat-DESeq
on your machine. You can either:
-
use git in the shell:
git clone git@github.com:KoesGroup/Snakemake_hisat-DESeq.git
-
click on "Clone or download" and select
download
Installing and activating a virtual environment
First, you need to create an environment where
Snakemake
and the python
pandas
package will be installed. To do that, we will use the conda package manager.
-
Create a virtual environment named
rnaseq
using theglobal_env.yaml
file with the following command:conda env create --name rnaseq --file envs/global_env.yaml
Then, activate this virtual environment withsource activate rnaseq
The Snakefile will then take care of installing and loading the packages and softwares required by each step of the pipeline.
Configuration file
Make sure you have changed the parameters in the
config.yaml
file that specifies where to find the sample data file, the genomic and transcriptomic reference fasta files to use and the parameters for certains rules etc.
This file is used so the
Snakefile
does not need to be changed when locations or parameters need to be changed.
Snakemake execution
The Snakemake pipeline/workflow management system reads a master file (often called
Snakefile
) to list the steps to be executed and defining their order. It has many rich features. Read more
here
.
Dry run
From the folder containing the
Snakefile
, use the command
snakemake --use-conda -np
to perform a dry run that prints out the rules and commands.
Real run
Simply type
Snakemake --use-conda
and provide the number of cores with
--cores 10
for ten cores for instance.
For cluster execution, please refer to the
Snakemake reference
.
Main outputs
-
the RNA-Seq read alignment files *.bam
-
the fastqc report files *.html
-
the unscaled RNA-Seq read counts: counts.txt
-
the differential expression file results.tsv
-
the combined results file final.txt
Directed Acyclic Graph of jobs
Code Snippets
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 | library(DESeq2) library(optparse) # arguments to provide option_list = list( make_option(c("-c", "--counts"), type="character", default="results/counts.txt", help="counts tabulated file from Feature Counts", metavar="character"), make_option(c("-s", "--samplefile"), type="character", default="data/samples2.tsv", help="sample files used to get conditions for DESEq2 model fit", metavar="character"), make_option(c("-o", "--outdir"), type="character", default="results/deseqNew.csv", help="where to place differential expression files", metavar="character"), make_option(c("-f", "--helperfile"), type="character", default="results/helperfile.csv", help="helper file needed for the creation of the clustering and the heatmaps", metavar="character"), make_option(c("-m", "--maxfraction"), type="double", default=1.0, help="maximum fraction of the total number of genes to be allowed to be differential between two conditions to be included (number between 0 and 1)", metavar="double") ) # parse the command-line arguments and pass them to a list called 'opt' opt_parser = OptionParser(option_list=option_list); opt = parse_args(opt_parser); ######## Import and wrangle counts produced by FeatureCounts ####### countdata <- read.table(opt$counts, header=TRUE, skip=1, row.names=1,stringsAsFactors = F, check.names = F) countdata <- countdata[ ,6:ncol(countdata)] # Remove first five columns (chr, start, end, strand, length) # Remove .bam or .sam from filenames colnames(countdata) <- gsub("\\.[sb]am$", "", colnames(countdata)) # Convert to matrix countdata <- as.matrix(countdata) ######### Extract experimental conditions from the file specifying the samples samplefile = read.delim(file = opt$samplefile,header = T,stringsAsFactors = F) col.names = colnames(samplefile) # extract all column names columns.to.discard = c("sample","fq1","fq2") # column we don't need to extract all conditions colsForConditions = col.names[! col.names %in% columns.to.discard] # only keeping the column of interest # one condition if (length(colsForConditions) == 1){ condition <- factor(samplefile[,colsForConditions]) # two conditions } else if (length(colsForConditions == 2)){ # two conditions --> make a third column that is the product of the two samplefile$conditions = paste(samplefile[,colsForConditions[1]],samplefile[,colsForConditions[2]],sep = ".") condition <- factor(x = samplefile[,"conditions"],levels = unique(samplefile[,"conditions"])) } else if (length(conditions > 2)){ print("too many conditions to compare, skipping") } # create helper file for downstream use in the pipeline. helperFile <- as.data.frame(condition) rownames(helperFile) <- samplefile$sample colnames(helperFile) <- NULL print(helperFile) write.csv(helperFile, file = opt$helperfile, quote = F, col.names = F, row.names = T) # Analysis with DESeq2 ---------------------------------------------------- # Create a coldata frame and instantiate the DESeqDataSet. See ?DESeqDataSetFromMatrix coldata <- data.frame(row.names=colnames(countdata), condition) dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition) # Run the DESeq pipeline dds <- DESeq(dds) # create dataframe containing normalized counts, to which the differential expression values will be added resdata <- as.data.frame(counts(dds, normalized=TRUE)) ## function for Volcano plot with "significant" genes labeled volcanoplot <- function (res, lfcthresh=2, sigthresh=0.05, main="Volcano Plot", legendpos="bottomright", labelsig=TRUE, textcx=1, ...) { with(res, plot(log2FoldChange, -log10(padj), pch=20, main=main, ...)) with(subset(res, padj<sigthresh ), points(log2FoldChange, -log10(padj), pch=20, col="red", ...)) with(subset(res, abs(log2FoldChange)>lfcthresh), points(log2FoldChange, -log10(padj), pch=20, col="orange", ...)) with(subset(res, padj<sigthresh & abs(log2FoldChange)>lfcthresh), points(log2FoldChange, -log10(padj), pch=20, col="green", ...)) if (labelsig) { require(calibrate) with(subset(res, padj<sigthresh & abs(log2FoldChange)>lfcthresh), textxy(log2FoldChange, -log10(padj), labs=colnames(res), cex=textcx, ...)) } legend(legendpos, xjust=1, yjust=1, legend=c(paste("FDR<",sigthresh,sep=""), paste("|LogFC|>",lfcthresh,sep=""), "both"), pch=20, col=c("red","orange","green")) } # open file that will be containing a vulcano plot for each of the combinations of conditions #pdf(file="vulcanoplots.pdf") # iterate through the list of conditions to create differential expression (DE) values for all possible pairs x <- 1 for(i in levels(condition)){ x <- x + 1 if (x <= length(levels(condition))){ for(j in x:length(levels(condition))){ res <- results(dds, contrast=c("condition", i, levels(condition)[j])) # i = first in pair, levels(condition)[j] is the second in pair. d <- paste(i, levels(condition)[j], sep="&") # paste the two conditions in one character, to be used as the pair name resP <- as.data.frame(table(res$padj<0.05)) # get number of DE values with P-value < 0.05 if(resP[1,2]< opt$maxfraction*nrow(resdata)){ # only continue with the pair if it is less then the maximum fraction(set by user in commandline) differentially expressed #volcanoplot(res, lfcthresh=1, sigthresh=0.05, textcx=.8, xlim=c(-10, 10), main=d) print(c(d,"Number of differentials is within accepted limit")) colnames(res) = paste(d,c(colnames(res)),sep = "-") # paste the pair name to the column name resdata <- merge(as.data.frame(resdata), as.data.frame(res), by="row.names", sort=FALSE) # merge the DE values to the matrix rownames(resdata) <- resdata$Row.names # redifine rownames as they have disapeared?? in the merging?? resdata$Row.names <- NULL # delete column containing the rownames, to avoid the same colname in the next iteration } else{ print(c(d,"More differentials then allowed")) } } } } #dev.off() # create first column containing the genenames. resdata$genes <- rownames(resdata) resdata <- merge(as.data.frame(resdata["genes"]), as.data.frame(resdata[,2:ncol(resdata)-1]), by="row.names", sort=FALSE) resdata$Row.names <- NULL # write the data to a file. write.table(resdata, file=opt$outdir,sep = "\t",quote=F,row.names=F) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 | from optparse import OptionParser import glob import os # get the command line arguments parser = OptionParser(description="Script that combines the normalized and differential expression data as outputted by DESeq2.R, the clusters as outputted by plotscript.R and the results from the blast against the reference proteome. Resulting in a combined tab-delimeted data file") parser.add_option('-r', '--result_file', type=str, default="results/result.csv", metavar="", help = "path and name of of the input file, being the output file of the R-script DESeq2.R, default = results/result.csv") parser.add_option('-a', '--anno_file', type=str, default="annotations/Petunia_axillaris_v1.6.2_proteins.mapman_annotation.txt", metavar="", help = "path and name of the stringtie transcriptome fasta file, a file containing the fastas of the de novo generated stringtie transcriptome, default = result/helperFile.csv") parser.add_option('-c', '--cluster_file', type=str, default="results/clusters.txt", metavar="", help = "path and name of the clusters file, a tab delimited file containing cluster numbers and depending on the method of clustering correlation values or membership values., default = result/helperFile.csv") parser.add_option('-o', '--output_file', type=str, default="results/final.txt", metavar="", help = "path and name of the output file. A tab delimimited file containing normalised reads, differential expressions and (adjusted)p-values, number of clusters the genes partisioned to and the hypothetical function as found by a blast. default = result/final.txt") parser.add_option('-p', '--working_dir', type=str, default="temp/mapped/", metavar="", help = "path to bam files, to be removed from the sample names in the header of the final output file. default = temp/mapped/") (options, args) = parser.parse_args() clusts = open(options.cluster_file, "r") clusters = {} for line in clusts: line = line.rstrip() line = line.split("\t") if len(line) == 2: line.append("NaN") clusters[line[0].lower()] = "\t".join(line[1:]) clusts.close() # if there is an annotation file available add annotaions. if options.anno_file in glob.glob(options.anno_file): if glob.glob(options.anno_file)[0].endswith(".gz"): os.system(f"gunzip {options.anno_file}") annoFile = ".".join(options.anno_file.split(".")[:-1]) else: annoFile = options.anno_file fa = open(annoFile, "r") annos = {} count = 0 for l in fa: l = l.split("\t") l = [x.strip("'") for x in l] if len(l)>2: if len(l[2])>1: if l[2].lower() not in annos: annos[l[2].lower()] = ".".join(l[1].split(".")[:3]).replace(" ", "_") else: annos[l[2].lower()] += (";" + ".".join(l[1].split(".")[:3]).replace(" ", "_")) fa.close() inFile = open(options.result_file, "r") uitFile = open(options.output_file, "w") for l in inFile: l = l.rstrip().split("\t") if "genes" in l[0]: l = [x.replace(options.working_dir, "") for x in l] # remove path from sample names l.append("\t".join([clusters["gene"], "annotation"])) elif l[0].lower() in clusters: if l[0].lower() in annos: l.append(clusters[l[0].lower()] + "\t" + annos[l[0].lower()]) else: l.append(clusters[l[0].lower()] + "\tno annotation available.") elif l[0].lower() in annos: l.append("NaN\tNaN\t" + annos[l[0].lower()]) else: l.append("NaN\tNaN\tno annotation available.") uitFile.write("\t".join(l)+"\n") inFile.close() uitFile.close() else: inFile = open(options.result_file, "r") uitFile = open(options.output_file, "w") for l in inFile: l = l.rstrip().split("\t") if "genes" in l[0]: l = [x.replace(options.working_dir, "") for x in l] # remove path from sample names l.append("\t".join([clusters["gene"]])) elif l[0].lower() in clusters: l.append(clusters[l[0].lower()]) else: l.append("NaN\tNaN") uitFile.write("\t".join(l)+"\n") inFile.close() uitFile.close() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 | import numpy as np from optparse import OptionParser # get the command line arguments parser = OptionParser(description="python script to create pvalue and log2(foldchange) filtered ") parser.add_option('-i', '--input_file', type=str, default="results/plotSelection.txt", metavar="", help = "path and name of of the input file, being the output file of the R-script DESeq2.R, default = results/plotSelection.txt") parser.add_option('-f', '--helper_file', type=str, default="result/helperFile.csv", metavar="", help = "path and name of the helper file, a tab delimited file containing one column samples and a column conditions, default = result/helperFile.csv") parser.add_option('-o', '--output_file', type=str, default="result/plotSelection.txt", metavar="", help = "path and name of the output file a tab delimimited file containing normalised reads of significantly differentially expressed genes of all the samples or averaged to the conditions, default = result/plotSelection.txt") parser.add_option('-v', '--minimum_foldchange', type=float, default=2, metavar="", help = "minimum log2(fold_change) in any combination of conditions; integer > 1, default = 2") parser.add_option('-r', '--minimum_reads', type=int, default=100, metavar="", help = "minimum number of reads of all samples together; integer >= 0, default = 100") parser.add_option('-p', '--maximum_pvalue', type=float, default=0.05, metavar="", help = "maximum exepted adjusted pvalue; 0.0 to 1.0, default = 0.05") parser.add_option('-a', '--average_samples', type=str, default="yes", metavar="", help = "output needs to contain averages of conditions; yes or no, default = yes") (options, args) = parser.parse_args() # get a list of the conditions helper = open(options.helper_file, "r") samples = {} conditions = [] x=0 for l in helper: l = l.rstrip() if len(l) > 0: x += 1 l = l.split(",") name = l[1] samples[x] = name if name not in conditions: conditions.append(name) helper.close() # function to average the samples within a condition def getAverages(counts): global samples, conditions sets = {} averages = [counts[0]] for i in samples: if samples[i] in sets: sets[samples[i]].append(float(counts[i])) else: sets[samples[i]] = [float(counts[i])] for c in conditions: averages.append(str(np.mean(sets[c]))) return(averages) inputData = open(options.input_file, "r") outAll = open(options.output_file, "w") totalBoth = 0 count = 0 min_reads = options.minimum_reads foldchange= options.minimum_foldchange pvalue = options.maximum_pvalue avarage = options.average_samples.upper() for l in inputData: count += 1 if l.startswith("genes") == False and len(l) > 5: l = l.replace("NA", "1") # replace NA values in P-value to 1 l = l.replace("#VALUE!", "0") # replace NA values in foldchange to 0 f = l.rstrip() # remove white space and line break at the end. f = f.split("\t") # split string to list on tab if len(f) > x+1: fc = [float(f[i]) for i in range(x+2, len(f), 6)] # list of faultchanges pv = [float(f[i]) for i in range(x+6, len(f), 6)] # list of adjusted p-values counts = [float(f[i]) for i in range(1,x+1)] if sum(counts) > min_reads: # total number of reads more then ~20 per sample for a1, a2 in zip(fc,pv): if a1 > foldchange and a2 < pvalue and a2 == min(pv): if avarage == "YES": line = "\t".join(getAverages(f[:x+1])) else: line = "\t".join(f[:x+1]) outAll.write(line+"\n") totalBoth += 1 break elif a1 < -1*foldchange and a2 < pvalue and a2 == min(pv): if avarage == "YES": line = "\t".join(getAverages(f[:x+1])) else: line = "\t".join(f[:x+1]) outAll.write(line+"\n") totalBoth += 1 break else: print("No fold changes and p-values were found. Try rerunning the DESeq2.R with a higher maximum fraction (to be adjusted in config.yaml)") else: if len(l) > 10: if avarage == "YES": line = "\t".join(conditions) else: line = "\t".join(l.split("\t")[1:x+1]) outAll.write(l.split("\t")[0] + "\t" + line + "\n") print(f"Total number of genes is: {str(count-1)}.") print(f"Number of genes kept for plots is: {str(totalBoth)}.") inputData.close() outAll.close() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 | library(optparse) option_list = list( make_option(c("-i", "--input_file"), type="character", default="results/counts.txt", help="normalized and filtered counts tabulated file from filter_for_plots.py", metavar="character"), make_option(c("-m", "--method_of_clustering"), type="character", default="hierarchical", help="method of clustering chose between: hierarchical, kmean or fuzzy_kmean default is hierarchical", metavar="character"), make_option(c("-n", "--opt_clust_number"), type="character", default="dynamic", help="determination of optimal number of clusters, If --method_of_clustering=hierarchical the options are: fixed, height or dynamic. If--method_of_clustering=kmean or --method_of_clustering=fuzzy_kmean the options are: fixed, average_silhouette_width, calinsky_criterion, gap_statistic or affinity_propogation. default is dynamic.", metavar="character"), make_option(c("-k", "--number_of_clusters"), type="integer", default=5, help="number of clusters to be used. choose an integer of two or more. To be used if --det_hclust_number=fixed or --det_kclust_number=fixed. default is 5", metavar="integer"), make_option(c("-H", "--height_in_dendrogram"), type="double", default=1.5, help="height in dendrogram to use for determination of cluster number. default is 1.5", metavar="double"), make_option(c("-q", "--membership_min"), type="double", default=0.2, help="minimal membership value of fuzzy kmean cluster to be included in the cluster. should be a value between 0.0 and 1.0. Only to be used in combination with -m = fuzzy_kmean. Best to first run with a low value, and adjust based on results if needed. default is o.2", metavar="double"), make_option(c("-r", "--correlation_min"), type="double", default=0.5, help="minimal correlation value to a kmean cluster to be included in the cluster, if too low it will be excluded from all clusters!. should be a value between 0.0 and 1.0. Only to be used in combination with -m = kmean or -m = pam. Best to first run with a low value, and adjust based on results if needed. default is 0.5", metavar="double"), make_option(c("-c", "--colour_of_heatmaps"), type="character", default=c("white","green","green4","violet","purple"), help="colors of the heatmaps. needs to be a list of two or more colors. To see a list of all 657 colour names in R: colors() default is the colourblind friendly: c(\"white\",\"green\",\"green4\",\"violet\",\"purple\")", metavar="character"), make_option(c("-p", "--plots_output_file"), type="character", default="results/plots.pdf", help="name of multipage pdf file where to output all the plots.", metavar="character"), make_option(c("-o", "--clusters_output_file"), type="character", default="results/plots.pdf", help="name of multipage pdf file where to output all the plots.", metavar="character") ) # parse the command-line arguments and pass them to a list called 'opt' opt_parser = OptionParser(option_list=option_list , add_help_option = TRUE, description = "\nThis script produces multiple plots and clusters from a RNAseq output from the filter_for_plots.py. plots include dendrograms of the samples and the genes, a heatmap, an elbow plot, a heatmap with clusters specicied by colorbar, and finally plots and heatmaps of the different clusters. Because the 'correct' method for clustering RNAseq data is a matter of perspective; it is the one that allows the researcher to make the most out of her data. Gene expression data is also full of noise which can make clustering tricky when using algorithms optimised for chunkier data. With that in mind it's good to try several methods and compare them. So for that multiple ways of clusting can be choosen also the number of clusters can be manually choosen or calculated by a number of given methods. When making use of the snakemake pipeline, it is essential to remove or rename the output file (plots.pdf), change the arguments you want to change in the config.yaml, and rerun the pipeline with 'snakemake --use-conda' (to be sure only the last script reruns, start of with 'snakemake -np').", epilogue = "Pease feel free to mail me, bliek@uva.nl, if you would like more info or have suggestions for optimization.\n\n"); opt = parse_args(opt_parser); i <- opt$input_file m <- opt$method_of_clustering n <- opt$opt_clust_number k <- opt$number_of_clusters h <- opt$height_in_dendrogram c <- opt$colour_of_heatmaps q <- opt$membership_min r <- opt$correlation_min p <- opt$plots_output_file o <- opt$clusters_output_file ###################################################################### ############## import needed packages ############### ###################################################################### library(gplots) library(dendextend) library(dynamicTreeCut) #library("pheatmap") library(cluster) library(vegan) library(apcluster) #library(ggplot2) library(reshape2) library(RColorBrewer) library(dplyr) library(colorRamps) library(e1071) library(tidyr) library(ggplot2) ###################################################################### ############### read the data ################# ###################################################################### data <- read.delim(i, header=T, row.names="genes") z <- as.matrix((data)) # scaling the data to avoid clustering based on expression level scaledata <- t(scale(t(z))) # Centers and scales data. scaledata <- scaledata[complete.cases(scaledata),] # open pdf for the plots pdf(file=p) #mean/variance calculations and plot z_var <- apply(z, 1, var) z_mean <- apply(z, 1, mean) plot(log2(z_mean), log2(z_var), pch='.') ############################################################## ############ hierarchical dendrogram ################# ############################################################## # dendrogram of the samples hc <- hclust(as.dist(1-cor(scaledata, method="spearman")), method="complete") # Clusters columns by Spearman correlation. TreeC = as.dendrogram(hc, method="average") plot(TreeC, main = "Sample Clustering", ylab = "Height") # dendrogram of the genes hr <- hclust(as.dist(1-cor(t(scaledata), method="pearson")), method="complete") # Cluster rows by Pearson correlation. TreeR = as.dendrogram(hr, method="average") plot(TreeR, leaflab = "none", main = "Gene Clustering", ylab = "Height") ############################################################# ############## create Heatmap ################## ############################################################# # colors of the heatmap my_palette <- colorRampPalette(c)(100) #(c("magenta", "black", "green"))(n = 299) # plot the heatmap heatmap.2(z, Rowv=as.dendrogram(hr), Colv=NA, col=my_palette, scale="row", margins = c(7, 7), cexCol = 0.7, labRow = F, dendrogram = "row", main = "Heatmap.2", trace = "none") ################################################################# ####### Elbow plot ( the sum of squared error (SSE)) ####### ################################################################# wss <- (nrow(scaledata)-1)*sum(apply(scaledata,2,var)) for (i in 2:20) wss[i] <- sum(kmeans(scaledata, centers=i)$withinss) plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares") ############################################################## ############## Do the clustering ############### ############################################################## if(m == "kmean" || m == "fuzzy_kmean" || m == "pam"){ if(n == "average_silhouette_width"){ # method 1: average silhouette width sil <- rep(0, 20) #repeat k-means for 1:20 and extract silhouette: for(i in 2:20){ k1to20 <- kmeans(scaledata, centers = i, nstart = 25, iter.max = 20) ss <- silhouette(k1to20$cluster, dist(scaledata)) sil[i] <- mean(ss[, 3]) } # Plot the average silhouette width plot(1:20, sil, type = "b", pch = 19, xlab = "Number of clusters k", ylab="Average silhouette width") abline(v = which.max(sil), lty = 2) k <- which.max(sil) } else if(n == "calinsky_criterion"){ # method 2: Calinsky criterion fit <- cascadeKM(scaledata, 1, 20, iter = 100) plot(fit, sortg = TRUE, grpmts.plot = TRUE) calinski.best <- as.numeric(which.max(fit$results[2,])) cat("Calinski criterion optimal number of clusters:", calinski.best, "\n") k <- calinski.best } else if(n == "gap_statistic"){ # method 3: Gap statistic set.seed(13) gap <- clusGap(scaledata, kmeans, 20, B = 100, verbose = interactive()) plot(gap, main = "Gap statistic") abline(v=which.max(gap$Tab[,3]), lty = 2) k <- which.max(gap$Tab[,3]) } else if(n == "affinity_propogation"){ # method 4: Affinity propogation d.apclus <- apcluster(negDistMat(r=2), scaledata) cat("affinity propogation optimal number of clusters:", length(d.apclus@clusters), "\n") #uncomment the next line for the heatmap, it takes a long time to run #heatmap(d.apclus,cexRow=0, cexCol=0) k <- length(d.apclus@clusters) } else if(n == "fixed"){ k <- k }else{ print("your choosen method of determining the optimal number of clusters is unclear.") print("When using --method_of_clustering=kmean or --method_of_clustering=fuzzy_kmean ") print("the options are:fixed, average_silhouette_width, calinsky_criterion,") print("gap_statistic or affinity_propogation.") } print(paste0("Dataset is clustered in ",k, " clusters ")) if(m == "kmean"){ set.seed(20) clusts <- kmeans(scaledata, centers=k, nstart = 1000, iter.max = 20) clusts <- clusts$cluster } else if(m == "fuzzy_kmean"){ mestimate<- function(df){ N <- dim(df)[[1]] D <- dim(df)[[2]] m.sj <- 1 + (1418/N + 22.05)*D^(-2) + (12.33/N +0.243)*D^(-0.0406*log(N) - 0.1134) return(m.sj) } mes <- mestimate(scaledata) fcm_results <- cmeans(scaledata,centers=k,m=mes) fcm_plotting_df <- data.frame(scaledata) fcm_plotting_df$gene <- row.names(fcm_plotting_df) fcm_plotting_df$cluster <- fcm_results$cluster fcm_plotting_df$membership <- sapply(1:length(fcm_plotting_df$cluster),function(row){ clust <- fcm_plotting_df$cluster[row] fcm_results$membership[row,clust] }) clusts <- fcm_plotting_df$cluster names(clusts) <- fcm_plotting_df$gene # get the core data fcm_centroids <- fcm_results$centers fcm_centroids_df <- data.frame(fcm_centroids) fcm_centroids_df$cluster <- row.names(fcm_centroids_df) centroids_long <- tidyr::gather(fcm_centroids_df,"sample",'value', 1:ncol(scaledata)) #start with the input data fcm_plotting_df <- data.frame(scaledata) #add genes fcm_plotting_df$gene <- row.names(fcm_plotting_df) #bind cluster assinment fcm_plotting_df$cluster <- fcm_results$cluster #fetch the membership for each gene/top scoring cluster fcm_plotting_df$membership <- sapply(1:length(fcm_plotting_df$cluster),function(row){ clust <- fcm_plotting_df$cluster[row] fcm_results$membership[row,clust] }) # filter out genes don't really belong to any of the clusters selection <- fcm_plotting_df %>% filter(membership > q) } else if(m == "pam"){ pam.Pclusts <- pam(scaledata, k=k) clusts <- pam.Pclusts$'clustering' } }else if(opt$method_of_clustering == "hierarchical"){ if(n == "dynamic"){ clusts <- cutreeDynamic(hr, distM = as.matrix(as.dist(1-cor(t(scaledata)))), method = "hybrid") names(clusts) <- rownames(scaledata) clust <- as.data.frame(clusts) colnames(clust) <- "cluster" k = length(unique(clusts)) print(paste0("Dataset is clustered in ",k, " clusters ")) } else if(n == "fixed"){ clusts = cutree(hr, k=k) k = length(unique(clusts)) print(paste0("Dataset is clustered in ",k, " clusters ")) } else if(n == "height"){ clusts = cutree(hr, h=h) k = length(unique(clusts)) print(paste0("Dataset is clustered in ",k, " clusters ")) } else{ print("your choosen method of determining the optimal number of clusters is unclear.\ When using --method_of_clustering=hierarchical the options are: fixed, height or dynamic.") } }else{ print("your choosen method of clustering is unclear. It should be one of the following: hierarchical, kmean or fuzzy_kmean") } ################################################################### ############ Heatmap with clusterbar ############## ################################################################### mycolhc <- rainbow(length(unique(clusts)), start=0.1, end=0.9) mycolhc <- mycolhc[as.vector(clusts)] hr <- hclust(as.dist(1-cor(t(scaledata), method="pearson")), method="complete") heatmap.2(z, Rowv=as.dendrogram(hr), Colv=NA, col=my_palette, scale="row", margins = c(7, 7), cexCol = 0.7, labRow = F, dendrogram = "row", main = paste0("Heatmap.2 with colour bar indicating the clusters (",k,")"), trace = "none", RowSideColors=mycolhc, key = FALSE) ############################################################################### ######## calculate the cluster 'cores' aka centroids and plot them ######## ############################################################################### # get the cluster centroids (the average profile of expression for each of the clusters) clust.centroid = function(i, dat, clusters) { ind = (clusters == i) colMeans(dat[ind,]) } clusUniq <- unique(clusts) kClustcentroids <- sapply(levels(factor(clusts)), clust.centroid, scaledata, clusts) head(kClustcentroids) Kmolten <- melt(kClustcentroids) colnames(Kmolten) <- c('sample','cluster','value') # In case of use of pam, the cluster centroids wil be the medoids (the expression profiles of the genes that fit the clusters best) if (m == "pam"){ pClustcentroids <- t(pam.Pclusts$"medoids") Kmolten <- melt(pClustcentroids) colnames(Kmolten) <- c('sample','cluster','value') } p1 <- ggplot(Kmolten, aes(x=sample,y=value, group=cluster, colour=as.factor(cluster))) + geom_point() + geom_line() + xlab("Time") + ylab("Expression") + labs(title= "Cluster Expression of the samples",color = "Cluster") p1 # in case of pam, change names of the clusters to numbers in stead of the medoids gene names if (m == "pam"){ colnames(pClustcentroids) <- clusUniq Kmolten <- melt(pClustcentroids) colnames(Kmolten) <- c('sample','cluster','value') } ########################################################################### ############# heatmaps and plots of the clusters ################ ########################################################################### # open list to collect the correlation values in case of kmean, fuzzy_kmean and pam cors <- c() # loop though the clusters and create expression profile plots and heatmap for each of them if(m == "fuzzy_kmean"){ for(i in 1:k){ print(i) cluster_plot_df <- dplyr::filter(selection, cluster == i) %>% dplyr::select(.,1:ncol(scaledata),membership,gene) %>% tidyr::gather(.,"sample",'value',1:ncol(scaledata)) #order the dataframe by score cluster_plot_df <- cluster_plot_df[order(cluster_plot_df$membership),] #set the order by setting the factors using forcats cluster_plot_df$gene = forcats::fct_inorder(cluster_plot_df$gene) #subset the cores by cluster core <- dplyr::filter(centroids_long, cluster == i) g1 <- ggplot(cluster_plot_df, aes(x=sample,y=value)) + geom_line(aes(colour=membership, group=gene)) + scale_colour_gradientn(colours=c('blue1','red2')) + #this adds the core geom_line(data=core, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) + geom_point(data=core, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) + xlab("Sample") + ylab("Expression") + labs(title= paste0("Cluster ",i," Expression of the Samples"),color = "Score") print(g1) group <- selection %>% filter(selection$cluster==i) rownames(group) <- group$gene group <- as.matrix(group[,1:ncol(scaledata)]) if(nrow(group) > 1){ hrc <- hclust(as.dist(1-cor(t(group), method="pearson")), method="complete") #pheatmap(group, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(breaksList)), breaks = breaksList, cluster_cols = F) #cellheight = 8 heatmap.2(group, Rowv=as.dendrogram(hrc), Colv=NA, col= magenta2green(100), labRow=rownames(group), scale="row", margins = c(7, 7), cexCol = 0.7, dendrogram = "row", main = paste0("Cluster ",i, " consisting ", nrow(group), " genes"), trace = "none",key = F) } } }else if(m == "kmean" || m == "hierarchical" || m == "pam"){ for(i in clusUniq){ K2 <- (scaledata[clusts==i,]) if(nrow(K2) > 1){ #calculate the correlation with the core core <- Kmolten[Kmolten$cluster==i,] corscore <- function(x){cor(x,core$value)} score <- apply(K2, 1, corscore) # add the correlations to the list cors <- c(cors, score) #get the data frame into long format for plotting K2molten <- melt(K2) colnames(K2molten) <- c('gene','sample','value') #add the score K2molten <- merge(K2molten,score, by.x='gene',by.y='row.names', all.x=T) colnames(K2molten) <- c('gene','sample','value','score') #order the dataframe by score #to do this first create an ordering factor K2molten$order_factor <- 1:length(K2molten$gene) #order the dataframe by score K2molten <- K2molten[order(K2molten$score),] #set the order by setting the factors K2molten$order_factor <- factor(K2molten$order_factor , levels = K2molten$order_factor) # Everything on the same plot if(m == "kmean" || m =="pam"){ p2 <- ggplot(K2molten, aes(x=sample,y=value)) + geom_line(aes(colour=score, group=gene)) + scale_colour_gradientn(colours=c('blue1','red2')) + #this adds the core geom_line(data=core, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) + geom_point(data=core, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) + xlab("Samples") + ylab("Expression") + labs(title= paste0("Cluster ",i, " consisting ", nrow(K2), " genes"),color = "Score") } else if(m == "hierarchical"){ p2 <- ggplot(K2molten, aes(x=sample,y=value)) + geom_line(color="grey", aes(color="grey", group=gene)) + #this adds the core geom_line(data=core, aes(sample,value, group=cluster), color="blue",inherit.aes=FALSE) + geom_point(data=core, aes(sample,value, group=cluster), color="blue",inherit.aes=FALSE) + xlab("Samples") + ylab("Expression") + labs(title= paste0("Cluster ",i, " consisting ", nrow(K2), " genes"),color = "Score") } print(p2) hrc <- hclust(as.dist(1-cor(t(K2), method="pearson")), method="complete") #pheatmap(group, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(breaksList)), breaks = breaksList, cluster_cols = F) #cellheight = 8 heatmap.2(K2, Rowv=as.dendrogram(hrc), Colv=NA, col= my_palette, labRow=rownames(K2), scale="row", margins = c(7, 7), cexCol = 0.7, dendrogram = "row", main = paste0("Cluster ",i, " consisting ", nrow(K2), " genes"), trace = "none",key = F) } } } dev.off() ######################################################################################### ############# create output file containing the clusters ################# ######################################################################################### # create data frame containing cluster and if available correlation values or memberships if(m == "fuzzy_kmean"){ # fuzzy_kmeans lijst <- data.frame(selection$gene, selection$cluster, selection$membership ) colnames(lijst) <- c("gene", "clusters", "membership") }else if(m == "kmean" || m == "pam"){ # kmean and pam lijst <- merge(as.data.frame(clusts), as.data.frame(cors), by="row.names", sort=FALSE) colnames(lijst) <- c("gene", "clusters", "correlation") lijst <- lijst %>% filter(correlation > r) }else{ # hierarchical lijst <- data.frame(rownames(as.data.frame(clusts)), clusts) colnames(lijst) <- c("gene", "clusters") } # Write the dataframe as a tab-delimited file write.table(lijst, file=o, sep = "\t",quote=F,row.names=F) |
97 98 | shell: "wget -O {output} {genome_url}" |
107 108 | shell: "wget -O {output} {transcriptome_gtf_url}" |
129 130 131 132 133 134 135 136 137 138 139 140 141 | run: if sample_is_single_end(params.sampleName): shell("fastp --thread {threads} --html {output.html} \ --qualified_quality_phred {params.qualified_quality_phred} \ --in1 {input} --out1 {output} \ 2> {log}; \ touch {output.fq2}") else: shell("fastp --thread {threads} --html {output.html} \ --qualified_quality_phred {params.qualified_quality_phred} \ --detect_adapter_for_pe \ --in1 {input[0]} --in2 {input[1]} --out1 {output.fq1} --out2 {output.fq2}; \ 2> {log}") |
156 157 | shell: "hisat2-build -p {threads} {input} {params} --quiet" |
175 176 177 178 179 180 181 | run: if sample_is_single_end(params.sampleName): shell("hisat2 -p {threads} --summary-file {output.sum} --met-file {output.met} -x {params.indexName} \ -U {input} | samtools view -Sb -F 4 -o {output.bams}") else: shell("hisat2 -p {threads} --summary-file {output.sum} --met-file {output.met} -x {params.indexName} \ -1 {input[0]} -2 {input[1]} | samtools view -Sb -F 4 -o {output.bams}") |
197 198 | shell: "featureCounts -O -t mRNA -g ID -F 'gtf' -a {input.gff} -o {output} {input.bams}" |
218 219 | shell: "Rscript scripts/DESeq2.R -c {input.counts} -s {input.samplefile} -o {output.result} -m {params.maxfraction} -f {output.helper}" |
232 233 234 235 236 237 238 | shell: "python scripts/DE_with_Function.py " "-a {params.annos} " "-c {input.clusts} " "-r {input.deseq} " "-o {output.final} " "-p {params.path}" |
256 257 258 259 260 261 262 263 264 | shell: "python scripts/filterForPlots.py " "-i {input.result} " "-f {input.helper} " "-o {output} " "-v {params.minimum_foldchange} " "-r {params.minimum_reads} " "-p {params.maximum_pvalue} " "-a {params.average_samples}" |
281 282 283 284 285 286 287 288 | shell: "Rscript scripts/plotscript.R " "-i {input} " "-m {params.method_of_clustering} " "-n {params.opt_clust_number} " "-k {params.number_of_clusters} " "-H {params.height_in_dendrogram} " "-q {params.membership_min} " |
Support
- Future updates
Related Workflows





