A single cell RNA-seq workflow, including highly variable gene analysis, cell type assignment and differential expression analysis.
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
A single cell RNA-seq workflow following Lun, McCarthy and Marioni 2016 and Soneson and Robinson 2018 , with added more recent functionality.
Authors
- Johannes Köster, https://koesterlab.github.io
Usage
In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if already available, its DOI (see above).
Step 1: Obtain a copy of this workflow
-
Create a new github repository using this workflow as a template .
-
Clone the newly created repository to your local system, into the place where you want to perform the data analysis.
Step 2: Configure workflow
Configure the workflow according to your needs via editing the file
config.yaml
.
Step 3: Execute workflow
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using
$N
cores or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
If you not only want to fix the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above. See the Snakemake documentation for further details.
Step 4: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:
snakemake --report report.html
This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here .
Step 5: Commit changes
Whenever you change something, don't forget to commit the changes back to your github copy of the repository:
git commit -a
git push
Step 6: Obtain updates from upstream
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.
-
Once, register the upstream repository in your local copy:
git remote add -f upstream git@github.com:snakemake-workflows/single-cell-rna-seq.git
orgit remote add -f upstream https://github.com/snakemake-workflows/single-cell-rna-seq.git
if you do not have setup ssh keys. -
Update the upstream version:
git fetch upstream
. -
Create a diff with the current version:
git diff HEAD upstream/master workflow > upstream-changes.diff
. -
Investigate the changes:
vim upstream-changes.diff
. -
Apply the modified diff via:
git apply upstream-changes.diff
. -
Carefully check whether you need to update the config files:
git diff HEAD upstream/master config
. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
Step 7: Contribute back
In case you have also changed or added steps, please consider contributing them back to the original repository:
-
Fork the original repo to a personal or lab account.
-
Clone the fork to your local system, to a different place than where you ran your analysis.
-
Copy the modified files from your analysis to the clone of your fork, e.g.,
cp -r workflow path/to/fork
. Make sure to not accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary. -
Commit and push your changes to your fork.
-
Create a pull request against the original repository.
Testing
Test cases are in the subfolder
.test
. They are automtically executed via continuous integration with Travis CI.
Code Snippets
18 19 | script: "../scripts/cell-cycle.R" |
36 37 | script: "../scripts/cell-cycle-scores.R" |
25 26 | script: "../scripts/cellassign.R" |
42 43 | script: "../scripts/plot-cellassign.R" |
67 68 | script: "../scripts/celltype-tsne.R" |
89 90 | script: "../scripts/plot-gene-expression.R" |
18 19 | script: "../scripts/load-counts.R" |
35 36 | script: "../scripts/diffexp.R" |
55 56 | script: "../scripts/plot-differential-expression.R" |
21 22 | script: "../scripts/filter-cells.R" |
23 24 | script: "../scripts/normalize.R" |
40 41 | script: "../scripts/batch-effect-removal.R" |
35 36 | script: "../scripts/qc.R" |
53 54 | script: "../scripts/explained-variance.R" |
77 78 | script: "../scripts/plot-gene-gene-expression.R" |
96 97 | script: "../scripts/gene-tsne.R" |
38 39 | script: "../scripts/hvg.R" |
69 70 | script: "../scripts/hvg-correlation.R" |
89 90 | script: "../scripts/hvg-pca.R" |
111 112 | script: "../scripts/hvg-tsne.R" |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) library(limma) sce <- readRDS(snakemake@input[["sce"]]) cycle.assignments <- readRDS(snakemake@input[["cycles"]]) colData(sce)$G1 <- cycle.assignments$scores$G1 colData(sce)$G2M <- cycle.assignments$scores$G2M # setup design matrix model <- snakemake@params[["model"]] design <- model.matrix(as.formula(model), data=colData(sce)) # store design matrix saveRDS(design, file=snakemake@output[["design_matrix"]]) # remove batch effects based on variables batch.removed <- removeBatchEffect(logcounts(sce), covariates=design) logcounts(sce) <- batch.removed saveRDS(sce, file=snakemake@output[["sce"]]) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 | Sys.setenv(MKL_THREADING_LAYER = "GNU") log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(SingleCellExperiment) library(tensorflow) library(cellassign) library(ComplexHeatmap) library(viridis) library(ggsci) is.float <- function(x) { (typeof(x) == "double") && (x %% 1 != 0) } sce <- readRDS(snakemake@input[["sce"]]) parent <- snakemake@wildcards[["parent"]] # get parent fit and filter sce to those cells parent_fit <- snakemake@input[["fit"]] if(length(parent_fit) > 0) { parent_fit <- readRDS(parent_fit)$cell_type is_parent_type <- rownames(parent_fit)[parent_fit$cell_type == parent] sce <- sce[, is_parent_type] } markers <- read.table(snakemake@input[["markers"]], row.names = NULL, header = TRUE, sep="\t", stringsAsFactors = FALSE, na.strings = "") markers[is.na(markers$parent), "parent"] <- "root" markers <- markers[markers$parent == parent, ] # convert markers into something cellAssign understands trim <- function (x) gsub("^\\s+|\\s+$", "", x) get_genes <- function (x) { if(is.na(x)) { vector() } else { sapply(strsplit(x, ","), trim) } } genes <- vector() for(g in markers$genes) { genes <- c(genes, get_genes(g)) } genes <- sort(unique(genes)) if(length(genes) < 1) { stop("Markers have to contain at least two different genes in union.") } marker_mat <- matrix(0, nrow = nrow(markers), ncol = length(genes)) colnames(marker_mat) <- genes rownames(marker_mat) <- markers$name for(i in 1:nrow(markers)) { cell_type <- markers[i, "name"] marker_mat[cell_type, ] <- genes %in% get_genes(markers[i, "genes"]) } marker_mat <- t(marker_mat) marker_mat <- marker_mat[rownames(marker_mat) %in% rownames(sce),, drop=FALSE] # apply cellAssign sce <- sce[rownames(marker_mat), ] # remove genes with 0 counts in all cells and cells with 0 counts in all genes sce <- sce[rowSums(counts(sce)) != 0, colSums(counts(sce)) != 0] # obtain batch effect model model <- readRDS(snakemake@input[["design_matrix"]]) # constrain to selected cells and remove intercept (not allowed for cellassign) model <- model[colnames(sce), colnames(model) != "(Intercept)"] # normalize float columns (as recommended in cellassign manual) float_cols <- apply(model, 2, is.float) model[, float_cols] <- apply(model[, float_cols], 2, scale) if(nrow(sce) == 0) { stop("Markers do not match any gene names in the count matrix.") } # fit fit <- cellassign(exprs_obj = sce, marker_gene_info = marker_mat, s = sizeFactors(sce), learning_rate = 1e-2, B = 20, shrinkage = TRUE, X = model) # add cell names to results cells <- colnames(sce) rownames(fit$mle_params$gamma) <- cells fit$cell_type <- data.frame(cell_type = fit$cell_type) rownames(fit$cell_type) <- cells saveRDS(fit, file = snakemake@output[["fit"]]) save.image() # plot heatmap source(file.path(snakemake@scriptdir, "common.R")) sce <- assign_celltypes(fit, sce, snakemake@params[["min_gamma"]]) pdf(file = snakemake@output[["heatmap"]]) pal <- pal_d3("category20")(ncol(marker_mat)) names(pal) <- colnames(marker_mat) celltype <- HeatmapAnnotation(df = data.frame(celltype = colData(sce)$celltype), col = list(celltype = pal)) Heatmap(logcounts(sce), col = viridis(100), clustering_distance_columns = "canberra", clustering_distance_rows = "canberra", use_raster = TRUE, show_row_dend = FALSE, show_column_dend = FALSE, show_column_names = FALSE, top_annotation = celltype, name = "logcounts") dev.off() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) library(scran) sce <- readRDS(snakemake@input[[1]]) # determine cell cycle species = snakemake@params[["species"]] if (species == "mouse") { markers <- "mouse_cycle_markers.rds" } else if (species == "human") { markers <- "human_cycle_markers.rds" } else { stop("Unsupported species. Only mouse and human are supported.") } # read trained data pairs <- readRDS(system.file("exdata", markers, package="scran")) # obtain assignments assignments <- cyclone(sce, pairs, gene.names=rowData(sce)$ensembl_gene_id) # store assignments saveRDS(assignments, file=snakemake@output[[1]]) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(RColorBrewer) assignments <- readRDS(snakemake@input[["rds"]]) annotation <- read.table(snakemake@input[["cells"]], header=TRUE, row.names=1) covariate <- gsub("-", ".", snakemake@wildcards[["covariate"]]) covariate.values <- unique(annotation[, covariate]) n <- length(covariate.values) # plot pdf(file=snakemake@output[[1]]) # set color palette palette(brewer.pal(n=n, name="Dark2")) plot(assignments$score$G1, assignments$score$G2M, xlab="G1 score", ylab="G2/M score", pch=16, col=annotation[, covariate]) legend("topright", legend=covariate.values, col=palette()[covariate.values], pch=16) dev.off() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) library(scran) library(ggsci) source(file.path(snakemake@scriptdir, "common.R")) seed <- as.integer(snakemake@wildcards[["seed"]]) target_parent <- snakemake@wildcards[["parent"]] parents <- snakemake@params[["parents"]] fits <- snakemake@input[["fits"]] sce <- readRDS(snakemake@input[["sce"]]) for(i in 1:length(fits)) { cellassign_fit <- fits[i] parent <- parents[i] cellassign_fit <- readRDS(cellassign_fit) sce <- assign_celltypes(cellassign_fit, sce, snakemake@params[["min_gamma"]]) if(parent == target_parent) { break } } style <- theme( axis.text=element_text(size=12), axis.title=element_text(size=16)) # plot t-SNE pdf(file=snakemake@output[[1]], width = 12) set.seed(seed) plotTSNE(sce, colour_by="celltype") + scale_fill_d3(alpha = 1.0) + scale_colour_d3(alpha = 1.0) + style dev.off() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(tidyverse) library(SingleCellExperiment) library(scran) library(edgeR) library(dplyr) source(file.path(snakemake@scriptdir, "common.R")) sce <- readRDS(snakemake@input[["sce"]]) for(cellassign_fit in snakemake@input[["cellassign_fits"]]) { cellassign_fit <- readRDS(cellassign_fit) sce <- assign_celltypes(cellassign_fit, sce, snakemake@params[["min_gamma"]]) } # handle constrain celltypes constrain_celltypes <- snakemake@params[["constrain_celltypes"]] if(!is.null(constrain_celltypes)) { celltypes <- constrain_celltypes[["celltypes"]] common_var <- constrain_celltypes[["common"]] sce <- sce[, colData(sce)$celltype %in% celltypes] if(!is.null(common_var)) { if(!(common_var %in% colnames(colData(sce)))) { stop(paste("covariate", common_var, "not found in cell metadata")) } is_common_in_all <- apply(table(colData(sce)[, c(common_var, "celltype")]) > 0, 1, all) common_in_all <- names(is_common_in_all)[is_common_in_all] sce <- sce[, colData(sce)[, common_var] %in% common_in_all] colData(sce) <- droplevels(colData(sce)) } } # only keep the requested cells if(length(snakemake@params[["celltypes"]]) > 0) { sce <- sce[, colData(sce)$celltype %in% snakemake@params[["celltypes"]]] } colData(sce)$celltype <- factor(colData(sce)$celltype) colData(sce)$detection_rate <- cut(colData(sce)$detection_rate, 10) # convert to edgeR input y <- convertTo(sce, type = "edgeR", col.fields = colnames(colData(sce))) # obtain design matrix design <- model.matrix(as.formula(snakemake@params[["design"]]), data=y$samples) y <- calcNormFactors(y) y <- estimateDisp(y, design) fit <- glmQLFit(y, design) qlf <- glmQLFTest(fit, coef = snakemake@params[["coef"]]) saveRDS(y, file = snakemake@output[["edger_dge"]]) write_tsv(rownames_to_column(topTags(qlf, n = 100000)$table, var = "gene"), snakemake@output[["table"]]) pdf(file = snakemake@output[["bcv"]]) plotBCV(y) dev.off() pdf(file = snakemake@output[["md"]]) plotMD(qlf) abline(h = c(-1,1), col = "grey") dev.off() pdf(file = snakemake@output[["disp"]]) plotQLDisp(fit) dev.off() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) library(scran) sce <- readRDS(snakemake@input[["rds"]]) annotation <- read.table(snakemake@input[["cells"]], row.names=1, header=TRUE) pdf(file=snakemake@output[[1]]) plotExplanatoryVariables(sce, variables=c("total_counts_Spike", "log10_total_counts_Spike", colnames(annotation))) + theme( axis.text=element_text(size=12), axis.title=element_text(size=16) ) dev.off() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) sce <- readRDS(snakemake@input[[1]]) # drop cells with too few counts or expressed features libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE) feature.drop <- isOutlier(sce$total_features_by_counts, nmads=3, type="lower", log=TRUE) # drop cells with too high proportion of mito genes or spike-ins expressed mito.drop <- isOutlier(sce$pct_counts_Mt, nmads=3, type="higher") spike.drop <- isOutlier(sce$pct_counts_Spike, nmads=3, type="higher") # write stats stats <- data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop), ByMito=sum(mito.drop), BySpike=sum(spike.drop), Remaining=ncol(sce)) write.table(stats, file=snakemake@output[["stats"]], sep='\t', row.names=FALSE, quote=FALSE) # filter sce sce <- sce[,!(libsize.drop | feature.drop | mito.drop | spike.drop)] saveRDS(sce, file=snakemake@output[["rds"]]) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) library(scran) library(ggsci) library(viridis) seed <- as.integer(snakemake@wildcards[["seed"]]) gene <- snakemake@wildcards[["gene"]] sce <- readRDS(snakemake@input[["sce"]]) colData(sce)[, gene] <- logcounts(sce)[gene, ] style <- theme( axis.text=element_text(size=12), axis.title=element_text(size=16)) # plot t-SNE pdf(file=snakemake@output[[1]], width = 12) set.seed(seed) plotTSNE(sce, colour_by=gene) + scale_fill_viridis(alpha = 1.0) + scale_colour_viridis(alpha = 1.0) + style dev.off() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) library(scran) library(RBGL) library(Rgraphviz) library(gplots) fdr <- snakemake@params[["fdr"]] sce <- readRDS(snakemake@input[["sce"]]) hvgs <- read.table(snakemake@input[["hvg"]], row.names=1) topn <- min(nrow(hvgs), snakemake@params[["top_n"]]) hvgs <- hvgs[1:topn, ] # find correlated pairs set.seed(100) var.cor <- correlatePairs(sce, subset.row=rownames(hvgs)) sig.cor <- var.cor$FDR <= fdr write.table(file=snakemake@output[["corr"]], var.cor, sep="\t", quote=FALSE, row.names=FALSE) # define graph with significant correlation as edges g <- ftM2graphNEL(cbind(var.cor$gene1, var.cor$gene2)[sig.cor,], W=NULL, V=NULL, edgemode="undirected") # find highly connected clusters cl <- highlyConnSG(g)$clusters cl <- cl[order(lengths(cl), decreasing=TRUE)] # plot correlation graph pdf(file=snakemake@output[["graph"]]) plot(g, "neato", attrs=list(node=list(fillcolor="lightblue", color="lightblue"))) dev.off() # choose significantly correlated genes chosen <- unique(c(var.cor$gene1[sig.cor], var.cor$gene2[sig.cor])) # get normalized expressions norm.exprs <- logcounts(sce)[chosen,,drop=FALSE] # plot heatmap pdf(file=snakemake@output[["heatmap"]]) # z-score heat.vals <- norm.exprs - rowMeans(norm.exprs) heat.out <- heatmap.2(heat.vals, col=bluered, symbreak=TRUE, trace="none", cexRow=0.6) dev.off() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) fdr <- snakemake@params[["fdr"]] covariate <- gsub("-", ".", snakemake@wildcards[["covariate"]]) sce <- readRDS(snakemake@input[["sce"]]) var.cor <- read.table(snakemake@input[["var_cor"]], header=TRUE) sig.cor <- var.cor$FDR <= fdr # choose significantly correlated genes chosen <- unique(c(var.cor$gene1[sig.cor], var.cor$gene2[sig.cor])) style <- theme( axis.text=element_text(size=12), axis.title=element_text(size=16)) # plot PCA pdf(file=snakemake@output[[1]]) plotPCA(sce[chosen, ], colour_by=covariate, ncomponents=3) + style dev.off() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) library(scran) use.spikes <- snakemake@params[["use_spikes"]] fdr <- snakemake@params[["fdr"]] min.bio.comp <- snakemake@params[["min_bio_comp"]] sce <- readRDS(snakemake@input[["sce_norm"]]) design <- readRDS(snakemake@input[["design_matrix"]]) # apply trend model to obtain per-gene variance # batch effects are modeled in design matrix var.fit <- trendVar(sce, use.spikes=use.spikes, design=design) var.out <- decomposeVar(sce, var.fit) # from now on use expressions with removed batch effects sce <- readRDS(snakemake@input[["sce_batch"]]) # plot mean vs. variance (spikes are red) pdf(file=snakemake@output[["mean_vs_variance"]]) plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", ylab="Variance of log-expression") o <- order(var.out$mean) lines(var.out$mean[o], var.out$tech[o], col="dodgerblue", lwd=2) cur.spike <- isSpike(sce) points(var.out$mean[cur.spike], var.out$total[cur.spike], col="red", pch=16) dev.off() # determine HVGs (highly variable genes) hvg.out <- var.out[which(var.out$FDR <= fdr & var.out$bio >= min.bio.comp),] # sort hvg.out <- hvg.out[order(hvg.out$bio, decreasing=TRUE),] # store HVGs in table on disk write.table(file=snakemake@output[["hvg"]], hvg.out, sep="\t", quote=FALSE, col.names=NA) # plot expression distributions pdf(file=snakemake@output[["hvg_expr_dist"]]) print(hvg.out) plotExpression(sce, rownames(hvg.out)[1:snakemake@params[["show_n"]]]) + theme( axis.text=element_text(size=12), axis.title=element_text(size=16) ) dev.off() # store variance estimates saveRDS(var.out, file=snakemake@output[["var"]]) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) library(scran) seed <- as.integer(snakemake@wildcards[["seed"]]) fdr <- snakemake@params[["fdr"]] covariate <- gsub("-", ".", snakemake@wildcards[["covariate"]]) sce <- readRDS(snakemake@input[["sce"]]) var.cor <- read.table(snakemake@input[["var_cor"]], header=TRUE) sig.cor <- var.cor$FDR <= fdr # choose significantly correlated genes chosen <- unique(c(var.cor$gene1[sig.cor], var.cor$gene2[sig.cor])) style <- theme( axis.text=element_text(size=12), axis.title=element_text(size=16)) # plot t-SNE pdf(file=snakemake@output[[1]]) set.seed(seed) plotTSNE(sce[chosen, ], colour_by=covariate) + style dev.off() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) library(scran) all.counts <- as.matrix(read.table(snakemake@input[["counts"]], row.names=1, header=TRUE)) all.annotation <- read.table(snakemake@input[["cells"]], row.names=1, header=TRUE) sce <- SingleCellExperiment( assays = list(counts = all.counts), colData = all.annotation ) # get feature annotation species = snakemake@params[["species"]] if (species == "mouse") { dataset <- "mmusculus_gene_ensembl" symbol <- "mgi_symbol" } else if (species == "human") { dataset <- "hsapiens_gene_ensembl" symbol <- "hgnc_symbol" } else { stop("Unsupported species. Only mouse and human are supported.") } sce <- getBMFeatureAnnos(sce, filters=c(snakemake@params[["feature_ids"]]), attributes = c("ensembl_gene_id", symbol, "chromosome_name", "gene_biotype", "start_position", "end_position"), dataset=dataset, host=snakemake@config[["counts"]][["biomart"]]) rowData(sce)[, "gene_symbol"] <- rowData(sce)[, symbol] rownames(sce) <- uniquifyFeatureNames(rownames(sce), rowData(sce)$gene_symbol) # get mitochondrial genes is.mito <- colData(sce)$chromosome_name == "MT" # annotate spike-ins is.spike <- grepl(snakemake@params[["spike_pattern"]], rownames(sce)) isSpike(sce, "Spike") <- is.spike # calculate metrics sce <- calculateQCMetrics(sce, feature_controls=list(Spike=is.spike, Mt=is.mito)) colData(sce)[, "detection_rate"] <- sce$total_features_by_counts / nrow(sce) saveRDS(sce, file=snakemake@output[[1]]) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) library(scran) sce <- readRDS(snakemake@input[[1]]) # pre-cluster in order to group similar cells # this avoids a violation of the non-DE assumption preclusters <- quickCluster(sce) # compute size factors by pooling cells according to the clustering sce <- computeSumFactors(sce, clusters=preclusters, min.mean=snakemake@params[["min_count"]]) pdf(file=snakemake@output[["scatter"]]) plot(sizeFactors(sce), sce$total_counts/1e6, log="xy", ylab="Library size (millions)", xlab="Size factor") dev.off() # use size factors to calculate normalized counts in log space (available under logcounts(sce)) sce <- normalize(sce) saveRDS(sce, file=snakemake@output[["rds"]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(ComplexHeatmap) library(viridis) set.seed(562374) fit <- readRDS(snakemake@input[[1]]) pdf(file = snakemake@output[[1]]) Heatmap(fit$mle_params$gamma, col = viridis(100), clustering_distance_rows = "canberra", use_raster = TRUE, show_row_dend = FALSE, show_column_dend = FALSE, show_row_names = FALSE) dev.off() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(tidyverse) library(edgeR) gene_of_interest <- snakemake@wildcards[["gene"]] coef <- snakemake@params[["coef"]] diffexp <- read_tsv(snakemake@input[["diffexp"]]) y <- readRDS(snakemake@input[["edger_dge"]]) log_cpm <- as_tibble(cpm(y, log=TRUE), rownames="gene") %>% gather(key = "cell", value = "cpm", -gene) %>% filter(gene == gene_of_interest) if (nrow(log_cpm) == 0) { stop(paste("Error: gene not found:", gene_of_interest)) } log_cpm <- log_cpm %>% mutate(condition = as.factor(y$design[, coef])) diffexp <- diffexp %>% filter(gene == gene_of_interest) fmt_float <- function(x) { formatC(x, digits=2, format="g") } coef_name <- colnames(y$design)[coef] pdf(file = snakemake@output[[1]]) ggplot(log_cpm, aes(x = condition, y = cpm, fill = condition)) + geom_violin() + geom_jitter(alpha = 0.5, size = 1) + labs(fill = coef_name, title = gene_of_interest, subtitle = paste("p =", fmt_float(diffexp$PValue), ", FDR =", fmt_float(diffexp$FDR), ", log2fc =", fmt_float(diffexp$logFC))) + theme_classic() dev.off() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) library(scran) library(ggsci) source(file.path(snakemake@scriptdir, "common.R")) seed <- as.integer(snakemake@wildcards[["seed"]]) sce <- readRDS(snakemake@input[["sce"]]) cellassign_fit <- snakemake@input[["fit"]] cellassign_fit <- readRDS(cellassign_fit) sce <- assign_celltypes(cellassign_fit, sce, snakemake@params[["min_gamma"]]) # plot gene expression pdf(file=snakemake@output[[1]], width = 12, height = 12) plotExpression(sce, snakemake@params[["genes"]], snakemake@params[["feature"]]) dev.off() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) library(scran) library(ggsci) library(ggpubr) source(file.path(snakemake@scriptdir, "common.R")) gene_a <- snakemake@wildcards[["gene_a"]] gene_b <- snakemake@wildcards[["gene_b"]] aes_name <- function(name) { paste0("`", name, "`") } sce <- readRDS(snakemake@input[["sce"]]) constrain_celltypes <- snakemake@params[["constrain_celltypes"]] for(cellassign_fit in snakemake@input[["fits"]]) { cellassign_fit <- readRDS(cellassign_fit) sce <- assign_celltypes(cellassign_fit, sce, snakemake@params[["min_gamma"]], constrain_celltypes) } # handle constrain celltypes if(!is.null(constrain_celltypes)) { celltypes <- constrain_celltypes print(celltypes) sce <- sce[, colData(sce)$celltype %in% celltypes] } pdf(file=snakemake@output[[1]], width = 4, height = 4) data <- as.data.frame(t(logcounts(sce[c(gene_a, gene_b), ]))) vmin <- min(data) vmax <- max(data) regression <- snakemake@params[["regression"]] correlation <- snakemake@params[["correlation"]] dropout_threshold <- snakemake@params[["dropout_threshold"]] data[, "dropout"] <- apply(data, 1, min) < dropout_threshold non_dropout_data <- data[!data$dropout, ] p <- ggplot(data, aes_string(x=aes_name(gene_a), y=aes_name(gene_b), color=aes_name("dropout"))) + geom_point(shape=1) + scale_color_manual(values = c("#000000", "#666666")) + theme_classic() + theme(legend.position = "none") if(regression != FALSE) { regression <- as.formula(regression) p = p + geom_smooth(method="lm", color = "red", formula = formula, data = non_dropout_data) + stat_regline_equation(data = non_dropout_data, label.x.npc = "left", label.y.npc = "top", formula = formula, aes(label = paste(..eq.label.., ..rr.label.., ..adj.rr.label.., sep = "~~~~"))) } if(correlation != FALSE) { p = p + stat_cor(method = correlation, data = non_dropout_data) + geom_smooth(method="lm", data = non_dropout_data, color = "red") } # pass plot to PDF p dev.off() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(scater) sce <- readRDS(snakemake@input[[1]]) # plot library sizes pdf(file=snakemake@output[["libsizes"]]) hist(sce$total_counts/1e6, xlab="Library sizes (millions)", main="", breaks=20, col="grey80", ylab="Number of cells") dev.off() # plot number of expressed genes pdf(file=snakemake@output[["expressed"]]) hist(sce$total_features_by_counts, xlab="Number of expressed genes", main="", breaks=20, col="grey80", ylab="Number of cells") dev.off() # plot mitochondrial proportion pdf(file=snakemake@output[["mito_proportion"]]) hist(sce$pct_counts_Mt, xlab="Mitochondrial proportion (%)", ylab="Number of cells", breaks=20, main="", col="grey80") dev.off() # plot spike-in proportion pdf(file=snakemake@output[["spike_proportion"]]) hist(sce$pct_counts_Spike, xlab="ERCC proportion (%)", ylab="Number of cells", breaks=20, main="", col="grey80") dev.off() |
Support
- Future updates
Related Workflows





