Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
A Bayesian mixture model approach to zero-inflation of counts for scRNA-seq data
--- Table of Contents
About
Single cell RNA-seq data exhibit large numbers of zero count values, that as demonstrated in our paper , for a subset of transcripts, be better modelled by a zero inflated negative binomial distribution. We develop a novel Dirichlet process mixture model which employs both a mixture at the cell level to model multiple cell types, and a mixture of single cell RNA-seq counts at the transcript level to model the transcript specific zero-inflation of counts. It is shown that this approach outperforms previous approaches that applied multinomial distributions to model single cell RNA-seq counts, and also performs better or comparably to existing top performing methods. By taking a Bayesian approach we are able to build interpretable models of expression within clusters, and to quantify uncertainty in cluster assignments. Applied to a publicly available data set of single cell RNA-seq counts of multiple cell types from the mouse cortex and hippocampus, we demonstrate how our approach can be used to distinguish sub-populations of cells as clusters in the data, and to identify gene sets that are indicative of membership of a sub-population. The methodology is implemented as an open source Snakemake pipeline hosted on this repository.
Features
After running the pipeline, the following graphs are produced allowing the results of clustering by the Bayesian nonparametric zero-inflated negative binomial model to be interpreted: UMAP
All cells are projected onto a 2D plane through UMAP embeddings of their gene expression allowing the clusters assigned by the model to be visualised.
This graph is created under the directory results/plots
Cluster Means
For each cluster, an interactive graph is produced plotting the mean gene expression within that cluster against the mean expression across all clusters. This allows the identification of genes with unusual expression within the discovered sub-populations of cells.
These graphs are created under the directory results/{data}_exp
Average Gene Expression of All Clusters Against Cluster 1
For the up and down regulated genes within each cluster, the results of gene set enrichment analysis with gprofiler2 are presented through interactive Manhattan plots. This enables the identification of significantly enriched biological functions and pathways from well established sources such as Gene Ontology (GO) and KEGG.
This graph is created under the directory results/plots
Running the Code
Dependencies
About | Prior Dependencies | Installation | |
---|---|---|---|
![]() Anaconda |
Anaconda is a distribution of the Python and R programming languages for scientific computing, that aims to simplify package management and deployment. |
|
Download from the Anaconda site |
![]() Singularity |
Singularity is a container platform. Within a container, required programs, libraries, data and scripts can be packaged together and distributed to create a reproducable runtime environment. |
|
See Singularity Installation |
![]() Snakemake |
Snakemake is a workflow management system for reproducible and scalable data analysis through the creation of pipelines. The workflow is defined in a ‘Snakefile’ consisting of rules that represent how to create output files from input files. |
|
See Snakemake Installation |
Instructions
Set Up
1. Download the code from the main brach
- Either directly from GitHub by pressing Code ➜ Download ZIP
- Or through a Linux terminal using the command below
(base) user@terminal:~$ wget https://github.com/tt104/scmixture/archive/refs/heads/main.zip
2. Extract the contents of zip file
- Either locate the file and pressing "Extract"
-
Or if using the terminal, use the command below, where
scmixture.zip
is the file path of the zip file and~/new_file_path
is the location where you want to save the file
(base) user@terminal:~$ unzip scmixture.zip -d ~/new_file_path
3.
Once the project has been unzipped, navigate to the project folder, where
scmixture
is the file path
(base) user@terminal:~$ cd scmixture
Configuration **5.** Place one or more scRNA-seq csv dataset(s) into folder `scmixture/data`
- If the folder does not exist, you can create it using the command below
(base) user@terminal:~$ mkdir data
6. Optionally edit config/config.yaml to change the algorithm's settings
# K-means initialisation - no. of clusters
kmeans: 40
# Controls the number of transcripts selected to consider in clustering
filter: 1000
# Burn-in, total number of MCMC chain iterations, thinning out interval and total number of chains
burnin: 500
iter: 1000
thin: 10
chains: 4
# model - nb or mult
model: "nb"
Running the Code **7.** Activate the snakemake environment through conda
(base) user@terminal:~/scmixture$ conda activate snakemake
8. Run the code through the command below:
(snakemake) user@terminal:~/scmixture$ snakemake --use-singularity --cores n --config data=DATASET organism=ORGANISM_ID
-
Replace
n
in--cores n
(or-cn
) with the number of cores you wish to use -
Replace
DATASET
with the name of your dataset, ignoring the .csv file extension, e.g.--config data=GBM
-
If running g:Profiler, replace
ORGANISM_ID
with the g:Profiler id for the data's subject (for examplehsapiens
for human ormmusculus
for mouse), otherwiseorganism=ORGANISM_ID
can be omitted from the command
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | dir.create(snakemake@output[[1]]) # Fetch cluster means and dispersions clusterMu<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1)) clusterOmega<-as.matrix(read.csv(snakemake@input[[2]],header=TRUE,row.names=1)) nc<-nrow(clusterMu) names<-colnames(clusterMu) avmu<-colMeans(clusterMu) avom<-colMeans(clusterOmega) qup<-qnbinom(0.95,size=avom,mu=avmu) qlo<-qnbinom(0.05,size=avom,mu=avmu) for(cl in c(1:nc)) { clmu<-clusterMu[cl,] clom<-clusterOmega[cl,] up<-which(clmu>qup) down<-which(clmu<qlo) upgenes<-names[up] downgenes<-names[down] if(length(upgenes)!=0) { write.table(upgenes,paste(snakemake@output[[1]],"/Cluster_",cl,"_up.csv",sep=''),sep=',',row.names=FALSE,col.names=FALSE,quote=FALSE) } if(length(downgenes)!=0) { write.table(downgenes,paste(snakemake@output[[1]],"/Cluster_",cl,"_down.csv",sep=''),sep=',',row.names=FALSE,col.names=FALSE,quote=FALSE) } } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | library(gprofiler2) library(htmlwidgets) # Iterate through up and down genes for (i in 1:length(snakemake@input[])) { genelist <- read.csv(snakemake@input[[i]], header=FALSE)[,1] if(length(genelist)>0) { gostres <- gost(query = genelist ,organism = snakemake@config[["organism"]]) if(!is.null(gostres)) { p <- gostplot(gostres, interactive = TRUE) htmlwidgets::saveWidget(p,snakemake@output[[i]]) } else { system(paste("echo No enrichment > ",snakemake@output[[i]],sep='')) } } } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 | logsum<-function(a,b) { m<-pmax(a,b) log(exp(a-m)+exp(b-m))+m } maxlikeNB2<-function(xs,scales) { mu<-NULL os<-NULL ws<-NULL for(j in 1:ncol(xs)) { mask<-which(xs[,j]>0) zmask<-which(xs[,j]==0) nblikeZI<-function(ps) { if(length(zmask>0)) { l0<- -sum(logsum(log(ps[3])+dnbinom(xs[zmask,j],size=ps[1],mu=scales[zmask]*ps[2],log=TRUE),rep(log(1-ps[3]),length(zmask)))) } else { l0<-0.0 } l<- -sum(log(ps[3])+dnbinom(xs[mask,j],size=ps[1],mu=scales[mask]*ps[2],log=TRUE)) p<- -dgamma(ps[1],shape=1,rate=0.01,log=TRUE) l0+l+p } mlzi<-optim(c(1,mean(xs[,j]),0.5),nblikeZI,method="L-BFGS-B",lower=c(0.001,0.001,0.001),upper=c(Inf,Inf,0.9999)) mu<-c(mu,mlzi$par[2]) os<-c(os,mlzi$par[1]) ws<-c(ws,mlzi$par[3]) } list(mu=mu,os=os,ws=ws) } d<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1)) dd<-t(d) ldata<-log(dd+1) vs<-apply(ldata,2,var) cv<-sqrt(exp(vs-1)) ix<-order(cv,decreasing=TRUE)[1:1000] xs<-t(d)[,ix[1:1000]] scales<-as.matrix(read.csv(snakemake@input[[2]],header=FALSE)) res<-maxlikeNB2(xs,scales) ms<-res$mu os<-res$os ws<-res$ws gmulike<-function(ps) { -sum(dgamma(ms,shape=ps[1],rate=ps[2],log=TRUE)) } mug<-optim(c(mean(ms),1),gmulike,method="L-BFGS-B",lower=c(0.00001,0.00001),upper=c(Inf,Inf)) gomegalike<-function(ps) { -sum(dgamma(os,shape=ps[1],rate=ps[2],log=TRUE)) } omegag<-optim(c(mean(os),1),gomegalike,method="L-BFGS-B",lower=c(0.00001,0.00001),upper=c(Inf,Inf)) hp<-c(mug$par,omegag$par) write.table(hp,snakemake@output[[1]],row.names=FALSE,col.names=FALSE) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | library(plotly) library(htmlwidgets) # Open as cluster means as data.frame cluster_data <- read.csv(snakemake@input[[1]], header=TRUE) # Find the mean gene expression for each gene across all clusters mean_values <- colMeans(cluster_data) # Get the gene names from the columns gene_names <- colnames(cluster_data) # Create list of column numbers representing genes gene_ids <- c(1:ncol(cluster_data)) create_plot <- function(cluster_num) { # Get the gene counts for the selected cluster cluster_values <- unname(cluster_data[cluster_num,]) # Plot the mean gene expression across clusters fig <- plot_ly(cluster_data, x = ~gene_names, y = ~mean_values, type = 'bar', name = 'Mean Expression of All Clusters', hoverlabel = list(namelength = -1), marker = list(color = 'grey')) # Plot the gene expression of the selected cluster fig <- fig %>% add_markers(y = ~cluster_values, name = paste('Cluster', cluster_num, 'Mean Expression'), marker = list( color = ~cluster_values, colorscale = 'Jet', line = list(color = 'black', width = 1))) # Name the graph and hide the x-axis fig <- fig %>% layout(title = paste("Average Gene Expression of All Clusters Against Cluster", cluster_num), xaxis = list(title = "Gene", zeroline = FALSE, showline = FALSE, showticklabels = FALSE, showgrid = FALSE), yaxis = list(title = "Gene Expression"), plot_bgcolor='rgb(245, 245, 245)', hovermode = "x unified") # Save the graph to HTML htmlwidgets::saveWidget(fig, file.path(gsub(" ", "", paste(snakemake@output[[1]], "/Cluster", cluster_num, "Means.html")))) } # Create the output directory if it does not already exist dir.create(file.path(snakemake@output[[1]]), showWarnings = FALSE) # Make a graph for each cluster print(paste('Creating mean expression graphs for', nrow(cluster_data), 'clusters')) for (cluster in 1:nrow(cluster_data)) { create_plot(cluster) } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | library(umap) library(ggplot2) # Original data D<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1)) clusters<-as.matrix(read.csv(snakemake@input[[2]],header=TRUE,row.names=1)) uD<-umap(log(t(D)+1)) pdf(snakemake@output[[1]]) udf<-data.frame(x=uD$layout[,1],y=uD$layout[,2],Cluster=as.factor(clusters[,1])) p<-ggplot(udf)+geom_point(aes(x,y,color=Cluster))+xlab("UMAP 1")+ylab("UMAP 2") print(p) dev.off() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 | D<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1)) cellnames<-colnames(D) # Gene names selected names<-as.matrix(read.csv(snakemake@input[[5]],header=FALSE))[,1] ngene<-length(names) # Re-map clusters clustersRaw<-as.matrix(read.csv(snakemake@input[[2]],header=FALSE)) clist<-unique(clustersRaw) nc<-length(clist) clusters<-rep(0,length(clustersRaw)) clustersDict<-list() for(i in 1:nc) { t<-clist[i] clusters[clustersRaw==t]<-i clustersDict[[t]]<-i } # Write counts table write.table(clusters,snakemake@output[[1]],sep=',',row.names=cellnames,col.names=c("Cluster"),quote=FALSE) write.table(table(clusters),snakemake@output[[2]],sep=',') # Fetch cluster means and dispersions clusterMuRaw<-as.matrix(read.csv(snakemake@input[[3]],header=FALSE)) clusterOmegaRaw<-as.matrix(read.csv(snakemake@input[[4]],header=FALSE)) clusterMu<-matrix(0,nrow=nc,ncol=ngene) colnames(clusterMu)<-names rownames(clusterMu)<-c(1:nc) clusterOmega<-matrix(0,nrow=nc,ncol=ngene) colnames(clusterOmega)<-names rownames(clusterOmega)<-c(1:nc) for(i in 1:nc) { t<-clist[i] s<-clustersDict[[t]] clusterMu[s,]<-clusterMuRaw[t,] clusterOmega[s,]<-clusterOmegaRaw[t,] } write.table(clusterMu,snakemake@output[[3]],sep=',',quote=FALSE) write.table(clusterOmega,snakemake@output[[4]],sep=',',quote=FALSE) |
1 2 3 4 5 6 7 8 | library(scran) D<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1)) clusters <- quickCluster(D) sce <- calculateSumFactors(D, clusters=clusters) write.table(sce,snakemake@output[[1]],sep=',',row.names=FALSE,col.names=FALSE) |
36 37 38 39 | shell: ''' rm finished.txt ''' |
48 | script: "scripts/scales.R" |
56 | script: "scripts/hyperparameters.R" |
70 | script: "scripts/scmix.jl" |
85 | script: "scripts/post.R" |
96 | script: "scripts/plots.R" |
105 106 | script: "scripts/genes.R" |
113 | script: "scripts/gprof.R" |
121 | script: "scripts/meanplots.R" |
Support
- Future updates
Related Workflows





