Mixture Model for Zero-Inflated scRNA-seq Data Analysis

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

A Bayesian mixture model approach to zero-inflation of counts for scRNA-seq data

--- Table of Contents

About
Features
Running the Code

---

About

Single cell RNA-seq data exhibit large numbers of zero count values, that as demonstrated in our paper , for a subset of transcripts, be better modelled by a zero inflated negative binomial distribution. We develop a novel Dirichlet process mixture model which employs both a mixture at the cell level to model multiple cell types, and a mixture of single cell RNA-seq counts at the transcript level to model the transcript specific zero-inflation of counts. It is shown that this approach outperforms previous approaches that applied multinomial distributions to model single cell RNA-seq counts, and also performs better or comparably to existing top performing methods. By taking a Bayesian approach we are able to build interpretable models of expression within clusters, and to quantify uncertainty in cluster assignments. Applied to a publicly available data set of single cell RNA-seq counts of multiple cell types from the mouse cortex and hippocampus, we demonstrate how our approach can be used to distinguish sub-populations of cells as clusters in the data, and to identify gene sets that are indicative of membership of a sub-population. The methodology is implemented as an open source Snakemake pipeline hosted on this repository.

Identifying sub-populations of cells in single cell transcriptomic data – a Bayesian mixture model approach to zero-inflation of counts

---

Features

After running the pipeline, the following graphs are produced allowing the results of clustering by the Bayesian nonparametric zero-inflated negative binomial model to be interpreted: UMAP

All cells are projected onto a 2D plane through UMAP embeddings of their gene expression allowing the clusters assigned by the model to be visualised.

This graph is created under the directory results/plots

Cluster Means

For each cluster, an interactive graph is produced plotting the mean gene expression within that cluster against the mean expression across all clusters. This allows the identification of genes with unusual expression within the discovered sub-populations of cells.

These graphs are created under the directory results/{data}_exp

Average Gene Expression of All Clusters Against Cluster 1

g:Profiler

For the up and down regulated genes within each cluster, the results of gene set enrichment analysis with gprofiler2 are presented through interactive Manhattan plots. This enables the identification of significantly enriched biological functions and pathways from well established sources such as Gene Ontology (GO) and KEGG.

This graph is created under the directory results/plots

---

Running the Code

Dependencies

	About	Prior Dependencies	Installation
Anaconda	Anaconda is a distribution of the Python and R programming languages for scientific computing, that aims to simplify package management and deployment.	N/A	Download from the Anaconda site
Singularity	Singularity is a container platform. Within a container, required programs, libraries, data and scripts can be packaged together and distributed to create a reproducable runtime environment.	Singularity runs on Linux, though can also be run on Windows and Mac through a virtual machine such as VirtualBox Anaconda	See Singularity Installation
Snakemake	Snakemake is a workflow management system for reproducible and scalable data analysis through the creation of pipelines. The workflow is defined in a ‘Snakefile’ consisting of rules that represent how to create output files from input files.	Anaconda	See Snakemake Installation

Instructions

Set Up

1. Download the code from the main brach

Either directly from GitHub by pressing Code ➜ Download ZIP

Or through a Linux terminal using the command below

(base) user@terminal:~$ wget https://github.com/tt104/scmixture/archive/refs/heads/main.zip

2. Extract the contents of zip file

Either locate the file and pressing "Extract"

Or if using the terminal, use the command below, where scmixture.zip is the file path of the zip file and ~/new_file_path is the location where you want to save the file

(base) user@terminal:~$ unzip scmixture.zip -d ~/new_file_path

3. Once the project has been unzipped, navigate to the project folder, where scmixture is the file path

(base) user@terminal:~$ cd scmixture

Configuration **5.** Place one or more scRNA-seq csv dataset(s) into folder `scmixture/data`

If the folder does not exist, you can create it using the command below

(base) user@terminal:~$ mkdir data

6. Optionally edit config/config.yaml to change the algorithm's settings

# K-means initialisation - no. of clusters
kmeans: 40
# Controls the number of transcripts selected to consider in clustering
filter: 1000
# Burn-in, total number of MCMC chain iterations, thinning out interval and total number of chains
burnin: 500
iter: 1000
thin: 10
chains: 4
# model - nb or mult
model: "nb"

Running the Code **7.** Activate the snakemake environment through conda

(base) user@terminal:~/scmixture$ conda activate snakemake

8. Run the code through the command below:

(snakemake) user@terminal:~/scmixture$ snakemake --use-singularity --cores n --config data=DATASET organism=ORGANISM_ID

Replace n in --cores n (or -cn ) with the number of cores you wish to use
Replace DATASET with the name of your dataset, ignoring the .csv file extension, e.g. --config data=GBM
If running g:Profiler, replace ORGANISM_ID with the g:Profiler id for the data's subject (for example hsapiens for human or mmusculus for mouse), otherwise organism=ORGANISM_ID can be omitted from the command

Code Snippets

dir.create(snakemake@output[[1]])

# Fetch cluster means and dispersions
clusterMu<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1))
clusterOmega<-as.matrix(read.csv(snakemake@input[[2]],header=TRUE,row.names=1))

nc<-nrow(clusterMu)
names<-colnames(clusterMu)

avmu<-colMeans(clusterMu)
avom<-colMeans(clusterOmega)
qup<-qnbinom(0.95,size=avom,mu=avmu)
qlo<-qnbinom(0.05,size=avom,mu=avmu)

for(cl in c(1:nc))
{
	clmu<-clusterMu[cl,]
	clom<-clusterOmega[cl,]

	up<-which(clmu>qup)
	down<-which(clmu<qlo)
	upgenes<-names[up]
	downgenes<-names[down]

	if(length(upgenes)!=0)
	{
	write.table(upgenes,paste(snakemake@output[[1]],"/Cluster_",cl,"_up.csv",sep=''),sep=',',row.names=FALSE,col.names=FALSE,quote=FALSE)
	}
	if(length(downgenes)!=0)
	{
	write.table(downgenes,paste(snakemake@output[[1]],"/Cluster_",cl,"_down.csv",sep=''),sep=',',row.names=FALSE,col.names=FALSE,quote=FALSE)
	}
}

R From line 1 of scripts/genes.R

library(gprofiler2)
library(htmlwidgets)

# Iterate through up and down genes
for (i in 1:length(snakemake@input[])) {
  genelist <- read.csv(snakemake@input[[i]], header=FALSE)[,1]

  if(length(genelist)>0)
  {
    gostres <- gost(query = genelist ,organism = snakemake@config[["organism"]])
    if(!is.null(gostres))
    {
      p <- gostplot(gostres, interactive = TRUE)
      htmlwidgets::saveWidget(p,snakemake@output[[i]])
    }
    else
    {
      system(paste("echo No enrichment > ",snakemake@output[[i]],sep=''))
    }
  }
}

R htmlwidgets gprofiler2 From line 1 of scripts/gprof.R

logsum<-function(a,b)
{
	m<-pmax(a,b)
	log(exp(a-m)+exp(b-m))+m
}

maxlikeNB2<-function(xs,scales)
{
	mu<-NULL
	os<-NULL
	ws<-NULL

	for(j in 1:ncol(xs))
	{
		mask<-which(xs[,j]>0)
		zmask<-which(xs[,j]==0)
		nblikeZI<-function(ps)
		{
			if(length(zmask>0))
			{
				l0<- -sum(logsum(log(ps[3])+dnbinom(xs[zmask,j],size=ps[1],mu=scales[zmask]*ps[2],log=TRUE),rep(log(1-ps[3]),length(zmask))))
			}
			else
			{
				l0<-0.0
			}
			l<- -sum(log(ps[3])+dnbinom(xs[mask,j],size=ps[1],mu=scales[mask]*ps[2],log=TRUE))
			p<- -dgamma(ps[1],shape=1,rate=0.01,log=TRUE)
			l0+l+p
		}
		mlzi<-optim(c(1,mean(xs[,j]),0.5),nblikeZI,method="L-BFGS-B",lower=c(0.001,0.001,0.001),upper=c(Inf,Inf,0.9999))
		mu<-c(mu,mlzi$par[2])
		os<-c(os,mlzi$par[1])
		ws<-c(ws,mlzi$par[3])
	}
	list(mu=mu,os=os,ws=ws)
}

d<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1))

dd<-t(d)

ldata<-log(dd+1)
vs<-apply(ldata,2,var)
cv<-sqrt(exp(vs-1))
ix<-order(cv,decreasing=TRUE)[1:1000]

xs<-t(d)[,ix[1:1000]]

scales<-as.matrix(read.csv(snakemake@input[[2]],header=FALSE))

res<-maxlikeNB2(xs,scales)

ms<-res$mu
os<-res$os
ws<-res$ws

gmulike<-function(ps)
{
	        -sum(dgamma(ms,shape=ps[1],rate=ps[2],log=TRUE))
}

mug<-optim(c(mean(ms),1),gmulike,method="L-BFGS-B",lower=c(0.00001,0.00001),upper=c(Inf,Inf))

gomegalike<-function(ps)
{
	        -sum(dgamma(os,shape=ps[1],rate=ps[2],log=TRUE))
}

omegag<-optim(c(mean(os),1),gomegalike,method="L-BFGS-B",lower=c(0.00001,0.00001),upper=c(Inf,Inf))

hp<-c(mug$par,omegag$par)

write.table(hp,snakemake@output[[1]],row.names=FALSE,col.names=FALSE)

R From line 1 of scripts/hyperparameters.R

library(plotly)
library(htmlwidgets)

# Open as cluster means as data.frame
cluster_data <- read.csv(snakemake@input[[1]], header=TRUE)

# Find the mean gene expression for each gene across all clusters
mean_values <- colMeans(cluster_data)
# Get the gene names from the columns
gene_names <- colnames(cluster_data)
# Create list of column numbers representing genes
gene_ids <- c(1:ncol(cluster_data))

create_plot <- function(cluster_num) {
  # Get the gene counts for the selected cluster
  cluster_values <- unname(cluster_data[cluster_num,])
  # Plot the mean gene expression across clusters
  fig <- plot_ly(cluster_data, x = ~gene_names, y = ~mean_values, type = 'bar', 
                 name = 'Mean Expression of All Clusters',
                 hoverlabel = list(namelength = -1),
                 marker = list(color = 'grey'))
  # Plot the gene expression of the selected cluster
  fig <- fig %>% add_markers(y = ~cluster_values, 
                             name = paste('Cluster', cluster_num, 'Mean Expression'),
                             marker = list(
                               color = ~cluster_values,
                               colorscale = 'Jet',
                               line = list(color = 'black', width = 1)))
  # Name the graph and hide the x-axis
  fig <- fig %>% layout(title = paste("Average Gene Expression of All Clusters Against Cluster", cluster_num),
                        xaxis = list(title = "Gene", zeroline = FALSE, showline = FALSE, 
                                     showticklabels = FALSE, showgrid = FALSE),
                        yaxis = list(title = "Gene Expression"),
                        plot_bgcolor='rgb(245, 245, 245)',
                        hovermode = "x unified")
  # Save the graph to HTML
  htmlwidgets::saveWidget(fig, file.path(gsub(" ", "", paste(snakemake@output[[1]], "/Cluster", cluster_num, "Means.html"))))
}

# Create the output directory if it does not already exist
dir.create(file.path(snakemake@output[[1]]), showWarnings = FALSE)

# Make a graph for each cluster
print(paste('Creating mean expression graphs for', nrow(cluster_data), 'clusters'))
for (cluster in 1:nrow(cluster_data)) {
  create_plot(cluster)
}

R plotly htmlwidgets From line 1 of scripts/meanplots.R

library(umap)
library(ggplot2)

# Original data
D<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1))

clusters<-as.matrix(read.csv(snakemake@input[[2]],header=TRUE,row.names=1))

uD<-umap(log(t(D)+1))

pdf(snakemake@output[[1]])
udf<-data.frame(x=uD$layout[,1],y=uD$layout[,2],Cluster=as.factor(clusters[,1]))
p<-ggplot(udf)+geom_point(aes(x,y,color=Cluster))+xlab("UMAP 1")+ylab("UMAP 2")
print(p)
dev.off()

R ggplot2 umap From line 1 of scripts/plots.R

D<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1))
cellnames<-colnames(D)

# Gene names selected
names<-as.matrix(read.csv(snakemake@input[[5]],header=FALSE))[,1]
ngene<-length(names)

# Re-map clusters
clustersRaw<-as.matrix(read.csv(snakemake@input[[2]],header=FALSE))
clist<-unique(clustersRaw)

nc<-length(clist)

clusters<-rep(0,length(clustersRaw))
clustersDict<-list()

for(i in 1:nc)
{
	t<-clist[i]
	clusters[clustersRaw==t]<-i
	clustersDict[[t]]<-i
}

# Write counts table
write.table(clusters,snakemake@output[[1]],sep=',',row.names=cellnames,col.names=c("Cluster"),quote=FALSE)
write.table(table(clusters),snakemake@output[[2]],sep=',')

# Fetch cluster means and dispersions
clusterMuRaw<-as.matrix(read.csv(snakemake@input[[3]],header=FALSE))
clusterOmegaRaw<-as.matrix(read.csv(snakemake@input[[4]],header=FALSE))


clusterMu<-matrix(0,nrow=nc,ncol=ngene)
colnames(clusterMu)<-names
rownames(clusterMu)<-c(1:nc)

clusterOmega<-matrix(0,nrow=nc,ncol=ngene)
colnames(clusterOmega)<-names
rownames(clusterOmega)<-c(1:nc)

for(i in 1:nc)
{
	t<-clist[i]
	s<-clustersDict[[t]]
	clusterMu[s,]<-clusterMuRaw[t,]
	clusterOmega[s,]<-clusterOmegaRaw[t,]
}

write.table(clusterMu,snakemake@output[[3]],sep=',',quote=FALSE)
write.table(clusterOmega,snakemake@output[[4]],sep=',',quote=FALSE)

R From line 2 of scripts/post.R

library(scran)

D<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1))

clusters <- quickCluster(D)
sce <- calculateSumFactors(D, clusters=clusters)

write.table(sce,snakemake@output[[1]],sep=',',row.names=FALSE,col.names=FALSE)