A [Snakemake][sm] workflow for standardised sc/snRNAseq analysis.
Find us in the "Standardized Usage" section of the [Snakemake Workflow Catalog][sm_wc].
Every single cell analysis is slightly different. This workflow represents what I would call a "core" analysis, as nearly every analysis I perform starts with something very akin to this. Given the custom nature of single cell analysis, this workflow is not designed to be all-encompassing. Rather, it aims to be extensible, modular, and reproducible. Any given step can be easily modified, as each step is a self-contained script, and a new rule can be easily added - see the downstream rules for an example, or the sketch below. Finally, by taking advantage of the integrated [Conda][conda] and [Singularity][sing] support, we can run the whole thing in an isolated environment.
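For instance, a minimal new rule might look like the following sketch. The rule name, paths, and env file here are hypothetical, for illustration only; the real rules wire self-contained scripts together in the same way:

```python
# A hypothetical downstream rule, mirroring the pattern this workflow uses:
# a thin rule that wires inputs/outputs to a self-contained script.
rule my_downstream_step:  # hypothetical name
    input:
        data="results/cluster/adata.h5ad",  # illustrative path
    output:
        data="results/my_step/adata.h5ad",
    log:
        "results/logs/my_step.log",
    conda:
        "../envs/my_step.yaml"  # hypothetical env file
    script:
        "../scripts/my_step.py"
```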
Notes on Installation
A full walkthrough on how to install and use this pipeline can be found [here][sm_wc].
To take advantage of Singularity, you'll need to install it separately. If you are running on a Linux system, Singularity can be installed from conda like so:

```sh
conda install -n snakemake -c conda-forge singularity
```
It's a bit more challenging for other operating systems. Your best bet is to follow their instructions [here][sing_install]. But don't worry! Singularity is not required! Snakemake will still run each step in its own Conda environment; it just won't put each Conda environment in a container.
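For context, this is roughly how the invocation differs with and without Singularity. These are standard Snakemake flags; the `--cores` value is illustrative:

```sh
# With Singularity: each step runs inside a container, in its own conda env
snakemake --use-conda --use-singularity --cores 4

# Without Singularity: steps still run in isolated conda envs, just not containerised
snakemake --use-conda --cores 4
```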
Get the Source Code
Alternatively, you may grab the source code. You likely will not need these steps unless you are planning to contribute.

Navigate to our [releases][releases] page on GitHub and download the most recent version. The following will do the trick:
```sh
curl -s https://api.github.com/repos/IMS-Bio2Core-Facility/single_snake_sequencing/releases/latest |
grep tarball_url |
cut -d " " -f 4 |
tr -d '",' |
xargs -n1 curl -sL |
tar xzf -
```
After querying the GitHub API for the most recent release information, we grep for the desired URL, split the line and extract the field, trim superfluous characters, and use `xargs` to pipe this to `curl` (allowing for redirects) before un-tarring the files. Easy!
Alternatively, for the bleeding edge, please clone the repo like so:
```sh
git clone https://github.com/IMS-Bio2Core-Facility/single_snake_sequencing
```
:warning: Heads Up! The bleeding edge may not be stable, as it contains all active development.
However you choose to install it, `cd` into the directory.
Notes on Data
This pipeline expects de-multiplexed fastq.gz files, normally produced by some derivative of bcl2fastq after sequencing. They can (technically) be placed anywhere, but we recommend creating a `data` directory in your project for them.
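For example, a project directory might look like the sketch below. The sample name and the exact lane/read fields are illustrative; they depend on your sequencing run and bcl2fastq's naming:

```
my_project/
└── data/
    ├── sampleA_S1_L001_R1_001.fastq.gz
    ├── sampleA_S1_L001_R2_001.fastq.gz
    └── ...
```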
Notes on the tools
The analysis pipeline was run using Snakemake v6.6.1. The full version and software lists can be found in the relevant yaml files under `workflow/envs`. All reasonable efforts have been made to ensure that the repository adheres to the best practices outlined [here][sm_bp].
Notes on the analysis
For a full discussion of the analysis methods, please see the technical documentation. Briefly, the count matrix was produced using Cell Ranger; droplet calling was performed with `DropletUtils::emptyDrops`; doublet detection with SOLO from the scVI family; batch effect removal with harmonypy; and general analysis and data handling with scanpy. A schematic of how these steps fit together is sketched below.
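As a rough orientation, the steps above map onto scanpy calls roughly as follows. This is a schematic sketch only, not the pipeline's actual code (see the snippets below for that); the input path and parameters are illustrative:

```python
# Schematic of the core analysis flow (illustrative only).
import scanpy as sc
import scanpy.external as sce

# Counts from Cell Ranger; droplets are filtered against emptyDrops calls upstream
adata = sc.read_10x_mtx("results/counts/sampleA/outs/raw_feature_bc_matrix")

sc.pp.calculate_qc_metrics(adata, percent_top=None, log1p=False, inplace=True)  # QC
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)  # HVG selection

sc.tl.pca(adata, n_comps=50)
sce.pp.harmony_integrate(adata, key="lane")  # batch correction via harmonypy

sc.pp.neighbors(adata, use_rep="X_pca_harmony")
sc.tl.umap(adata)
sc.tl.leiden(adata)  # clustering
```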
Future work
- [ ] Supply tests
- [ ] Track lane in samples that have been pooled and de-multiplexed
- [ ] Parallelise emptyDrops
- [ ] Support custom references
- [ ] Support SCTransform?
Code Snippets
```python
script:
    "../scripts/dim_reduc.py"
```
```python
script:
    "../scripts/cluster.py"
```
```python
shell:
    """
    {input.bin} \
    count \
    --nosecondary \
    {params.introns} \
    --id {wildcards.sample} \
    --transcriptome {input.genome} \
    --fastqs data \
    --sample {wildcards.sample} \
    --expect-cells {params.n_cells} \
    --localcores {threads} \
    --localmem {params.mem} \
    &> {log} && \
    mv {wildcards.sample} results/counts/{wildcards.sample}
    """
```
```python
script:
    "../scripts/densities.py"
```
```python
script:
    "../scripts/filter_empty.R"
```
```python
script:
    "../scripts/qc.py"
```
```python
script:
    "../scripts/doublets.py"
```
```python
wrapper:
    "0.79.0/bio/fastqc"
```
```python
wrapper:
    "0.79.0/bio/multiqc"
```
```python
shell:
    """
    wget -O resources/cellranger.tar.gz "{params.url}" &> {log}
    tar -xzf resources/cellranger.tar.gz -C resources &> {log} && \
    rm -rf resources/cellranger.tar.gz
    mv resources/cellranger-* resources/cellranger
    """
```
```python
shell:
    """
    wget -O resources/genome.tar.gz "{params.url}" &> {log}
    tar -xzf resources/genome.tar.gz -C resources &> {log} && \
    rm -rf resources/genome.tar.gz
    mv resources/refdata-* resources/genome
    """
```
```python
shell:
    """
    unzip {input.bcl_zip} -d results/bcl2fastq &> {log}
    """
```
```python
shell:
    """
    mv {input} {output} &> {log}
    """
```
```python
if __name__ == "__main__":
    from pathlib import Path

    import matplotlib.pyplot as plt
    import pandas as pd
    import scanpy as sc
    from helpers.logs.get_logger import get_logger
    from helpers.logs.sc_logs import set_sc_log
    from helpers.write_de import get_de
    from matplotlib.backends.backend_pdf import PdfPages

    LOG = snakemake.log[0]  # noqa: F821
    PARAMS = snakemake.params  # noqa: F821
    INPUT = snakemake.input  # noqa: F821
    OUTPUT = snakemake.output  # noqa: F821
    THREADS = snakemake.threads  # noqa: F821

    logger = get_logger(__name__, LOG)
    sc.settings = set_sc_log(sc.settings, logfile=LOG)
    sc.settings.n_jobs = THREADS

    adata = sc.read_h5ad(INPUT["data"])
    logger.info(f"Adata read from {INPUT['data']}")
    logger.info(f"Input data: {adata}")

    # Neighbours, UMAP, and clustering
    sc.pp.neighbors(adata, use_rep="X_harmony")
    sc.tl.umap(adata)
    sc.tl.leiden(adata, resolution=PARAMS["res"])
    logger.info(
        f"{len(adata.obs.leiden.unique())} clusters identified at resolution {PARAMS['res']}"
    )

    _ = sc.pl.umap(
        adata,
        color=[
            "lane",
            "sample",
            "leiden",
            "total_counts",
            "pct_counts_mt",
            "pct_counts_ribo_prot",
        ],
        use_raw=True,
        show=False,
        save=False,
        ncols=2,
    )
    plt.savefig(OUTPUT["umap"], dpi=300, bbox_inches="tight")
    plt.close()

    # Marker genes
    if len(PARAMS["markers"]) == 0:
        Path(OUTPUT["markers"]).touch(exist_ok=True)
        logger.info("No markers provided. File touched, but empty.")
    else:
        with PdfPages(OUTPUT["markers"]) as pdf:
            for i in range(0, len(PARAMS["markers"]), 9):
                genes = PARAMS["markers"][i : i + 9]
                genes = [x for x in genes if x in adata.var_names]
                fig = sc.pl.umap(
                    adata,
                    color=genes,
                    use_raw=True,
                    return_fig=True,
                    save=False,
                    ncols=3,
                )
                pdf.savefig(fig, dpi=300, bbox_inches="tight")
                plt.close()
                logger.info(f"Markers {genes} plotted.")

    # Run dendrogram
    sc.tl.dendrogram(
        adata,
        use_rep="X_harmony",
        groupby="leiden",
        optimal_ordering=True,
    )

    # Differential expression
    # Not ranking by abs will return the 100 highest scoring upregulated genes
    sc.tl.rank_genes_groups(
        adata,
        groupby="leiden",
        use_raw=True,
        method="t-test_overestim_var",
        pts=True,
    )

    # And plot
    _ = sc.pl.rank_genes_groups_dotplot(
        adata,
        n_genes=5,
        values_to_plot="logfoldchanges",
        save=False,
        show=False,
        cmap="coolwarm",
        vmin=-4,
        vmax=4,
        colorbar_title="log2FC",
    )
    plt.savefig(OUTPUT["dot"], dpi=300, bbox_inches="tight")
    plt.close()

    # Write results
    results = get_de(adata)
    with pd.ExcelWriter(OUTPUT["de"], engine="openpyxl") as writer:
        for j, df in results:
            df.to_excel(writer, sheet_name=f"cluster_{j}")

    logger.info(f"Final adata: {adata}")
    adata.write_h5ad(OUTPUT["data"])
```
```python
if __name__ == "__main__":
    import matplotlib.pyplot as plt
    import scanpy as sc
    from helpers.logs.get_logger import get_logger
    from helpers.logs.sc_logs import set_sc_log
    from matplotlib.backends.backend_pdf import PdfPages

    LOG = snakemake.log[0]  # noqa: F821
    PARAMS = snakemake.params  # noqa: F821
    INPUT = snakemake.input  # noqa: F821
    OUTPUT = snakemake.output  # noqa: F821
    THREADS = snakemake.threads  # noqa: F821

    logger = get_logger(__name__, LOG)
    sc.settings = set_sc_log(sc.settings, logfile=LOG)
    sc.settings.n_jobs = THREADS

    adata = sc.read_h5ad(INPUT["data"])
    logger.info(f"Adata read from {INPUT['data']}")
    logger.info(f"Input data: {adata}")

    # Calculate
    with PdfPages(OUTPUT["densities"]) as pdf:
        for feature in PARAMS["features"]:
            sc.tl.embedding_density(adata, groupby=feature)
            fig = sc.pl.embedding_density(
                adata,
                groupby=feature,
                group="all",
                color_map="plasma",
                ncols=3,
                save=False,
                return_fig=True,
            )
            fig.suptitle(f"Densities for feature: {feature}")
            pdf.savefig(fig, dpi=300, bbox_inches="tight")
            plt.close()
            logger.info(f"Feature {feature} plotted")

    # Save data
    logger.info(f"Final data: {adata}")
    adata.write_h5ad(OUTPUT["data"])
```
```python
if __name__ == "__main__":
    import anndata as ad
    import matplotlib.pyplot as plt
    import scanpy as sc
    import scanpy.external as sce
    from helpers.logs.get_logger import get_logger
    from helpers.logs.sc_logs import set_sc_log
    from helpers.select_pcs import select_pcs

    LOG = snakemake.log[0]  # noqa: F821
    PARAMS = snakemake.params  # noqa: F821
    INPUT = snakemake.input  # noqa: F821
    OUTPUT = snakemake.output  # noqa: F821
    THREADS = snakemake.threads  # noqa: F821

    logger = get_logger(__name__, LOG)
    sc.settings = set_sc_log(sc.settings, logfile=LOG)
    sc.settings.n_jobs = THREADS

    # Concatenate samples
    adata = ad.concat(
        [sc.read_h5ad(path) for path in INPUT["data"]],
        join="outer",
        merge="same",
        label=None,
    )
    adata.obs_names_make_unique()
    logger.info(f"Adata read from {INPUT['data']}")
    logger.info(f"Input data: {adata}")

    # HVGs
    # Before normalisation as seurat_v3 expects raw counts
    if not PARAMS["nHVG"]:
        nHVG = int(max(min(len(adata.obs) / 2, 10000), 1000))
        logger.info(f"nHVG not provided. Using {nHVG}.")
    else:
        nHVG = PARAMS["nHVG"]
    sc.pp.highly_variable_genes(
        adata,
        n_top_genes=nHVG,
        flavor="seurat_v3",
        batch_key="lane",
        subset=False,
    )
    _ = sc.pl.highly_variable_genes(
        adata,
        log=False,
        show=False,
        save=False,
    )
    plt.savefig(OUTPUT["hvg"], dpi=300, bbox_inches="tight")
    plt.close()

    # Normalise
    # Exclude highly expressed to prevent skew of normalisation
    sc.pp.normalize_total(adata, exclude_highly_expressed=True)
    sc.pp.log1p(adata)

    # Save raw and filter
    adata.raw = adata

    # Regress and scale
    # No batch key here - batch effects are corrected below with harmony
    sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"], n_jobs=None)
    sc.pp.scale(adata, max_value=10)

    # PCA
    sc.tl.pca(adata, n_comps=50, use_highly_variable=True)
    _ = sc.pl.pca_variance_ratio(adata, n_pcs=50, show=False, save=False)
    plt.savefig(OUTPUT["elbow"], dpi=300, bbox_inches="tight")
    plt.close()

    # Harmony for batch correction
    # As harmony runs on all PCs included, we must first filter to the desired npc
    adata, npc = select_pcs(adata, threshold=PARAMS["var_thresh"])
    logger.info(f"{npc} PCs used.")
    sce.pp.harmony_integrate(
        adata,
        key="lane",
        adjusted_basis="X_harmony",
        max_iter_harmony=50,
    )

    # And save
    adata.write_h5ad(OUTPUT["data"])
```
```python
if __name__ == "__main__":
    import sys

    import anndata as ad
    import matplotlib.pyplot as plt
    import pandas as pd
    import scanpy as sc
    import scvi
    from helpers.logs.get_logger import get_logger
    from pandas.plotting import table

    LOG = snakemake.log[0]  # noqa: F821
    PARAMS = snakemake.params  # noqa: F821
    INPUT = snakemake.input  # noqa: F821
    OUTPUT = snakemake.output  # noqa: F821
    WILDS = snakemake.wildcards  # noqa: F821

    logger = get_logger(__name__, LOG)
    scvi.settings.seed = 42

    # Get data
    adata = ad.concat(
        [sc.read_h5ad(path) for path in INPUT["data"]],
        join="outer",
        merge="same",
        label=None,
    )
    adata.obs_names_make_unique()
    adata.obs["lane"] = str(WILDS["lane"])
    logger.info(f"Adata read from {INPUT['data']}")
    logger.info(f"Input data: {adata}")

    # SCVI doesn't seem to have a way to specify where the output logs to,
    # so we have to wrap and redirect
    with open(LOG, "a") as file:
        sys.stderr = sys.stdout = file

        # No covariates - not used with SOLO
        scvi.data.setup_anndata(
            adata,
        )
        vae = scvi.model.SCVI(adata)
        vae.train(
            early_stopping=True, check_val_every_n_epoch=2, early_stopping_patience=20
        )

        solo = scvi.external.SOLO.from_scvi_model(
            vae,
        )
        solo.train(train_size=0.9, early_stopping=True, early_stopping_patience=20)
        preds = solo.predict(soft=True)

    # SCVI appends a random -0 to the index...
    adata.obs.index = adata.obs.index + "-0"
    adata.obs["singlet_score"] = preds.loc[:, "singlet"]
    adata.obs["doublet_score"] = preds.loc[:, "doublet"]

    # Save a results table
    preds = (
        pd.DataFrame.from_dict(
            {
                "Sample": adata.obs["sample"],
                "Singlet": adata.obs["singlet_score"] > adata.obs["doublet_score"],
                "Doublet": adata.obs["singlet_score"] < adata.obs["doublet_score"],
            }
        )
        .groupby("Sample")
        .sum()
    )
    ax = plt.subplot(111, frame_on=False)
    ax.xaxis.set_visible(False)
    ax.yaxis.set_visible(False)
    _ = table(ax, preds, cellLoc="center")
    plt.savefig(OUTPUT["table"], dpi=300, bbox_inches="tight")
    plt.close()

    # Filter
    adata = adata[adata.obs.singlet_score > adata.obs.doublet_score, :]
    logger.info(f"Data after doublet calls: {adata}")

    # Clean up after scvi
    del adata.uns
    adata.obs = adata.obs.loc[:, ~adata.obs.columns.str.startswith("_")].copy()
    adata.obs["sample"] = adata.obs["sample"].astype("category")
    adata.obs["lane"] = adata.obs["lane"].astype("category")

    adata.write_h5ad(OUTPUT["data"])
    logger.info(f"Data saved to {OUTPUT['data']}")
```
```r
packages <- c("DropletUtils", "logger", "Matrix")
invisible(
  suppressPackageStartupMessages(
    lapply(packages, library, character.only = TRUE)
  )
)
set.seed(0)

# Configure logger
style <- layout_glue_generator("{time} :: {level} :: {namespace} :: {msg}")
logfile <- snakemake@log[[1]]
log_appender(appender_file(logfile))
log_layout(style)
log_threshold(INFO)

# Read in data
# readMM fails with floats, so we hack around it in python
counts <- readMM(snakemake@input[["mtx"]])
log_success("Data read from {snakemake@input[['mtx']]}")
log_info("Dimensions: {dim(counts)}")

totals <- colSums(counts)
totals <- totals[totals > 25]
totals <- totals[totals < 2000]

# The bound beneath which all droplets are "empty"
lower <- mean(totals) + (2 * sd(totals))
log_info("Lower threshold: {lower}")

# Tail end plot
png(snakemake@output[["tail"]], res = 300, height = 4, width = 4, units = "in")
invisible(hist(totals, breaks = 500, xlab = "nUMI", main = NULL, col = "black"))
invisible(abline(v = lower, col = "darkgreen", lty = 2))
invisible(dev.off())
log_success("Tail end plot saved at {snakemake@output[['tail']]}")

# Knee plot
bcrank <- barcodeRanks(counts)
uniq <- !duplicated(bcrank$rank)
png(snakemake@output[["knee"]], res = 300, height = 4, width = 4, units = "in")
invisible(plot(bcrank$rank[uniq], bcrank$total[uniq],
  log = "xy",
  xlab = "Rank", ylab = "Total UMI count", cex.lab = 1.2
))
invisible(abline(h = lower, col = "darkgreen", lty = 2))
invisible(dev.off())
log_success("Knee plot saved at {snakemake@output[['knee']]}")

# And run emptyDrops
results <- emptyDrops(
  counts,
  lower = lower,
  niters = snakemake@params[["niters"]]
)
write.csv(results, snakemake@output[["empty"]])
log_success("Data saved at {snakemake@output[['empty']]}")
```
```python
if __name__ == "__main__":
    from pathlib import Path

    import matplotlib.pyplot as plt
    import pandas as pd
    import scanpy as sc
    from helpers.logs.get_logger import get_logger
    from helpers.logs.sc_logs import set_sc_log

    LOG = snakemake.log[0]  # noqa: F821
    PARAMS = snakemake.params  # noqa: F821
    INPUT = snakemake.input  # noqa: F821
    OUTPUT = snakemake.output  # noqa: F821
    WILDS = snakemake.wildcards  # noqa: F821

    logger = get_logger(__name__, LOG)
    sc.settings = set_sc_log(sc.settings, logfile=LOG)

    adata = sc.read_10x_mtx(INPUT["dir"], make_unique=True)
    adata.obs_names_make_unique()
    logger.info(f"Adata read from {INPUT['dir']}")
    logger.info(f"Input data: {adata}")

    empty = pd.read_csv(INPUT["empty"], index_col=0)
    empty.index = adata.obs_names
    empty.FDR = empty.FDR.fillna(1)
    logger.info(f"Empty read from {INPUT['empty']}")

    # Transfer call
    adata.obs["is_cell"] = empty.FDR < 0.005

    # Initial filtering
    adata = adata[adata.obs["is_cell"], :]
    sc.pp.filter_genes(adata, min_cells=1, inplace=True)

    # Mark mitochondrial and ribosomal genes
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    # str.startswith takes a tuple to match multiple prefixes
    adata.var["ribo_prot"] = adata.var_names.str.startswith(("RPS", "RPL"))

    # Calculate qc metrics
    sc.pp.calculate_qc_metrics(
        adata, qc_vars=["mt", "ribo_prot"], percent_top=None, log1p=False, inplace=True
    )

    # Various plots
    # Specify various sc figure settings
    sc.set_figure_params(dpi_save=300, fontsize=12, figsize=(6, 6), format="png")
    sc.settings.figdir = Path(".")
    _ = sc.pl.violin(
        adata,
        [
            "total_counts",
            "n_genes_by_counts",
            "pct_counts_mt",
        ],
        multi_panel=True,
        show=False,
        save=False,
    )
    plt.savefig(OUTPUT["qc_violin"], dpi=300, bbox_inches="tight")
    plt.close()

    _ = sc.pl.scatter(
        adata,
        "total_counts",
        "pct_counts_mt",
        show=False,
        save=False,
    )
    plt.savefig(OUTPUT["mt_scatter"], dpi=300, bbox_inches="tight")
    plt.close()

    _ = sc.pl.scatter(
        adata,
        "total_counts",
        "n_genes_by_counts",
        show=False,
        save=False,
    )
    plt.savefig(OUTPUT["n_genes"], dpi=300, bbox_inches="tight")
    plt.close()

    # Filter cells
    adata = adata[adata.obs.pct_counts_mt < PARAMS["pct_counts_mt"], :]
    adata = adata[adata.obs.total_counts <= PARAMS["total_counts"], :]
    adata = adata[
        adata.obs.n_genes_by_counts >= PARAMS["n_genes_by_counts"],
        :,
    ]

    # Filter genes
    # If introns, then single nucleus -> mt is artefact, so drop
    if PARAMS["introns"]:
        adata = adata[:, ~adata.var.mt]

    # Final plot...
    _ = sc.pl.highest_expr_genes(adata, n_top=50, show=False, save=False)
    plt.savefig(OUTPUT["n_top"], dpi=300, bbox_inches="tight")
    plt.close()

    # Mark and save
    adata.obs["sample"] = WILDS["sample"]
    adata.write_h5ad(OUTPUT["data"])
    logger.info(f"Filtered data: {adata}")
    logger.info(f"Data saved to {OUTPUT['data']}")
```
```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""
    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)
    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
```
```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
```