A Snakemake workflow to process scRNA-seq data from 10x Genomics
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
A Snakemake workflow to process scRNA-seq data from 10x Genomics
Contents
Overview
Chromium is a Snakemake workflow to process 3' single cell RNA sequencing data from the 10x Genomics platform. It is compatible with 10xv2 and 10xv3 chemistry and features three different quantification methods to obtain both spliced and unspliced abundance estimates:
Installation
Chromium and all of its dependencies can be installed via the mamba package manager:
-
Install Snakemake and Snakedeploy
$ mamba create -c bioconda -c conda-forge --name snakemake snakemake snakedeploy
-
Activate the Snakemake environment
$ mamba activate snakemake
-
Create a project directory
$ mkdir -p path/to/project
-
Deploy the workflow in the project directory
$ snakedeploy deploy-workflow https://github.com/snakemake-workflows/chromium path/to/project
Usage
-
Create workflow configuration
$ vim config/config.yaml # workflow parameters
-
Create samples table
$ vim config/samples.csv # sample metadata
-
Create units table
$ vim config/units.csv # unit metdata
-
Test configuration by performing a dry-run
$ snakemake -n
-
Execute workflow and deploy software dependencies
$ snakemake --cores all --use-conda
For more information, see the Usage section of the documentation.
Documentation
Full documentation for Chromium is available here
Support
If you need help, open an issue with one of the following labels:
-
help wanted (extra attention is needed)
-
question (further information is requested)
Feedback
If you have any suggestions, open an issue with one of the following labels:
-
documentation (improvements or additions to documentation)
-
enhancement (new feature or request)
Contributing
To contribute to Chromium, clone this repository locally and commit your code on a separate branch. Please generate unit tests for your code and run the linter before opening a pull request:
$ snakemake --generate-unit-tests # generate unit tests
$ snakemake --lint # run the linter
You can find more details in the Contributing guide.
Participation in this project is subject to a Code of Conduct .
Authors
Chromium was developed by James Ashmore but has benefited from contributions by the following:
If you would like to be added to this list, please open a pull request with your contribution.
Citation
If you use Chromium in your research, please cite using the DOI: 10.5281/zenodo.4783308
Used By
Chromium is used by the following companies and institutes:
If you would like to be added to this list, please open a pull request with your information.
Acknowledgements
The methods were chosen based on this research paper:
Soneson C, Srivastava A, Patro R, Stadler MB (2021) Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. PLoS Comput Biol 17(1): e1008585. https://doi.org/10.1371/journal.pcbi.1008585
The workflow was motivated by the following projects:
The documentation was informed by the following articles:
License
Chromium is licensed under the
MIT
license.
Copyright © 2020, James Ashmore
Code Snippets
26 27 | script: "../scripts/get_velocity_files.R" |
42 43 | script: "../scripts/read_velocity_output.R" |
18 19 | shell: "bustools correct -o {output.bus} -w {input.txt} {input.bus} 2> {log}" |
32 33 | shell: "bustools sort -t {threads} -o {output.bus} {input.bus} 2> {log}" |
49 50 | shell: "bustools capture -s -x -o {output.bus} -c {input.txt} -e {input.mat} -t {input.txi} {input.bus} 2> {log}" |
66 67 | shell: "bustools capture -s -x -o {output.bus} -c {input.txt} -e {input.mat} -t {input.txi} {input.bus} 2> {log}" |
85 86 | shell: "bustools count -o {params.out} -g {input.tsv} -e {input.mat} -t {input.txt} --genecounts {input.bus} 2> {log}" |
104 105 | shell: "bustools count -o {params.out} -g {input.tsv} -e {input.mat} -t {input.txt} --genecounts {input.bus} 2> {log}" |
20 21 | script: "../scripts/eisar_ranges.R" |
36 37 | script: "../scripts/eisar_fa.R" |
51 52 | script: "../scripts/eisar_gtf.R" |
66 67 | script: "../scripts/eisar_tsv.R" |
81 82 | script: "../scripts/eisar_tx2gene.R" |
23 24 | shell: "genomepy install -g {params} -a {wildcards.genome} 2> {log}" |
37 38 | shell: "gunzip -c {input} > {output}" |
17 18 | shell: "gffread {input} --table transcript_id,gene_id | sort -u > {output} 2> {log}" |
31 32 | shell: "gffread {input} --table gene_id,gene_name | sort -u > {output} 2> {log}" |
45 46 | shell: "gffread {input} --table @chr,gene_id | awk '$1 == MT' | cut -f 2 | sort -u > {output} 2> {log}" |
59 60 | shell: "gffread {input} --table gene_biotype,gene_id | awk '$1 == rRNA' | cut -f 2 | sort -u > {output} 2> {log}" |
17 18 19 20 21 | shell: "kallisto index" " --index {output.idx}" " {input.fas}" " 2> {log}" |
45 46 47 48 49 50 51 52 | shell: "kallisto bus" " --index {input.idx}" " --output-dir {params.out}" " --technology {params.ver}" " --threads {threads}" " {params.fqz}" " 2> {log}" |
17 18 | shell: "curl {params.url} > {output.txt} 2> {log.err}" |
31 32 | shell: "curl {params.url} > {output.txt} 2> {log.err}" |
45 46 | shell: "curl {params.url} | gunzip -c > {output.txt} 2> {log.err}" |
16 17 | shell: "cat {input} > {output} 2> {log}" |
28 29 | shell: "cut -f 1 {input} > {output} 2> {log}" |
46 47 48 49 50 51 52 53 54 | shell: "salmon index" " --transcripts {input.transcripts}" " --index {output.index}" " --threads {threads}" " --decoys {input.decoys}" " --keepDuplicates" " 1> {log.out}" " 2> {log.err}" |
77 78 79 80 81 82 83 84 85 86 87 88 89 90 | shell: "salmon alevin" " --libType ISR" " --index {input.index}" " --mates1 {input.mates1}" " --mates2 {input.mates2}" " {params.arg[protocol]}" " --output {output.output}" " --threads {threads}" " --tgMap {input.tgMap}" " --mrna {input.mrna}" " --rrna {input.rrna}" " 1> {log.out}" " 2> {log.err}" |
21 22 | script: "../scripts/singlecellexperiment_kallisto.R" |
38 39 | script: "../scripts/singlecellexperiment_salmon.R" |
53 54 | script: "../scripts/singlecellexperiment_star.R" |
22 23 24 25 26 27 28 29 30 | shell: "STAR" " --runMode genomeGenerate" " --runThreadN {threads}" " --genomeDir {output.dir}" " --genomeFastaFiles {input.fas}" " --sjdbGTFfile {input.gtf}" " --sjdbOverhang {params.arg[sjdbOverhang]}" " 2> {log}" |
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 | shell: "STAR" " --runMode alignReads" " --runThreadN {threads}" " --genomeDir {input.idx}" " --sjdbGTFfile {input.gtf}" " --readFilesIn {params.fq2} {params.fq1}" " --readFilesCommand gunzip -c" " --outFileNamePrefix {params.out}" " --outSAMtype BAM SortedByCoordinate" " --soloType CB_UMI_Simple" " --soloCBwhitelist {input.txt}" " --soloFeatures Gene Velocyto" " --soloCBstart {params.arg[soloCBstart]}" " --soloCBlen {params.arg[soloCBlen]}" " --soloUMIstart {params.arg[soloUMIstart]}" " --soloUMIlen {params.arg[soloUMIlen]}" " --soloBarcodeReadLength {params.arg[soloBarcodeReadLength]}" " 2> {log}" |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | main <- function(input, output, log) { # Log function out <- file(log$out, open = "wt") err <- file(log$err, open = "wt") sink(out, type = "output") sink(err, type = "message") # Script function library(GenomicFeatures) library(Rsamtools) grl <- readRDS(input$rds) con <- FaFile(input$fas) seq <- extractTranscriptSeqs(con, grl) writeXStringSet(seq, output$fas) } main(snakemake@input, snakemake@output, snakemake@log) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | main <- function(input, output, log) { # Log function out <- file(log$out, open = "wt") err <- file(log$err, open = "wt") sink(out, type = "output") sink(err, type = "message") # Script function library(eisaR) grl <- readRDS(input$rds) exportToGtf(grl, filepath = output$gtf) } main(snakemake@input, snakemake@output, snakemake@log) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | main <- function(input, output, params, log) { # Log function out <- file(log$out, open = "wt") err <- file(log$err, open = "wt") sink(out, type = "output") sink(err, type = "message") # Script function library(eisaR) len <- switch(params$chemistry, "10xv1" = 98, "10xv2" = 98, "10xv3" = 91) grl <- getFeatureRanges( gtf = input$gtf, featureType = c("spliced", "intron"), intronType = "separate", flankLength = len, joinOverlappingIntrons = FALSE, verbose = TRUE ) saveRDS(grl, file = output$rds) } main(snakemake@input, snakemake@output, snakemake@params, snakemake@log) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | main <- function(input, output, params, log) { # Log function out <- file(log$out, open = "wt") err <- file(log$err, open = "wt") sink(out, type = "output") sink(err, type = "message") # Script function library(GenomicRanges) grl <- readRDS(input$rds) write.table( x = metadata(grl)$corrgene, file = output$tsv, quote = FALSE, sep = "\t", row.names = FALSE ) } main(snakemake@input, snakemake@output, snakemake@params, snakemake@log) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | main <- function(input, output, params, log) { # Log function out <- file(log$out, open = "wt") err <- file(log$err, open = "wt") sink(out, type = "output") sink(err, type = "message") # Script function library(eisaR) grl <- readRDS(input$rds) getTx2Gene(grl, filepath = output$tsv) } main(snakemake@input, snakemake@output, snakemake@params, snakemake@log) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | main <- function(input, params, log) { # Log function out <- file(log$out, open = "wt") err <- file(log$err, open = "wt") sink(out, type = "output") sink(err, type = "message") # Script function library(Biostrings) library(BUSpaRse) len <- switch(params$chemistry, "10xv1" = 98, "10xv2" = 98, "10xv3" = 91) dna <- readDNAStringSet(input$fas) names(dna) <- sapply(strsplit(names(dna), " "), .subset, 1) get_velocity_files( X = input$gtf, L = len, Genome = dna, out_path = params$out_path, style = params$style, transcript_version = NULL, gene_version = NULL ) } main(snakemake@input, snakemake@params, snakemake@log) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | main <- function(input, output, log) { # Log function out <- file(log$out, open = "wt") err <- file(log$err, open = "wt") sink(out, type = "output") sink(err, type = "message") # Script function library(BUSpaRse) spliced_mtx <- sub('\\.mtx$', '', input$mtx[1]) unspliced_mtx <- sub('\\.mtx$', '', input$mtx[2]) out <- read_velocity_output( spliced_dir = dirname(spliced_mtx), unspliced_dir = dirname(unspliced_mtx), spliced_name = basename(spliced_mtx), unspliced_name = basename(unspliced_mtx) ) row <- intersect(rownames(out$spliced), rownames(out$unspliced)) col <- intersect(colnames(out$spliced), colnames(out$unspliced)) out$spliced <- out$spliced[row, col] out$unspliced <- out$unspliced[row, col] saveRDS(out, file = output$rds) } main(snakemake@input, snakemake@output, snakemake@log) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 | main <- function(input, output, log, wildcards) { # Log function out <- file(log$out, open = "wt") err <- file(log$err, open = "wt") sink(out, type = "output") sink(err, type = "message") # Script function library(SingleCellExperiment) mat <- readRDS(input$rds) ann <- read.delim(input$tsv, header = FALSE, col.names = c("id", "name")) dim <- list(i = rownames(mat$spliced), j = colnames(mat$spliced)) ind <- match(dim$i, ann$id) ann <- ann[ind, ] sce <- SingleCellExperiment( assays = list( counts = mat$spliced, spliced = mat$spliced, unspliced = mat$unspliced ), rowData = DataFrame( ID = ann$id, Symbol = ann$name ), colData = DataFrame( Sample = wildcards$sample, Barcode = dim$j ) ) saveRDS(sce, file = output$rds) } main(snakemake@input, snakemake@output, snakemake@log, snakemake@wildcards) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | main <- function(input, output, log, wildcards) { # Log function out <- file(log$out, open = "wt") err <- file(log$err, open = "wt") sink(out, type = "output") sink(err, type = "message") # Load Bioconductor packages library(tximport) library(tximeta) library(SummarizedExperiment) library(SingleCellExperiment) # Import salmon quantification con <- file.path(input$dir, "alevin", "quants_mat.gz") txi <- tximport(con, type = "alevin", dropInfReps = FALSE) rse <- SummarizedExperiment(assays = list(counts = round(txi$counts))) # Split by spliced and unspliced dat <- read.delim(input$tsv[1], header = TRUE, col.names = c("spliced", "unspliced")) rse <- splitSE(rse, dat, assayName = "counts") # Import and match annotation ann <- read.delim(input$tsv[2], header = FALSE, col.names = c("id", "name")) ann <- ann[match(rownames(rse), ann$id), ] # Create SCE object sce <- SingleCellExperiment( assays = list( counts = assay(rse, "spliced"), spliced = assay(rse, "spliced"), unspliced = assay(rse, "unspliced") ), rowData = DataFrame( ID = ann$id, Symbol = ann$name ), colData = DataFrame( Sample = wildcards$sample, Barcode = colnames(rse) ) ) # Save SCE object saveRDS(sce, file = output$rds) } main(snakemake@input, snakemake@output, snakemake@log, snakemake@wildcards) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 | main <- function(input, output, params, log, wildcards) { # Setup logging out <- file(log$out, open = "wt") err <- file(log$err, open = "wt") sink(out, type = "output") sink(err, type = "message") # Define directory dirname <- input$dir dirname <- file.path(dirname, "Solo.out", "Velocyto", "raw") # Read features names <- c("id", "name", "type") features <- file.path(dirname, "features.tsv") features <- data.table::fread(features, header = FALSE, col.names = names) # Read barcodes names <- c("id") barcodes <- file.path(dirname, "barcodes.tsv") barcodes <- data.table::fread(barcodes, header = FALSE, col.names = names) # Read spliced matrix names <- c("feature", "barcode", "count") spliced <- file.path(dirname, "spliced.mtx") spliced <- data.table::fread(spliced, header = FALSE, skip = 2, col.names = names) # Read unspliced matrix names <- c("feature", "barcode", "count") unspliced <- file.path(dirname, "unspliced.mtx") unspliced <- data.table::fread(unspliced, header = FALSE, skip = 2, col.names = names) # Set dimensions shape <- c(nrow(features), nrow(barcodes)) names <- list(features$id, barcodes$id) # Create object sce <- SingleCellExperiment::SingleCellExperiment( assays = list( counts = Matrix::sparseMatrix( i = spliced$feature, j = spliced$barcode, x = spliced$count, dims = shape, dimnames = names ), spliced = Matrix::sparseMatrix( i = spliced$feature, j = spliced$barcode, x = spliced$count, dims = shape, dimnames = names ), unspliced = Matrix::sparseMatrix( i = unspliced$feature, j = unspliced$barcode, x = unspliced$count, dims = shape, dimnames = names ) ), rowData = S4Vectors::DataFrame( ID = features$id, Symbol = features$name ), colData = S4Vectors::DataFrame( Sample = wildcards$sample, Barcode = barcodes$id ) ) # Save object saveRDS(sce, file = output$rds) } main(snakemake@input, snakemake@output, snakemake@params, snakemake@log, snakemake@wildcards) |
Support
- Future updates
Related Workflows





