A Snakemake workflow to process scRNA-seq data from 10x Genomics

public 1yr ago Version: 3 0 bookmarks

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

A Snakemake workflow to process scRNA-seq data from 10x Genomics

Overview
Installation
Usage
Documentation
Contributing
Authors
Tests
Acknowledgements
License

Overview

Chromium is a Snakemake workflow to process 3' single cell RNA sequencing data from the 10x Genomics platform. It is compatible with 10xv2 and 10xv3 chemistry and features three different quantification methods to obtain both spliced and unspliced abundance estimates:

Installation

Chromium and all of its dependencies can be installed via the mamba package manager:

Install Snakemake and Snakedeploy

$ mamba create -c bioconda -c conda-forge --name snakemake snakemake snakedeploy

Activate the Snakemake environment
```
$ mamba activate snakemake
```
Create a project directory
```
$ mkdir -p path/to/project
```

Deploy the workflow in the project directory

$ snakedeploy deploy-workflow https://github.com/snakemake-workflows/chromium path/to/project

Usage

Create workflow configuration

$ vim config/config.yaml # workflow parameters

Create samples table

$ vim config/samples.csv # sample metadata

Create units table
```
$ vim config/units.csv # unit metdata
```
Test configuration by performing a dry-run
```
$ snakemake -n
```
Execute workflow and deploy software dependencies
```
$ snakemake --cores all --use-conda
```
For more information, see the Usage section of the documentation.

Documentation

Full documentation for Chromium is available here

Support

If you need help, open an issue with one of the following labels:

help wanted (extra attention is needed)
question (further information is requested)

Feedback

If you have any suggestions, open an issue with one of the following labels:

documentation (improvements or additions to documentation)
enhancement (new feature or request)

Contributing

To contribute to Chromium, clone this repository locally and commit your code on a separate branch. Please generate unit tests for your code and run the linter before opening a pull request:

$ snakemake --generate-unit-tests # generate unit tests
$ snakemake --lint # run the linter

You can find more details in the Contributing guide.

Participation in this project is subject to a Code of Conduct .

Authors

Chromium was developed by James Ashmore but has benefited from contributions by the following:

If you would like to be added to this list, please open a pull request with your contribution.

Citation

If you use Chromium in your research, please cite using the DOI: 10.5281/zenodo.4783308

Used By

Chromium is used by the following companies and institutes:

The Centre for Regenerative Medicine (The University of Edinburgh)

If you would like to be added to this list, please open a pull request with your information.

Acknowledgements

The methods were chosen based on this research paper:

Soneson C, Srivastava A, Patro R, Stadler MB (2021) Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. PLoS Comput Biol 17(1): e1008585. https://doi.org/10.1371/journal.pcbi.1008585

The workflow was motivated by the following projects:

The documentation was informed by the following articles:

License

Chromium is licensed under the MIT license.
Copyright © 2020, James Ashmore

Code Snippets

script:
    "../scripts/get_velocity_files.R"

SnakeMake From line 26 of rules/busparse.smk

script:
    "../scripts/read_velocity_output.R"

SnakeMake From line 42 of rules/busparse.smk

shell:
    "bustools correct -o {output.bus} -w {input.txt} {input.bus} 2> {log}"

SnakeMake From line 18 of rules/bustools.smk

shell:
    "bustools sort -t {threads} -o {output.bus} {input.bus} 2> {log}"

SnakeMake From line 32 of rules/bustools.smk

shell:
    "bustools capture -s -x -o {output.bus} -c {input.txt} -e {input.mat} -t {input.txi} {input.bus} 2> {log}"

SnakeMake From line 49 of rules/bustools.smk

shell:
    "bustools capture -s -x -o {output.bus} -c {input.txt} -e {input.mat} -t {input.txi} {input.bus} 2> {log}"

SnakeMake From line 66 of rules/bustools.smk

shell:
    "bustools count -o {params.out} -g {input.tsv} -e {input.mat} -t {input.txt} --genecounts {input.bus} 2> {log}"

SnakeMake From line 85 of rules/bustools.smk

shell:
    "bustools count -o {params.out} -g {input.tsv} -e {input.mat} -t {input.txt} --genecounts {input.bus} 2> {log}"

SnakeMake From line 104 of rules/bustools.smk

script:
    "../scripts/eisar_ranges.R"

SnakeMake From line 20 of rules/eisar.smk

script:
    "../scripts/eisar_fa.R"

SnakeMake From line 36 of rules/eisar.smk

script:
    "../scripts/eisar_gtf.R"

SnakeMake From line 51 of rules/eisar.smk

script:
    "../scripts/eisar_tsv.R"

SnakeMake From line 66 of rules/eisar.smk

script:
    "../scripts/eisar_tx2gene.R"

SnakeMake From line 81 of rules/eisar.smk

shell:
    "genomepy install -g {params} -a {wildcards.genome} 2> {log}"

SnakeMake genomepy From line 23 of rules/genomepy.smk

shell:
    "gunzip -c {input} > {output}"

SnakeMake From line 37 of rules/genomepy.smk

shell:
    "gffread {input} --table transcript_id,gene_id | sort -u > {output} 2> {log}"

SnakeMake gffread From line 17 of rules/gffread.smk

shell:
    "gffread {input} --table gene_id,gene_name | sort -u > {output} 2> {log}"

SnakeMake gffread From line 31 of rules/gffread.smk

shell:
    "gffread {input} --table @chr,gene_id | awk '$1 == MT' | cut -f 2 | sort -u > {output} 2> {log}"

SnakeMake gffread From line 45 of rules/gffread.smk

shell:
    "gffread {input} --table gene_biotype,gene_id | awk '$1 == rRNA' | cut -f 2 | sort -u > {output} 2> {log}"

SnakeMake gffread From line 59 of rules/gffread.smk

shell:
    "kallisto index"
    " --index {output.idx}"
    " {input.fas}"
    " 2> {log}"

SnakeMake kallisto From line 17 of rules/kallisto.smk

shell:
    "kallisto bus"
    " --index {input.idx}"
    " --output-dir {params.out}"
    " --technology {params.ver}"
    " --threads {threads}"
    " {params.fqz}"
    " 2> {log}"

SnakeMake kallisto From line 45 of rules/kallisto.smk

shell:
    "curl {params.url} > {output.txt} 2> {log.err}"

SnakeMake From line 17 of rules/resources.smk

shell:
    "curl {params.url} > {output.txt} 2> {log.err}"

SnakeMake From line 31 of rules/resources.smk

shell:
    "curl {params.url} | gunzip -c > {output.txt} 2> {log.err}"

SnakeMake From line 45 of rules/resources.smk

shell:
    "cat {input} > {output} 2> {log}"

SnakeMake From line 16 of rules/salmon.smk

shell:
    "cut -f 1 {input} > {output} 2> {log}"

SnakeMake From line 28 of rules/salmon.smk

shell:
    "salmon index"
    " --transcripts {input.transcripts}"
    " --index {output.index}"
    " --threads {threads}"
    " --decoys {input.decoys}"
    " --keepDuplicates"
    " 1> {log.out}"
    " 2> {log.err}"

SnakeMake Salmon From line 46 of rules/salmon.smk

shell:
    "salmon alevin"
    " --libType ISR"
    " --index {input.index}"
    " --mates1 {input.mates1}"
    " --mates2 {input.mates2}"
    " {params.arg[protocol]}"
    " --output {output.output}"
    " --threads {threads}"
    " --tgMap {input.tgMap}"
    " --mrna {input.mrna}"
    " --rrna {input.rrna}"
    " 1> {log.out}"
    " 2> {log.err}"

SnakeMake Salmon From line 77 of rules/salmon.smk

script:
    "../scripts/singlecellexperiment_kallisto.R"

SnakeMake From line 21 of rules/singlecellexperiment.smk

script:
    "../scripts/singlecellexperiment_salmon.R"

SnakeMake From line 38 of rules/singlecellexperiment.smk

script:
    "../scripts/singlecellexperiment_star.R"

SnakeMake From line 53 of rules/singlecellexperiment.smk

shell:
    "STAR"
    " --runMode genomeGenerate"
    " --runThreadN {threads}"
    " --genomeDir {output.dir}"
    " --genomeFastaFiles {input.fas}"
    " --sjdbGTFfile {input.gtf}"
    " --sjdbOverhang {params.arg[sjdbOverhang]}"
    " 2> {log}"

SnakeMake STAR From line 22 of rules/star.smk

shell:
    "STAR"
    " --runMode alignReads"
    " --runThreadN {threads}"
    " --genomeDir {input.idx}"
    " --sjdbGTFfile {input.gtf}"
    " --readFilesIn {params.fq2} {params.fq1}"
    " --readFilesCommand gunzip -c"
    " --outFileNamePrefix {params.out}"
    " --outSAMtype BAM SortedByCoordinate"
    " --soloType CB_UMI_Simple"
    " --soloCBwhitelist {input.txt}"
    " --soloFeatures Gene Velocyto"
    " --soloCBstart {params.arg[soloCBstart]}"
    " --soloCBlen {params.arg[soloCBlen]}"
    " --soloUMIstart {params.arg[soloUMIstart]}"
    " --soloUMIlen {params.arg[soloUMIlen]}"
    " --soloBarcodeReadLength {params.arg[soloBarcodeReadLength]}"
    " 2> {log}"

SnakeMake STAR From line 54 of rules/star.smk

main <- function(input, output, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(GenomicFeatures)

    library(Rsamtools)

    grl <- readRDS(input$rds)

    con <- FaFile(input$fas)

    seq <- extractTranscriptSeqs(con, grl)

    writeXStringSet(seq, output$fas)

}

main(snakemake@input, snakemake@output, snakemake@log)

R Rsamtools From line 6 of scripts/eisar_fa.R

main <- function(input, output, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(eisaR)

    grl <- readRDS(input$rds)

    exportToGtf(grl, filepath = output$gtf)

}

main(snakemake@input, snakemake@output, snakemake@log)

R eisaR From line 6 of scripts/eisar_gtf.R

main <- function(input, output, params, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(eisaR)

    len <- switch(params$chemistry, "10xv1" = 98, "10xv2" = 98, "10xv3" = 91)

    grl <- getFeatureRanges(
        gtf = input$gtf,
        featureType = c("spliced", "intron"),
        intronType = "separate",
        flankLength = len,
        joinOverlappingIntrons = FALSE,
        verbose = TRUE
    )

    saveRDS(grl, file = output$rds)

}

main(snakemake@input, snakemake@output, snakemake@params, snakemake@log)

R eisaR From line 6 of scripts/eisar_ranges.R

main <- function(input, output, params, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(GenomicRanges)

    grl <- readRDS(input$rds)

    write.table(
        x = metadata(grl)$corrgene,
        file = output$tsv,
        quote = FALSE,
        sep = "\t",
        row.names = FALSE
    )

}

main(snakemake@input, snakemake@output, snakemake@params, snakemake@log)

R GenomicRanges From line 6 of scripts/eisar_tsv.R

main <- function(input, output, params, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(eisaR)

    grl <- readRDS(input$rds)

    getTx2Gene(grl, filepath = output$tsv)

}

main(snakemake@input, snakemake@output, snakemake@params, snakemake@log)

R eisaR From line 6 of scripts/eisar_tx2gene.R

main <- function(input, params, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(Biostrings)

    library(BUSpaRse)

    len <- switch(params$chemistry, "10xv1" = 98, "10xv2" = 98, "10xv3" = 91)

    dna <- readDNAStringSet(input$fas)

    names(dna) <- sapply(strsplit(names(dna), " "), .subset, 1)

    get_velocity_files(
        X = input$gtf,
        L = len,
        Genome = dna,
        out_path = params$out_path,
        style = params$style,
        transcript_version = NULL,
        gene_version = NULL
    )

}

main(snakemake@input, snakemake@params, snakemake@log)

R BUSpaRse From line 6 of scripts/get_velocity_files.R

main <- function(input, output, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(BUSpaRse)

    spliced_mtx <- sub('\\.mtx$', '', input$mtx[1])

    unspliced_mtx <- sub('\\.mtx$', '', input$mtx[2])

    out <- read_velocity_output(
        spliced_dir = dirname(spliced_mtx),
        unspliced_dir = dirname(unspliced_mtx),
        spliced_name = basename(spliced_mtx),
        unspliced_name = basename(unspliced_mtx)
    )

    row <- intersect(rownames(out$spliced), rownames(out$unspliced))

    col <- intersect(colnames(out$spliced), colnames(out$unspliced))

    out$spliced <- out$spliced[row, col]

    out$unspliced <- out$unspliced[row, col]

    saveRDS(out, file = output$rds)

}

main(snakemake@input, snakemake@output, snakemake@log)

R BUSpaRse From line 6 of scripts/read_velocity_output.R

main <- function(input, output, log, wildcards) {

     # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(SingleCellExperiment)

    mat <- readRDS(input$rds)

    ann <- read.delim(input$tsv, header = FALSE, col.names = c("id", "name"))

    dim <- list(i = rownames(mat$spliced), j = colnames(mat$spliced))

    ind <- match(dim$i, ann$id)

    ann <- ann[ind, ]

    sce <- SingleCellExperiment(

        assays = list(
            counts = mat$spliced,
            spliced = mat$spliced,
            unspliced = mat$unspliced
        ),

        rowData = DataFrame(
            ID = ann$id,
            Symbol = ann$name
        ),

        colData = DataFrame(
            Sample = wildcards$sample,
            Barcode = dim$j
        )
    )

    saveRDS(sce, file = output$rds)

}

main(snakemake@input, snakemake@output, snakemake@log, snakemake@wildcards)

R SingleCellExperiment From line 6 of scripts/singlecellexperiment_kallisto.R

main <- function(input, output, log, wildcards) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Load Bioconductor packages

    library(tximport)

    library(tximeta)

    library(SummarizedExperiment)

    library(SingleCellExperiment)

    # Import salmon quantification

    con <- file.path(input$dir, "alevin", "quants_mat.gz")

    txi <- tximport(con, type = "alevin", dropInfReps = FALSE)

    rse <- SummarizedExperiment(assays = list(counts = round(txi$counts)))

    # Split by spliced and unspliced

    dat <- read.delim(input$tsv[1], header = TRUE, col.names = c("spliced", "unspliced"))

    rse <- splitSE(rse, dat, assayName = "counts")

    # Import and match annotation

    ann <- read.delim(input$tsv[2], header = FALSE, col.names = c("id", "name"))

    ann <- ann[match(rownames(rse), ann$id), ]

    # Create SCE object

    sce <- SingleCellExperiment(
        assays = list(
            counts = assay(rse, "spliced"),
            spliced = assay(rse, "spliced"),
            unspliced = assay(rse, "unspliced")
        ),
        rowData = DataFrame(
            ID = ann$id,
            Symbol = ann$name
        ),
        colData = DataFrame(
            Sample = wildcards$sample,
            Barcode = colnames(rse)
        )
    )

    # Save SCE object

    saveRDS(sce, file = output$rds)

}

main(snakemake@input, snakemake@output, snakemake@log, snakemake@wildcards)

R Salmon tximport SingleCellExperiment SummarizedExperiment tximeta From line 6 of scripts/singlecellexperiment_salmon.R

main <- function(input, output, params, log, wildcards) {

    # Setup logging

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Define directory

    dirname <- input$dir

    dirname <- file.path(dirname, "Solo.out", "Velocyto", "raw")

    # Read features

    names  <- c("id", "name", "type")

    features <- file.path(dirname, "features.tsv")

    features <- data.table::fread(features, header = FALSE, col.names = names)

    # Read barcodes

    names <- c("id")

    barcodes <- file.path(dirname, "barcodes.tsv")

    barcodes <- data.table::fread(barcodes, header = FALSE, col.names = names)

    # Read spliced matrix

    names <- c("feature", "barcode", "count")

    spliced <- file.path(dirname, "spliced.mtx")

    spliced <- data.table::fread(spliced, header = FALSE, skip = 2, col.names = names)

    # Read unspliced matrix

    names <- c("feature", "barcode", "count")

    unspliced <- file.path(dirname, "unspliced.mtx")

    unspliced <- data.table::fread(unspliced, header = FALSE, skip = 2, col.names = names)

    # Set dimensions

    shape <- c(nrow(features), nrow(barcodes))

    names <- list(features$id, barcodes$id)

    # Create object

    sce <- SingleCellExperiment::SingleCellExperiment(

        assays = list(

            counts = Matrix::sparseMatrix(
                i = spliced$feature,
                j = spliced$barcode,
                x = spliced$count,
                dims = shape,
                dimnames = names
            ),

            spliced = Matrix::sparseMatrix(
                i = spliced$feature,
                j = spliced$barcode,
                x = spliced$count,
                dims = shape,
                dimnames = names
            ),

            unspliced = Matrix::sparseMatrix(
                i = unspliced$feature,
                j = unspliced$barcode,
                x = unspliced$count,
                dims = shape,
                dimnames = names
            )
        ),

        rowData = S4Vectors::DataFrame(
            ID = features$id,
            Symbol = features$name
        ),

        colData = S4Vectors::DataFrame(
            Sample = wildcards$sample,
            Barcode = barcodes$id
        )

    )

    # Save object

    saveRDS(sce, file = output$rds)

}

main(snakemake@input, snakemake@output, snakemake@params, snakemake@log, snakemake@wildcards)