Snakemake pipeline for single-cell RNA-seq analysis

public 1yr ago Version: 2 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Setup

Testing

# generate 10x v3 scRNA-seq test data
cd .test/ngs-test-data
snakemake scrnaseq_10x_v3 --use-conda -c1 --conda-cleanup-pkgs cache
# test the workflow
cd ../..
snakemake all --use-conda -c1 --directory .test --show-failed-logs

Configuration

Create two TSVs: (1) a "runsheet" that describes each 10x run and (2) a "patientsheet" that describes each patient

The runsheet should have the following columns

run_id : a unique identifier for each run
r1 : the path to the R1 fastq file
r2 : the path to the R2 fastq file
total_barcodes : the total number of barcodes in the run
cells_loaded : the number of cells loaded into the run
expected_cells : the number of cells expected to be yielded in the run
expected_multiplet_rate : the expected multiplet rate of the run

The patientsheet should have the following columns

patient_id : a unique identifier for each patient
condition : the condition of the patient
Enter the paths to the runsheet and patientsheet in the config.yaml file. Specify an output directory.

outdir: <path to output directory>
runsheet: <path to runsheet>
patientsheet: <path to patientsheet>

Running the workflow

# run the workflow all the way through
# -c16 = use 16 cores, this can be changed to whatever you want
outdir="<path to output directory>"
snakemake all --use-conda -c16 --show-failed-logs --config outdir="$outdir"

To run until a certain step, use the --until flag

# run until a certain step
snakemake all --use-conda -c16 --show-failed-logs --config outdir="$outdir" --until <rule name>
# IE: stop after indexing the genome
snakemake all --use-conda -c16 --show-failed-logs --config outdir="$outdir" --until STARindex
# IE: stop after the STARsolo
snakemake all --use-conda -c16 --show-failed-logs --config outdir="$outdir" --until STARsolo

TODO:

[ ] Use different rules to run different stages of pipeline (instead of all , use map_count , compass , preprocess , etc.)

Code Snippets

shell:
    """
    python {input} install
    touch {output}
    """

SnakeMake From line 8 of rules/compass.smk

script:
    "../scripts/prepare_data_for_compass.py"

SnakeMake From line 28 of rules/compass.smk

shell:
    """
    compass --precache --species homo_sapiens
    touch {output}
    """

SnakeMake From line 39 of rules/compass.smk

shell:
    """
    tmpdir=$(mktemp -d -p $(dirname {output}))
    compass \
        --data-mtx {input.norm} {input.features} {input.barcodes} \
        --output-dir $(dirname {output}) \
        --microcluster-size 10 \
        --num-processes {threads} \
        --temp-dir $tmpdir \
        --species homo_sapiens 2> {log}
    """

SnakeMake From line 61 of rules/compass.smk

shell:
    """
    git clone https://github.com/aertslab/popscle_helper_tools.git {output}
    """

SnakeMake From line 4 of rules/demuxlet.smk

shell:
    """
    touch {log} && exec > {log} 2>&1
    source {input.tools}/filter_vcf_file_for_popscle.sh
    check_if_programs_exists

    tmpvcf=$(mktemp)
    trap 'rm -f $tmpvcf' EXIT

    # subset samples of interest
    bcftools view -s {params.samples} -Ou {input.vcf} > $tmpvcf

    {input.tools}/sort_vcf_same_as_bam.sh {input.bam} $tmpvcf \
    | only_keep_snps \
    | calculate_AF_AC_AN_values_based_on_genotype_info \
    | filter_out_mutations_missing_genotype_for_one_or_more_samples \
    | filter_out_mutations_homozygous_reference_in_all_samples \
    | filter_out_mutations_homozygous_in_all_samples > {output}

    rm -f $tmpvcf
    """

SnakeMake BCFtools From line 33 of rules/demuxlet.smk

shell:
    "{input.tools}/filter_bam_file_for_popscle_dsc_pileup.sh {input.bam} {input.barcodes} {input.vcf} {output} > {log} 2>&1"

SnakeMake From line 69 of rules/demuxlet.smk

shell:
    """
    touch {log} && exec > {log} 2>&1

    out=$(dirname {output})/$(basename {output} .plp.gz)

    popscle dsc-pileup \
        --sam {input.bam} \
        --group-list {input.barcodes} \
        --vcf {input.vcf} \
        --out $out
    """

SnakeMake From line 84 of rules/demuxlet.smk

shell:
    """
    touch {log} && exec > {log} 2>&1

    out=$(dirname {output})/$(basename {output} .best)
    plp=$(dirname {input.plp})/$(basename {input.plp} .plp.gz)

    popscle demuxlet \
        --plp $plp \
        --group-list {input.barcodes} \
        --vcf {input.vcf} \
        --out $out
    """

SnakeMake From line 109 of rules/demuxlet.smk

shell:
    "juptyer nbconvert --to html {input} --output {output}"

SnakeMake From line 146 of rules/demuxlet.smk

shell:
    "STAR --runMode genomeGenerate --runThreadN {threads} --genomeDir {output} --genomeFastaFiles {input.fa} --sjdbGTFfile {input.gtf} --genomeSAindexNbases {params.genomeSAindexNbases}"

SnakeMake STAR From line 30 of rules/map_count.smk

script:
    "../scripts/STARsolo.py"

SnakeMake From line 73 of rules/map_count.smk

wrapper:
    "v1.23.5/bio/sambamba/index"

SnakeMake From line 87 of rules/map_count.smk

shell:
    "jupyter nbconvert --to html {input}"

SnakeMake From line 126 of rules/map_count.smk

shell:
    """
    # use CUDA if GPU is present
    if which nvidia-smi > /dev/null; then
        cuda="--cuda"
    else
        cuda=""
    fi
    expected=$(wc -l {input.filtered} | tr ' ' '\n' | head -n 1)

    # run cellbender
    mkdir -p $(dirname {output.raw})
    cellbender remove-background \
        --input $(dirname {input.raw}) \
        --output {output.raw} \
        --epochs 300 \
        --learning-rate 0.00001 \
        --z-dim 50 \
        --expected-cells $expected \
        --total-droplets-included {params.total_droplets_included} $cuda
    """

SnakeMake From line 151 of rules/map_count.smk

shell:
    """
    irescue \
        --bam {input.bam} \
        --tmpdir IRescue_tmp_{wildcards.run} \
        --genome {params.genome} \
        --threads {threads} \
        --outdir $(dirname {output[0]}) \
        --whitelist {input.whitelist}
    """

SnakeMake irescue From line 204 of rules/map_count.smk

shell:
    """
    git clone https://github.com/bvaldebenitom/SoloTE.git {output} 2> {log}

    # fix bad code on line 321 of SoloTE_1.08/SoloTE_pipeline.py
    sed -i 's/stdout.split/split/g' {output}/SoloTE_1.08/SoloTE_pipeline.py

    chmod +x {output}/SoloTE_1.08/convertRMOut_to_SoloTEinput.sh
    """

SnakeMake oschmod From line 223 of rules/map_count.smk

shell:
    "{input.solote}/SoloTE_1.08/convertRMOut_to_SoloTEinput.sh {input.rmsk_out} {output} > {log} 2>&1"

SnakeMake From line 244 of rules/map_count.smk

shell:
    """
    outdir=$(dirname $(dirname {output[0]}))

    python {input.solote}/SoloTE_1.08/SoloTE_pipeline.py \
        --threads {threads} \
        --bam {input.bam} \
        --teannotation {input.te_bed} \
        --outputprefix {wildcards.run} \
        --outputdir $outdir/ > {log} 2>&1

    mv {wildcards.run}_SoloTE.stats $outdir/{wildcards.run}_SoloTE_output/

    # clean up
    rm -rf $outdir/{wildcards.run}_SoloTE_temp
    """

SnakeMake From line 267 of rules/map_count.smk

shell:
    """
    tar -xf {input.ref10x} --wildcards '*genome.fa' '*genes.gtf'
    mv $(basename {input.ref10x} .tar.gz)/genes/genes.gtf {output.gtf}
    mv $(basename {input.ref10x} .tar.gz)/fasta/genome.fa {output.fa}
    gunzip -c {input.rmsk_out} > {output.rmsk_out}
    """

SnakeMake From line 25 of rules/ref.smk

shell:
    """
    if [[ {input} == *.gz ]]; then
        gunzip -c {input} > {output}
    else
        cp {input} {output}
    fi
    """

SnakeMake From line 56 of rules/ref.smk

__author__ = ["Michael Cuoco", "Joelle Faybishenko"]


import pegasus as pg
import sys
import pandas as pd
from pathlib import Path
from snakemake.shell import shell

sys.stderr = open(snakemake.log[0], "w")

print(f"opening file {snakemake.input[0]}")
data = pg.read_input(snakemake.input[0])

print("normalize data")
pg.normalize(data, norm_count=1e5)

outdir = Path(snakemake.output.norm).parent.parent
print(f"writing files to {outdir}")
pg.write_output(data, outdir, file_type="mtx")

for file in snakemake.output:
    print(f"decompressing {file}")
    shell(f"gunzip {file}.gz")

# fix barcodes and features files for proper reading by compass
df = pd.read_csv(snakemake.output.barcodes, sep="\t")
df["barcodekey"].to_csv(snakemake.output.barcodes, sep="\t", index=False, header=False)

df = pd.read_csv(snakemake.output.features, sep="\t")
df["featurekey"].to_csv(snakemake.output.features, sep="\t", index=False, header=False)


sys.stderr.close()

Python Snakemake Pandas pegasus From line 3 of scripts/prepare_data_for_compass.py

__author__ = "Michael Cuoco"

import tempfile, os
from pathlib import Path
from snakemake.shell import shell

# setup input fastqs and outdir
if type(snakemake.input.r1) is str:
    r1 = snakemake.input.r1
    r2 = snakemake.input.r2
else:
    assert len(snakemake.input.r1) == len(
        snakemake.input.r2
    ), "Unequal number of R1 and R2 files"
    r1 = ",".join(snakemake.input.r1)
    r2 = ",".join(snakemake.input.r2)

outdir = str(Path(snakemake.output.bam).parent) + "/"

# set readFilesIn and solo10xProtocol
print(f"10x chemistry: {snakemake.config['10x_chemistry']}")
if snakemake.config["10x_chemistry"] == "5prime":
    solo10xProtocol = "--soloBarcodeMate 1 --clip5pNbases 39 0 --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 10"
    readFilesIn = f"--readFilesIn {r1} {r2}"
elif snakemake.config["10x_chemistry"] == "3prime_v2":
    solo10xProtocol = "--soloUMIlen 16 --clipAdapterType CellRanger4"
    readFilesIn = f"--readFilesIn {r2} {r1}"
elif snakemake.config["10x_chemistry"] == "3prime_v3":
    solo10xProtocol = "--soloUMIlen 12 --clipAdapterType CellRanger4"
    readFilesIn = f"--readFilesIn {r2} {r1}"
else:
    raise ValueError("Invalid 10x protocol")

# set read command
print(f"Read 1 file(s): {r1}")
print(f"Read 2 file(s): {r2}")
if r1.endswith(".gz"):
    readcmd = "gunzip -c"
elif r1.endswith(".bz2"):
    readcmd = "bunzip2 -c"
elif r1.endswith(".fastq") or r1.endswith(".fq"):
    readcmd = ""
else:
    raise ValueError("Invalid read file")

# set solobarcode read length
if snakemake.config["STARsolo"].get("soloBarcodeReadLength"):
    soloBarcodeReadLength = "--soloBarcodeReadLength " + str(
        snakemake.config["STARsolo"]["soloBarcodeReadLength"]
    )
else:
    soloBarcodeReadLength = ""

# set soloMultiMappers
if snakemake.config["STARsolo"].get("soloMultiMappers"):
    soloMultiMappers = "--soloMultiMappers " + str(
        snakemake.config["STARsolo"]["soloMultiMappers"]
    )
else:
    soloMultiMappers = ""

if snakemake.config["STARsolo"].get("outSAMmultNmax"):
    outSAMmultNmax = "--outSAMmultNmax " + str(
        snakemake.config["STARsolo"]["outSAMmultNmax"]
    )
else:
    outSAMmultNmax = ""

outFilterMultimapNmax = snakemake.config["STARsolo"]["outFilterMultimapNmax"]
winAnchorMultimapNmax = snakemake.config["STARsolo"]["winAnchorMultimapNmax"]

mem = str(50e9)  # use 50 GB of RAM for sorting

with tempfile.TemporaryDirectory() as tmpdir:
    statvfs = os.statvfs(tmpdir)
    if statvfs.f_frsize * statvfs.f_bavail < 2e10:
        print(
            "WARNING: Less than 20 GB of free disk space detected at tmpdir location. STAR may fail due to lack of disk space. Set $TMPDIR to a location with more free space."
        )
    shell(
        "STAR "
        " --runThreadN {snakemake.threads}"
        " --genomeDir {snakemake.input.idx}"
        " --readFilesCommand {readcmd}"
        " {readFilesIn}"
        " --soloType CB_UMI_Simple"
        " {solo10xProtocol}"
        " {soloBarcodeReadLength}"
        " {soloMultiMappers}"
        " {outSAMmultNmax}"
        " --soloFeatures {snakemake.params.soloFeatures}"
        " --soloCellFilter EmptyDrops_CR"
        " --soloCBwhitelist {snakemake.input.whitelist}"
        " --soloOutFormatFeaturesGeneField3 -"
        " --outTmpDir {tmpdir}/STARtmp"
        " --soloOutFileNames outs genes.tsv barcodes.tsv matrix.mtx"
        " --outFilterMultimapNmax {outFilterMultimapNmax}"
        " --winAnchorMultimapNmax {winAnchorMultimapNmax}"
        " --limitBAMsortRAM {mem} "
        " --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM"
        " --outSAMtype BAM SortedByCoordinate"
        " --outFileNamePrefix {outdir}"
    )

Python Snakemake STAR From line 3 of scripts/STARsolo.py

__author__ = "Jan Forster"
__copyright__ = "Copyright 2021, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"


import os
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "sambamba index {snakemake.params.extra} -t {snakemake.threads} "
    "{snakemake.input[0]} {snakemake.output[0]} "
    "{log}"
)