Snakemake pipeline for single-cell RNA-seq analysis
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Setup
Testing
# generate 10x v3 scRNA-seq test data
cd .test/ngs-test-data
snakemake scrnaseq_10x_v3 --use-conda -c1 --conda-cleanup-pkgs cache
# test the workflow
cd ../..
snakemake all --use-conda -c1 --directory .test --show-failed-logs
Configuration
- Create two TSVs: (1) a "runsheet" that describes each 10x run and (2) a "patientsheet" that describes each patient
The runsheet should have the following columns
-
run_id
: a unique identifier for each run -
r1
: the path to the R1 fastq file -
r2
: the path to the R2 fastq file -
total_barcodes
: the total number of barcodes in the run -
cells_loaded
: the number of cells loaded into the run -
expected_cells
: the number of cells expected to be yielded in the run -
expected_multiplet_rate
: the expected multiplet rate of the run
The patientsheet should have the following columns
-
patient_id
: a unique identifier for each patient -
condition
: the condition of the patient -
Enter the paths to the runsheet and patientsheet in the
config.yaml
file. Specify an output directory.
outdir: <path to output directory>
runsheet: <path to runsheet>
patientsheet: <path to patientsheet>
Running the workflow
# run the workflow all the way through
# -c16 = use 16 cores, this can be changed to whatever you want
outdir="<path to output directory>"
snakemake all --use-conda -c16 --show-failed-logs --config outdir="$outdir"
To run until a certain step, use the
--until
flag
# run until a certain step
snakemake all --use-conda -c16 --show-failed-logs --config outdir="$outdir" --until <rule name>
# IE: stop after indexing the genome
snakemake all --use-conda -c16 --show-failed-logs --config outdir="$outdir" --until STARindex
# IE: stop after the STARsolo
snakemake all --use-conda -c16 --show-failed-logs --config outdir="$outdir" --until STARsolo
TODO:
-
[ ] Use different rules to run different stages of pipeline (instead of
all
, usemap_count
,compass
,preprocess
, etc.)
Code Snippets
8 9 10 11 12 | shell: """ python {input} install touch {output} """ |
28 29 | script: "../scripts/prepare_data_for_compass.py" |
39 40 41 42 43 | shell: """ compass --precache --species homo_sapiens touch {output} """ |
61 62 63 64 65 66 67 68 69 70 71 | shell: """ tmpdir=$(mktemp -d -p $(dirname {output})) compass \ --data-mtx {input.norm} {input.features} {input.barcodes} \ --output-dir $(dirname {output}) \ --microcluster-size 10 \ --num-processes {threads} \ --temp-dir $tmpdir \ --species homo_sapiens 2> {log} """ |
4 5 6 7 | shell: """ git clone https://github.com/aertslab/popscle_helper_tools.git {output} """ |
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 | shell: """ touch {log} && exec > {log} 2>&1 source {input.tools}/filter_vcf_file_for_popscle.sh check_if_programs_exists tmpvcf=$(mktemp) trap 'rm -f $tmpvcf' EXIT # subset samples of interest bcftools view -s {params.samples} -Ou {input.vcf} > $tmpvcf {input.tools}/sort_vcf_same_as_bam.sh {input.bam} $tmpvcf \ | only_keep_snps \ | calculate_AF_AC_AN_values_based_on_genotype_info \ | filter_out_mutations_missing_genotype_for_one_or_more_samples \ | filter_out_mutations_homozygous_reference_in_all_samples \ | filter_out_mutations_homozygous_in_all_samples > {output} rm -f $tmpvcf """ |
69 70 | shell: "{input.tools}/filter_bam_file_for_popscle_dsc_pileup.sh {input.bam} {input.barcodes} {input.vcf} {output} > {log} 2>&1" |
84 85 86 87 88 89 90 91 92 93 94 95 | shell: """ touch {log} && exec > {log} 2>&1 out=$(dirname {output})/$(basename {output} .plp.gz) popscle dsc-pileup \ --sam {input.bam} \ --group-list {input.barcodes} \ --vcf {input.vcf} \ --out $out """ |
109 110 111 112 113 114 115 116 117 118 119 120 121 | shell: """ touch {log} && exec > {log} 2>&1 out=$(dirname {output})/$(basename {output} .best) plp=$(dirname {input.plp})/$(basename {input.plp} .plp.gz) popscle demuxlet \ --plp $plp \ --group-list {input.barcodes} \ --vcf {input.vcf} \ --out $out """ |
146 147 | shell: "juptyer nbconvert --to html {input} --output {output}" |
30 31 | shell: "STAR --runMode genomeGenerate --runThreadN {threads} --genomeDir {output} --genomeFastaFiles {input.fa} --sjdbGTFfile {input.gtf} --genomeSAindexNbases {params.genomeSAindexNbases}" |
73 74 | script: "../scripts/STARsolo.py" |
87 88 | wrapper: "v1.23.5/bio/sambamba/index" |
126 127 | shell: "jupyter nbconvert --to html {input}" |
151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 | shell: """ # use CUDA if GPU is present if which nvidia-smi > /dev/null; then cuda="--cuda" else cuda="" fi expected=$(wc -l {input.filtered} | tr ' ' '\n' | head -n 1) # run cellbender mkdir -p $(dirname {output.raw}) cellbender remove-background \ --input $(dirname {input.raw}) \ --output {output.raw} \ --epochs 300 \ --learning-rate 0.00001 \ --z-dim 50 \ --expected-cells $expected \ --total-droplets-included {params.total_droplets_included} $cuda """ |
204 205 206 207 208 209 210 211 212 213 | shell: """ irescue \ --bam {input.bam} \ --tmpdir IRescue_tmp_{wildcards.run} \ --genome {params.genome} \ --threads {threads} \ --outdir $(dirname {output[0]}) \ --whitelist {input.whitelist} """ |
223 224 225 226 227 228 229 230 231 | shell: """ git clone https://github.com/bvaldebenitom/SoloTE.git {output} 2> {log} # fix bad code on line 321 of SoloTE_1.08/SoloTE_pipeline.py sed -i 's/stdout.split/split/g' {output}/SoloTE_1.08/SoloTE_pipeline.py chmod +x {output}/SoloTE_1.08/convertRMOut_to_SoloTEinput.sh """ |
244 245 | shell: "{input.solote}/SoloTE_1.08/convertRMOut_to_SoloTEinput.sh {input.rmsk_out} {output} > {log} 2>&1" |
267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 | shell: """ outdir=$(dirname $(dirname {output[0]})) python {input.solote}/SoloTE_1.08/SoloTE_pipeline.py \ --threads {threads} \ --bam {input.bam} \ --teannotation {input.te_bed} \ --outputprefix {wildcards.run} \ --outputdir $outdir/ > {log} 2>&1 mv {wildcards.run}_SoloTE.stats $outdir/{wildcards.run}_SoloTE_output/ # clean up rm -rf $outdir/{wildcards.run}_SoloTE_temp """ |
25 26 27 28 29 30 31 | shell: """ tar -xf {input.ref10x} --wildcards '*genome.fa' '*genes.gtf' mv $(basename {input.ref10x} .tar.gz)/genes/genes.gtf {output.gtf} mv $(basename {input.ref10x} .tar.gz)/fasta/genome.fa {output.fa} gunzip -c {input.rmsk_out} > {output.rmsk_out} """ |
56 57 58 59 60 61 62 63 | shell: """ if [[ {input} == *.gz ]]; then gunzip -c {input} > {output} else cp {input} {output} fi """ |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | __author__ = ["Michael Cuoco", "Joelle Faybishenko"] import pegasus as pg import sys import pandas as pd from pathlib import Path from snakemake.shell import shell sys.stderr = open(snakemake.log[0], "w") print(f"opening file {snakemake.input[0]}") data = pg.read_input(snakemake.input[0]) print("normalize data") pg.normalize(data, norm_count=1e5) outdir = Path(snakemake.output.norm).parent.parent print(f"writing files to {outdir}") pg.write_output(data, outdir, file_type="mtx") for file in snakemake.output: print(f"decompressing {file}") shell(f"gunzip {file}.gz") # fix barcodes and features files for proper reading by compass df = pd.read_csv(snakemake.output.barcodes, sep="\t") df["barcodekey"].to_csv(snakemake.output.barcodes, sep="\t", index=False, header=False) df = pd.read_csv(snakemake.output.features, sep="\t") df["featurekey"].to_csv(snakemake.output.features, sep="\t", index=False, header=False) sys.stderr.close() |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 | __author__ = "Michael Cuoco" import tempfile, os from pathlib import Path from snakemake.shell import shell # setup input fastqs and outdir if type(snakemake.input.r1) is str: r1 = snakemake.input.r1 r2 = snakemake.input.r2 else: assert len(snakemake.input.r1) == len( snakemake.input.r2 ), "Unequal number of R1 and R2 files" r1 = ",".join(snakemake.input.r1) r2 = ",".join(snakemake.input.r2) outdir = str(Path(snakemake.output.bam).parent) + "/" # set readFilesIn and solo10xProtocol print(f"10x chemistry: {snakemake.config['10x_chemistry']}") if snakemake.config["10x_chemistry"] == "5prime": solo10xProtocol = "--soloBarcodeMate 1 --clip5pNbases 39 0 --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 10" readFilesIn = f"--readFilesIn {r1} {r2}" elif snakemake.config["10x_chemistry"] == "3prime_v2": solo10xProtocol = "--soloUMIlen 16 --clipAdapterType CellRanger4" readFilesIn = f"--readFilesIn {r2} {r1}" elif snakemake.config["10x_chemistry"] == "3prime_v3": solo10xProtocol = "--soloUMIlen 12 --clipAdapterType CellRanger4" readFilesIn = f"--readFilesIn {r2} {r1}" else: raise ValueError("Invalid 10x protocol") # set read command print(f"Read 1 file(s): {r1}") print(f"Read 2 file(s): {r2}") if r1.endswith(".gz"): readcmd = "gunzip -c" elif r1.endswith(".bz2"): readcmd = "bunzip2 -c" elif r1.endswith(".fastq") or r1.endswith(".fq"): readcmd = "" else: raise ValueError("Invalid read file") # set solobarcode read length if snakemake.config["STARsolo"].get("soloBarcodeReadLength"): soloBarcodeReadLength = "--soloBarcodeReadLength " + str( snakemake.config["STARsolo"]["soloBarcodeReadLength"] ) else: soloBarcodeReadLength = "" # set soloMultiMappers if snakemake.config["STARsolo"].get("soloMultiMappers"): soloMultiMappers = "--soloMultiMappers " + str( snakemake.config["STARsolo"]["soloMultiMappers"] ) else: soloMultiMappers = "" if snakemake.config["STARsolo"].get("outSAMmultNmax"): outSAMmultNmax = "--outSAMmultNmax " + str( snakemake.config["STARsolo"]["outSAMmultNmax"] ) else: outSAMmultNmax = "" outFilterMultimapNmax = snakemake.config["STARsolo"]["outFilterMultimapNmax"] winAnchorMultimapNmax = snakemake.config["STARsolo"]["winAnchorMultimapNmax"] mem = str(50e9) # use 50 GB of RAM for sorting with tempfile.TemporaryDirectory() as tmpdir: statvfs = os.statvfs(tmpdir) if statvfs.f_frsize * statvfs.f_bavail < 2e10: print( "WARNING: Less than 20 GB of free disk space detected at tmpdir location. STAR may fail due to lack of disk space. Set $TMPDIR to a location with more free space." ) shell( "STAR " " --runThreadN {snakemake.threads}" " --genomeDir {snakemake.input.idx}" " --readFilesCommand {readcmd}" " {readFilesIn}" " --soloType CB_UMI_Simple" " {solo10xProtocol}" " {soloBarcodeReadLength}" " {soloMultiMappers}" " {outSAMmultNmax}" " --soloFeatures {snakemake.params.soloFeatures}" " --soloCellFilter EmptyDrops_CR" " --soloCBwhitelist {snakemake.input.whitelist}" " --soloOutFormatFeaturesGeneField3 -" " --outTmpDir {tmpdir}/STARtmp" " --soloOutFileNames outs genes.tsv barcodes.tsv matrix.mtx" " --outFilterMultimapNmax {outFilterMultimapNmax}" " --winAnchorMultimapNmax {winAnchorMultimapNmax}" " --limitBAMsortRAM {mem} " " --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM" " --outSAMtype BAM SortedByCoordinate" " --outFileNamePrefix {outdir}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | __author__ = "Jan Forster" __copyright__ = "Copyright 2021, Jan Forster" __email__ = "j.forster@dkfz.de" __license__ = "MIT" import os from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "sambamba index {snakemake.params.extra} -t {snakemake.threads} " "{snakemake.input[0]} {snakemake.output[0]} " "{log}" ) |
Support
- Future updates
Related Workflows





