RNA-seq workflow for Snakemake based on STAR and featureCounts.
Snakemake-rnaseq
|Snakemake| |Wercker|
This is a Snakemake workflow for generating gene expression counts from RNA-sequencing data using STAR and featureCounts (from the subread package). The workflow is designed to handle both single-end and paired-end sequencing data, as well as sequencing data from multiple lanes. Processing of patient-derived xenograft (PDX) samples is also supported, by using disambiguate to separate graft/host sequence reads.
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository and, if available, its DOI (see above).
.. |Snakemake| image:: https://img.shields.io/badge/snakemake-≥3.13.3-brightgreen.svg
   :target: https://snakemake.bitbucket.io

.. |Wercker| image:: https://app.wercker.com/status/9c3bfb4aa4dbffc027b7a0fcfc00cc57/s/develop
   :target: https://app.wercker.com/project/byKey/9c3bfb4aa4dbffc027b7a0fcfc00cc57
Overview
The standard (non-PDX) workflow essentially performs the following steps:
- Cutadapt is used to trim the input reads for adapters and/or poor-quality base calls.
- The trimmed reads are aligned to the reference genome using STAR.
- The resulting alignments are sorted and indexed using sambamba.
- featureCounts is used to generate gene expression counts.
- The per-sample counts are merged into a single count file.
- The merged counts are normalized for differences in sequencing depth (using DESeq's median-of-ratios approach) and log-transformed.
This results in the following dependency graph:
.. image:: https://jrderuiter.github.io/snakemake-rnaseq/_images/dag.svg
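To make these outputs concrete, a top-level target rule tying the main results together might look like the sketch below. Only counts/merged.txt appears verbatim in the code snippets further down; the other target paths are illustrative assumptions.

.. code-block:: python

    rule all:
        input:
            # Merged per-sample gene expression counts (path taken from the
            # merge_counts rule shown under "Code Snippets").
            "counts/merged.txt",
            # Normalized, log2-transformed counts (hypothetical path).
            "counts/merged.log2.txt",
            # Aggregated QC report produced by MultiQC (hypothetical path).
            "qc/multiqc_report.html"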
The PDX workflow is a slightly modified version of the standard workflow, which aligns the reads to two reference genomes (host and graft) and uses disambiguate_ to remove sequence reads originating from the host organism. See the documentation for more details.
Documentation
Documentation is available at: http://jrderuiter.github.io/snakemake-rnaseq.
License
This software is released under the MIT license.
.. _disambiguate: https://github.com/AstraZeneca-NGS/disambiguate
Code Snippets
The alignment, sorting, merging, indexing, and disambiguation rules delegate their work to published Snakemake wrappers:

.. code-block:: python

    wrapper: "0.17.0/bio/star/align"
    wrapper: "0.17.0/bio/star/align"
    wrapper: "0.17.0/bio/sambamba/sort"
    wrapper: "0.17.0/bio/samtools/merge"
    wrapper: "0.17.0/bio/ngs-disambiguate"
    wrapper: "0.17.0/bio/sambamba/sort"
    wrapper: "0.17.0/bio/samtools/index"
    wrapper: "0.17.0/bio/star/align"
    wrapper: "0.17.0/bio/sambamba/sort"
    wrapper: "0.17.0/bio/samtools/merge"
    wrapper: "0.17.0/bio/samtools/index"
.. code-block:: python

    wrapper: "file://" + path.join(workflow.basedir, "wrappers/subread/feature_counts")


    rule merge_counts:
        input:
            expand("counts/per_sample/{sample}.txt", sample=get_samples())
        output:
            "counts/merged.txt"
.. code-block:: python

        run:
            # Merge count files.
            frames = (pd.read_csv(fp, sep="\t", skiprows=1,
                                  index_col=list(range(6)))
                      for fp in input)
            merged = pd.concat(frames, axis=1)

            # Extract sample names.
            merged = merged.rename(
                columns=lambda c: path.splitext(path.basename(c))[0])

            merged.to_csv(output[0], sep="\t", index=True)
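The local wrappers/subread/feature_counts wrapper referenced at the top of this snippet is not reproduced here. A minimal sketch written in the same style as the other wrappers, assuming a params.annotation GTF path and a single named BAM input (these names are illustrative, not the workflow's actual wrapper), could look like:

.. code-block:: python

    from snakemake.shell import shell

    # Optional extra arguments passed through params (assumption).
    extra = snakemake.params.get("extra", "")
    log = snakemake.log_fmt_shell(stdout=True, stderr=True)

    # Count reads per gene with featureCounts: -a is the annotation (GTF),
    # -T the number of threads and -o the output count table.
    shell(
        "featureCounts"
        " {extra}"
        " -a {snakemake.params.annotation}"
        " -T {snakemake.threads}"
        " -o {snakemake.output[0]}"
        " {snakemake.input.bam}"
        " {log}")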
.. code-block:: python

        run:
            counts = pd.read_csv(input[0], sep="\t", index_col=list(range(6)))

            norm_counts = np.log2(normalize_counts(counts) + 1)
            norm_counts.to_csv(output[0], sep="\t", index=True)
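The normalize_counts helper used here is defined elsewhere in the workflow and not included in these snippets. A minimal sketch of the DESeq-style median-of-ratios normalization it refers to (the function name matches the call above, but the implementation details are assumptions) could look like:

.. code-block:: python

    import numpy as np
    import pandas as pd


    def normalize_counts(counts):
        """Normalize counts for sequencing depth using median-of-ratios size
        factors (a sketch of the approach popularized by DESeq)."""

        # Per-gene geometric mean of counts across samples, computed in log
        # space and ignoring zero counts.
        log_counts = np.log(counts.replace(0, np.nan))
        log_geo_means = log_counts.mean(axis=1)

        # Per-sample size factor: median ratio of a sample's counts to the
        # per-gene geometric means.
        log_ratios = log_counts.sub(log_geo_means, axis=0)
        size_factors = np.exp(log_ratios.median(axis=0))

        # Scale each sample (column) by its size factor.
        return counts.div(size_factors, axis=1)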
The read-trimming and QC rules likewise use published wrappers (plus one plain shell directive):

.. code-block:: python

    wrapper: "0.17.0/bio/cutadapt/pe"
    wrapper: "0.17.0/bio/cutadapt/se"
    shell: "cp {input} {output}"
    wrapper: "0.17.0/bio/multiqc"
    wrapper: "0.17.0/bio/fastqc"
    wrapper: "0.17.0/bio/samtools/stats"
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell n = len(snakemake.input) assert n == 2, "Input must contain 2 (paired-end) elements." log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell( "cutadapt" " {snakemake.params}" " -o {snakemake.output.fastq1}" " -p {snakemake.output.fastq2}" " {snakemake.input}" " > {snakemake.output.qc} {log}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell( "cutadapt" " {snakemake.params}" " -o {snakemake.output.fastq}" " {snakemake.input[0]}" " > {snakemake.output.qc} {log}") |
.. code-block:: python

    __author__ = "Julian de Ruiter"
    __copyright__ = "Copyright 2017, Julian de Ruiter"
    __email__ = "julianderuiter@gmail.com"
    __license__ = "MIT"


    from os import path

    from snakemake.shell import shell


    def basename_without_ext(file_path):
        """Returns basename of file path, without the file extension."""
        base = path.basename(file_path)

        split_ind = 2 if base.endswith(".gz") else 1
        base = ".".join(base.split(".")[:-split_ind])

        return base


    # Run fastqc.
    output_dir = path.dirname(snakemake.output.html)
    shell("fastqc {snakemake.params} --quiet "
          "--outdir {output_dir} {snakemake.input[0]}")

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(output_dir, output_base + "_fastqc.html")
    zip_path = path.join(output_dir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path} {snakemake.output.html}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path} {snakemake.output.zip}")
.. code-block:: python

    __author__ = "Julian de Ruiter"
    __copyright__ = "Copyright 2017, Julian de Ruiter"
    __email__ = "julianderuiter@gmail.com"
    __license__ = "MIT"


    from os import path

    from snakemake.shell import shell

    input_dirs = set(path.dirname(fp) for fp in snakemake.input)
    output_dir = path.dirname(snakemake.output[0])
    output_name = path.basename(snakemake.output[0])

    log = snakemake.log_fmt_shell(stdout=True, stderr=True)

    shell(
        "multiqc"
        " {snakemake.params}"
        " --force"
        " -o {output_dir}"
        " -n {output_name}"
        " {input_dirs}"
        " {log}")
.. code-block:: python

    __author__ = "Julian de Ruiter"
    __copyright__ = "Copyright 2017, Julian de Ruiter"
    __email__ = "julianderuiter@gmail.com"
    __license__ = "MIT"


    from os import path

    from snakemake.shell import shell

    # Extract arguments.
    prefix = snakemake.params.get("prefix", None)
    extra = snakemake.params.get("extra", "")

    output_dir = path.dirname(snakemake.output.a_ambiguous)
    log = snakemake.log_fmt_shell(stdout=False, stderr=True)

    # If no prefix is given, use the summary path to derive the most probable
    # sample name (as the summary path is least likely to contain additional
    # suffixes). This is better than using a random id as prefix, because the
    # prefix is also used as the sample name in the summary file.
    if prefix is None:
        prefix = path.splitext(path.basename(snakemake.output.summary))[0]

    # Run command.
    shell(
        "ngs_disambiguate"
        " {extra}"
        " -o {output_dir}"
        " -s {prefix}"
        " -a {snakemake.params.algorithm}"
        " {snakemake.input.a}"
        " {snakemake.input.b}")

    # Move outputs into expected positions.
    output_base = path.join(output_dir, prefix)

    output_map = {
        output_base + ".ambiguousSpeciesA.bam": snakemake.output.a_ambiguous,
        output_base + ".ambiguousSpeciesB.bam": snakemake.output.b_ambiguous,
        output_base + ".disambiguatedSpeciesA.bam":
            snakemake.output.a_disambiguated,
        output_base + ".disambiguatedSpeciesB.bam":
            snakemake.output.b_disambiguated,
        output_base + "_summary.txt": snakemake.output.summary
    }

    for src, dest in output_map.items():
        if src != dest:
            shell("mv {src} {dest}")
1 2 3 4 5 6 7 8 9 10 11 12 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" import os from snakemake.shell import shell shell( "sambamba sort {snakemake.params} -t {snakemake.threads} " "-o {snakemake.output[0]} {snakemake.input[0]}") |
1 2 3 4 5 6 7 8 9 10 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" from snakemake.shell import shell shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" from snakemake.shell import shell shell("samtools merge --threads {snakemake.threads} {snakemake.params} " "{snakemake.output[0]} {snakemake.input}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell extra = snakemake.params.get("extra", "") region = snakemake.params.get("region", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell("samtools stats {extra} {snakemake.input}" " {region} > {snakemake.output} {log}") |
.. code-block:: python

    __author__ = "Johannes Köster"
    __copyright__ = "Copyright 2016, Johannes Köster"
    __email__ = "koester@jimmy.harvard.edu"
    __license__ = "MIT"


    import os

    from snakemake.shell import shell

    extra = snakemake.params.get("extra", "")
    log = snakemake.log_fmt_shell(stdout=True, stderr=True)

    n = len(snakemake.input.sample)
    assert n == 1 or n == 2, "input->sample must have 1 (single-end) or 2 (paired-end) elements."

    if snakemake.input.sample[0].endswith(".gz"):
        readcmd = "--readFilesCommand zcat"
    else:
        readcmd = ""

    outprefix = os.path.dirname(snakemake.output[0]) + "/"

    shell(
        "STAR "
        "{extra} "
        "--runThreadN {snakemake.threads} "
        "--genomeDir {snakemake.params.index} "
        "--readFilesIn {snakemake.input.sample} "
        "{readcmd} "
        "--outSAMtype BAM Unsorted "
        "--outFileNamePrefix {outprefix} "
        "--outStd Log "
        "{log}")
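For context, a rule invoking one of the published wrappers above might be written roughly as follows; the rule name, file paths and parameter values are illustrative assumptions rather than the workflow's actual rule definitions:

.. code-block:: python

    rule star_align:
        input:
            # One FASTQ for single-end data, two for paired-end data.
            sample=["trimmed/{sample}.R1.fastq.gz",
                    "trimmed/{sample}.R2.fastq.gz"]
        output:
            "star/{sample}/Aligned.out.bam"
        log:
            "logs/star/{sample}.log"
        params:
            # Path to the STAR genome index plus any extra STAR arguments.
            index="reference/star_index",
            extra=""
        threads: 8
        wrapper:
            "0.17.0/bio/star/align"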