Snakemake workflow for generating gene expression counts from RNA-sequencing data.
Introduction
Author: Anže Lovše (@AnzeLovse)
This is a Snakemake workflow for generating gene expression counts from RNA-sequencing data; it can optionally run differential expression analysis as well. The workflow handles both single-end and paired-end sequencing data. I used it to analyze sequencing data from Bacillus thuringiensis serovar israelensis as part of my bachelor's thesis.
If you use this workflow in a paper, please give credit to the authors by citing the URL of this (original) repository.
Overview
The standard workflow performs the following steps:
- Trim reads with Cutadapt.
- Build a genome index and align reads with Bowtie2.
- Sort and index reads with Samtools.
- Generate gene expression counts with featureCounts.
- Merge all counts into a single file.
- Normalize counts (CPM, TPM) with rnanorm (see the note after these lists).
- Optionally, run differential expression analysis with DESeq2.
Additionally, the workflow includes quality-control steps:
- Read quality reports before and after trimming, using FastQC.
- Read alignment statistics based on Samtools and Bowtie2 reports.
- featureCounts statistics report.
- Aggregated report made with MultiQC.
and visualization helper functions:
- Genome coverage computed with genomeCoverageBed.
- Calculation of the FASTA index.
- Creation of a JSON file with coverage data.
- Conversion of GTF to BED.
- Generation of a JSON file containing genomic features.
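As a quick reference for the normalization step above, the workflow uses the standard CPM and TPM definitions (these are general definitions, not anything specific to this workflow). For gene i with raw count c_i and length L_i:

CPM_i = c_i / (sum over genes j of c_j) * 10^6
TPM_i = (c_i / L_i) / (sum over genes j of c_j / L_j) * 10^6

CPM therefore corrects for sequencing depth only, while TPM additionally corrects for gene length, making TPM values comparable across genes within a sample.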
Steps are shown in the following graph. Each rectangle represents a step of the analysis, with rule names written in italics. Blue rectangles represent main steps, yellow ones show QC steps, and gray ones represent helper functions.
How to use
Prerequisites
Download Docker or install Miniconda. If you use Miniconda, choose Python 3.8 (or higher) for your operating system and follow the installation instructions.
conda --version # ideally 4.9.xx or higher
python --version # ideally 3.8.xx or higher
Step 1: Obtain a copy of this workflow
If you simply want to use this workflow, download and extract the latest release or clone it with:
git clone https://github.com/AnzeLovse/mag.git
If you intend to modify and further develop this workflow, fork this repository. Please consider providing any generally applicable modifications via a pull request.
Step 2: Configure workflow
Configure the workflow according to your needs via editing the files in the
config/
folder. Adjust
config.yaml
to configure the workflow execution,
samples.tsv
to specify your experiment setup and
units.tsv
to specify sequencing runs.
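For orientation, minimal sample sheets might look like the sketch below. This is illustrative only: the column names (sample, condition, replicate, time, fq1, fq2) are inferred from the bundled deseq2.R script, so check them against the example files shipped in config/ before use.

# samples.tsv (tab-separated)
sample   condition
wt       control
mito     mitomycin

# units.tsv (tab-separated); leave fq2 empty for single-end runs
sample   replicate   time   fq1                        fq2
wt       1           0      data/wt-t0-1_R1.fastq      data/wt-t0-1_R2.fastq
mito     1           0      data/mito-t0-1_R1.fastq    data/mito-t0-1_R2.fastq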
Step 3: Pull Docker image (recommended) or create Conda environment
To use Docker, pull the latest image (currently 0.1.5) from Docker Hub:
docker pull alovse/rnaseq-mag
The Conda environment only supports quantification, not differential expression. To set it up, use:
conda install -y -c conda-forge mamba
mamba create -q -y -c conda-forge -c bioconda -n snakemake snakemake python=3.8
conda env update -n snakemake --file /var/cache/build/environment.yaml
conda run -n snakemake python -m pip install rnanorm
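To sanity-check the environment before running anything (an optional step, not part of the original instructions):

conda run -n snakemake snakemake --version          # should print the Snakemake version
conda run -n snakemake python -c "import rnanorm"   # exits silently if rnanorm installed correctly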
Step 4: Execute workflow
Create a results directory and copy the configuration in it:
mkdir results_timecourse # create the directory
cp -r config results_timecourse # copy the configuration
Transfer reference genomes and reads into a subdirectory of the results directory. The results directory should follow the structure below. Please note that the 0.72.0/bio/fastqc wrapper only supports .fastq files, not .fq files.
├── config
│ ├── config.yaml
│ ├── samples.tsv
│ └── units.tsv
├── data
│ ├── reference.fasta
│ ├── annotation.gtf
│ └── reads.fastq
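For example, assuming the raw files live in /path/to/raw (hypothetical paths, shown only to illustrate the expected layout; file names must match the entries in units.tsv):

mkdir -p results_timecourse/data                           # create the data subdirectory
cp /path/to/raw/reference.fasta results_timecourse/data/   # reference genome
cp /path/to/raw/annotation.gtf results_timecourse/data/    # genome annotation
cp /path/to/raw/reads.fastq results_timecourse/data/       # sequencing reads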
To run the workflow with Docker, use:
cd mag # open the directory
docker run -m $M -v $(pwd):/data alovse/rnaseq-mag:latest /bin/bash -c "source activate snakemake; cd data; snakemake --cores $N $STEP"
using $N cores, $M of memory, and the desired step $STEP (e.g. all or all_with_de).
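For instance, a concrete invocation with example values filled in (8 GB of memory, 4 cores, and the all target; adjust to your machine) might be:

M=8g    # memory limit for the container
N=4     # number of cores
STEP=all
docker run -m $M -v $(pwd):/data alovse/rnaseq-mag:latest /bin/bash -c "source activate snakemake; cd data; snakemake --cores $N $STEP"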
Alternatively, you can use Conda. Activate the conda environment:
conda activate snakemake
Test your configuration by performing a dry-run via
snakemake -n all
Execute the workflow locally via
snakemake --cores $N $STEP
using $N cores and the desired step $STEP (e.g. all or all_with_de).
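For example, a dry-run followed by a four-core run of the quantification target (remember that the Conda environment does not support the differential expression target):

snakemake -n all          # dry-run: list the jobs that would be executed
snakemake --cores 4 all   # run the quantification workflow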
See the Snakemake documentation for further details.
Step 5: Investigate results
After successful execution, you can inspect the interactive HTML report made with MultiQC. It is located in
results_timecourse/qc/multiqc.html
For more information about MultiQC please see the official webpage and tool documentation.
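To open the report from the command line (a convenience, assuming a Linux desktop; on macOS use open instead of xdg-open):

xdg-open results_timecourse/qc/multiqc.html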
The directory structure of the results folder is shown below.
├── bedgraphs
├── config
├── counts
├── data
├── deseq2
├── index
├── logs
│ ├── bedgraphs
│ ├── bowtie2
│ ├── bowtie2_build
│ ├── cutadapt
│ ├── deseq2
│ ├── fastqc
│ ├── fastqc_posttrim
│ ├── feature_counts
│ ├── multiqc.log
│ └── samtools_stats
├── mapped_reads
├── qc
│ ├── fastqc
│ ├── fastqc_posttrim
│ ├── feature_counts
│ ├── multiqc.html
│ ├── multiqc_data
│ └── samtools_stats
├── sorted_reads
├── trimmed
└── visualisations
Step 6: Commit changes
Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:
git commit -a
git push
Step 7: Obtain updates from upstream
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following:
- Once, register the upstream repository in your local copy: git remote add -f upstream git@github.com:AnzeLovse/mag.git, or git remote add -f upstream https://github.com/AnzeLovse/mag.git if you have not set up SSH keys.
- Update the upstream version: git fetch upstream.
- Create a diff with the current version: git diff HEAD upstream/master workflow > upstream-changes.diff.
- Investigate the changes: vim upstream-changes.diff.
- Apply the modified diff via: git apply upstream-changes.diff.
- Carefully check whether you need to update the config files: git diff HEAD upstream/master config. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
Step 8: Contribute back
In case you have also changed or added steps, please consider contributing them back to the original repository:
- Fork the original repo to a personal or lab account.
- Clone the fork to your local system, to a different place than where you ran your analysis.
- Copy the modified files from your analysis to the clone of your fork, e.g., cp -r workflow path/to/fork. Make sure not to accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary.
- Commit and push your changes to your fork.
- Create a pull request against the original repository.
Code Snippets
The following excerpts are taken from the workflow's rule files and the wrapper scripts it uses.
wrapper:
    "0.72.0/bio/bowtie2/align"
shell:
    "samtools sort -@ {params.threads} -T sorted_reads/{wildcards.sample}-t{wildcards.time}-{wildcards.rep} -O bam -o {output} {input}"
shell:
    "samtools index -@ {params.threads} {input}"
shell:
    "featureCounts "
    "{params.extra} "
    "-a {params.annotation} "
    "-o {output.counts} "
    "-T {threads} "
    "{input.bam} > {log} 2>&1; "
    "mv \"{output.counts}.summary\" {output.summary}"
run:
    indices = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]
    frames = [
        pd.read_csv(fp, sep="\t", skiprows=1, index_col=indices)
        for fp in input
    ]
    merged = pd.concat(frames, axis=1)
    merged = merged.rename(columns=lambda c: Path(c).stem)
    merged.to_csv(output.complete, sep="\t", index=True)

    # Save counts compatible with rnanorm
    merged = merged.reset_index(level="Geneid")
    merged = merged.rename(columns={"Geneid": "FEATURE_ID"})
    merged.to_csv(output.raw, sep="\t", index=False)

    lengths = merged.reset_index(level="Length")
    lengths[["FEATURE_ID", "Length"]].to_csv(output.lengths, sep="\t", index=False)
shell:
    "rnanorm {input.counts} --cpm-output={output.cpm} --tpm-output={output.tpm} --gene-lengths={input.lengths}"
script:
    "../scripts/deseq2.R"
wrapper:
    "0.72.0-5-gd0d054c/bio/bowtie2/build"
wrapper:
    "0.72.0/bio/fastqc" # TODO update version to correctly remove .fq.gz ending


rule fastqc_post_trim:
    input:
        "trimmed/{sample}-t{time}-{rep}_{pair}.fastq.gz",
    output:
        html="qc/fastqc_posttrim/{sample}-t{time}-{rep}_R{pair}.html",
        zip="qc/fastqc_posttrim/{sample}-t{time}-{rep}_R{pair}_fastqc.zip"
wrapper:
    "0.72.0/bio/fastqc" # TODO update version to correctly remove .fq.gz ending


rule samtools_stats:
    input:
        bam="sorted_reads/{sample}-t{time}-{rep}.bam",
        idx="sorted_reads/{sample}-t{time}-{rep}.bam.bai"
shell:
    "samtools stats {input.bam} 1> {output} 2> {log}"
wrapper:
    "0.72.0/bio/multiqc"
wrapper:
    "0.72.0/bio/cutadapt/pe"
wrapper:
    "0.72.0/bio/cutadapt/se"
shell:
    "cat {input.annotation} | "
    "awk 'OFS=\"\\t\" {{if ($3==\"gene\") {{print $1,$4-1,$5,$10,$16,$7}}}}' | "
    "tr -d '\";' > {output.complete}"
run:
    annotation = pd.read_csv(
        input.annotation,
        sep="\t",
        header=None,
        names=["chr", "start", "end", "name", "type", "strand"]
    )
    annotation = annotation.rename(columns={"chr": "block_id"})

    pos = prepare_annotation(annotation, "+", "rgb(255,51,51)")
    neg = prepare_annotation(annotation, "-", "rgb(255,51,51)")
    joined = {"positive": pos, "negative": neg}

    with open(output.json, "w") as handle:
        json.dump(joined, handle, indent=4)
shell:
    "genomeCoverageBed "
    "-ibam {input.bam} "
    "{params.paired} "
    "-bga "
    "-strand {wildcards.strand} "
    "> {output.coverage}"
shell:
    "samtools faidx {input}"
run:
    fai = pd.read_csv(
        input.fai,
        sep="\t",
        header=None,
        names=["chr", "len"],
        usecols=[0, 1],
        index_col=0
    )
    lengths = dict(zip(fai.index, fai.len))
    coverage = {"length": lengths, "positive": {}, "negative": {}}

    coverage_pos = pd.read_csv(input.coverage_pos, sep="\t", header=None, names=["block_id", "start", "end", "value"])
    coverage_neg = pd.read_csv(input.coverage_neg, sep="\t", header=None, names=["block_id", "start", "end", "value"])

    for chrom in lengths.keys():
        pos = coverage_pos.loc[coverage_pos["block_id"] == chrom]
        neg = coverage_neg.loc[coverage_neg["block_id"] == chrom]
        coverage["positive"][chrom] = json.loads(pos.to_json(orient="records"))
        coverage["negative"][chrom] = json.loads(neg.to_json(orient="records"))

    with open(output.json, "w") as handle:
        json.dump(coverage, handle, indent=4)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")

counts <- read.csv(snakemake@input[["counts"]], sep="\t", row.names="Geneid")
counts <- subset(counts, select = -c(Chr, Start, End, Strand, Length))

samples <- read.csv(snakemake@params[["samples"]], sep="\t", stringsAsFactors = TRUE)
units <- read.csv(snakemake@params[["units"]], sep="\t", stringsAsFactors = TRUE)
coldata <- merge(unique(samples), units, by='sample')
coldata <- subset(coldata, select=-c(fq1, fq2))
coldata[] <- lapply(coldata, as.factor)
rownames(coldata) <- paste(coldata$sample, paste("t", coldata$time, sep=""), coldata$replicate, sep=".")

# Specific steps needed for our analysis
#--------------------------------------------------
counts$G_M.t0.1 <- counts$G.t0.1
counts$G_M.t0.2 <- counts$G.t0.2

zero_rows <- data.frame(rbind(
    c("G_M", "mitomycin", 1, 0),
    c("G_M", "mitomycin", 2, 0)
))
colnames(zero_rows) <- colnames(coldata)
rownames(zero_rows) <- c("G_M.t0.1", "G_M.t0.2")
coldata <- rbind(coldata, zero_rows)
#--------------------------------------------------

# Keep genes with more than 5 counts in at least 2 samples.
filter <- apply(counts, 1, function(x) length(x[x>5])>=2)
filtered <- counts[filter,]

# Create a DESeq data set with a full design.
ddsFull <- DESeqDataSetFromMatrix(
    countData = filtered,
    colData = coldata,
    design = as.formula(snakemake@params[["model"]]))

# Use a likelihood-ratio test against a reduced model.
dds <- DESeq(ddsFull, test="LRT", reduced = as.formula(snakemake@params[["reduced_model"]]))
saveRDS(dds, file = snakemake@output[["rds"]])

alpha <- as.numeric(snakemake@params[["alpha"]])
result <- results(dds, alpha=alpha)

# Construct a list of shrunken LFCs for the coefficients of the model.
logfc.list <- list()
logfc.list[[1]] <- data.frame(result)
for (i in 2:length(resultsNames(dds))) {
    coeff = resultsNames(dds)[i]
    res.t <- results(dds, name=coeff, test="Wald", alpha = alpha)
    out.t <- as.data.frame(res.t)
    out.t <- cbind(geneid = rownames(out.t), out.t)
    write.table(
        out.t,
        file=paste("deseq2/", coeff, ".tsv", sep=""),
        sep="\t",
        quote=FALSE,
        row.names=FALSE
    )

    res_shrunken <- lfcShrink(dds, coef=coeff, type="apeglm", res = res.t)
    out.table <- as.data.frame(res_shrunken)
    out.table <- cbind(geneid = rownames(out.table), out.table)
    write.table(
        out.table,
        file=paste("deseq2/", coeff, "_shrunken.tsv", sep=""),
        sep="\t",
        quote=FALSE,
        row.names=FALSE
    )

    svg(paste("deseq2/", coeff, "ma_plot.svg", sep=""))
    plotMA(result, ylim=c(-2,2))
    dev.off()

    # Add the shrunken log2FC to the list.
    lfc <- res_shrunken[, "log2FoldChange", drop=FALSE]
    colnames(lfc) <- c(paste(coeff, "sLFC", sep="."))
    logfc.list[[i]] <- lfc
}

# Write combined results to the main output.
out.result <- do.call(cbind, logfc.list)
out.result <- out.result[order(out.result$padj),]
out.result <- cbind(geneid = rownames(out.result), out.result)
write.table(
    out.result,
    file=snakemake@output[["table"]],
    sep="\t",
    quote=FALSE,
    row.names=FALSE
)

# Use variance stabilising transformation and then plot PCA.
vsd <- vst(ddsFull, blind=FALSE)
svg(snakemake@output[["pca_plot"]])
plotPCA(vsd, intgroup=c("condition", "time"), ntop = 500)
dev.off()
__author__ = "Daniel Standage"
__copyright__ = "Copyright 2020, Daniel Standage"
__email__ = "daniel.standage@nbacc.dhs.gov"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
indexbase = snakemake.output[0].replace(".1.bt2", "")
shell(
    "bowtie2-build --threads {snakemake.threads} {snakemake.params.extra} "
    "{snakemake.input.reference} {indexbase}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "-U {}".format(*snakemake.input.sample)
else:
    reads = "-1 {} -2 {}".format(*snakemake.input.sample)

shell(
    "(bowtie2 --threads {snakemake.threads} {extra} "
    "-x {snakemake.params.index} {reads} "
    "| samtools view -Sbh -o {snakemake.output[0]} -) {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell

n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " {snakemake.params.adapters}"
    " {snakemake.params.extra}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " -j {snakemake.threads}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell

n = len(snakemake.input)
assert n == 1, "Input must contain 1 (single-end) element."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " {snakemake.params.adapters}"
    " {snakemake.params.extra}"
    " -j {snakemake.threads}"
    " -o {snakemake.output.fastq}"
    " {snakemake.input[0]}"
    " > {snakemake.output.qc} {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
from snakemake.shell import shell

input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)