Introduction
cluster_rnaseq is a Snakemake pipeline that performs a comprehensive RNA-seq analysis, covering everything from the basic steps (QC, alignment, quantification) to more advanced downstream analyses (differential expression).
The pipeline makes extensive use of Snakemake's integration with the conda package manager to automatically take care of software requirements and dependencies.
We've built cluster_rnaseq for flexibility, with the aim of allowing the user to adjust the pipeline to different experiments using configuration parameters, including setting the pipeline to align or quantify with different tools. When in doubt, the default parameters were chosen to offer a good starting configuration.
Workflow overview
This is a schema of the complete workflow. Certain parts may be skipped depending on input data and chosen configuration.
Authors
- Daniel Cerdán-Vélez
- María José Jiménez-Santos
- Santiago García-Martín
Setup
Setting up the pipeline consists of modifying two configuration files, which set the desired parameters and the locations of the input/output files, plus two additional files: one indicating which samples enter the differential expression analysis (if it is going to be carried out), and the design matrix. A general description of these files follows. See the Usage section for more details.
Configuration files
- config.yaml: contains all pipeline parameters.
- units.tsv: contains information on the samples to be analysed and their paths.
- samples.tsv: indicates whether a sample is included in the differential expression analysis.
- designmatrix.tsv: indicates the experimental conditions of each sample.
Input files
- Raw data in gzip-compressed FASTQ files
Usage
1. Set up the environment
cluster_rnaseq requires the conda package manager in order to work. Please install conda by following the bioconda installation instructions. In addition, it is essential to install Snakemake, following the steps in the Snakemake documentation.
2. Download the cluster_rnaseq repository from GitHub.
Use the git clone command to create a local copy:
git clone https://github.com/cnio-bu/cluster_rnaseq.git
3. Configure the pipeline.
Before executing the pipeline, users must configure it according to their samples by filling in the following files:
TIP: different analyses can be run from a single cloned repository. This is achieved by changing outdir and logdir in the configuration file; each analysis can also use different parameter values.
a. config.yaml
This is the pipeline configuration file, where you can tune all the available parameters to customise your RNA-seq analysis. It specifies the aligner and quantifier to be used, the necessary input and output file paths, and a resource management section for each rule.
The example file (template_config.yaml) features extensive inline documentation. Rename it to config.yaml to use it during your execution.
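To give an idea of the shape of this file, a minimal excerpt might look like the sketch below. The key names here are hypothetical; rely on the inline documentation in template_config.yaml for the real ones.

```yaml
# Hypothetical excerpt -- see template_config.yaml for the actual key names
aligner: star          # e.g. star, hisat2 or salmon
quantifier: htseq      # e.g. htseq or featureCounts (unused when aligning with salmon)
outdir: results/exp1   # changing outdir/logdir lets several analyses share one clone
logdir: logs/exp1
```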
b. units.tsv
This file is used to configure the FASTQ input files.
An example file (template_units.tsv) is included in the repository. Rename it to units.tsv and edit its contents according to the following table:
| Field name | Description |
| --- | --- |
| sample | Sample name (must match the sample name specified in samples.tsv). |
| lane | Lane identifier from which the samples will be concatenated. |
| fq1 | Path to FASTQ file for read 1. |
| fq2 | Path to FASTQ file for read 2. |
| md5_fq1 | MD5 hash for the file in 'fq1'. |
| md5_fq2 | MD5 hash for the file in 'fq2'. |
The first three columns, "sample", "lane" and "fq1", are mandatory and define the sample name, lane and path for each sample. The other columns are optional and must be filled in if the experiment is paired-end ("fq2") or if you have the MD5 hashes ("md5_fq1" and "md5_fq2").
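For illustration, a paired-end units.tsv without the optional MD5 columns could look like this (sample names and paths are invented):

```
sample	lane	fq1	fq2
A	L001	data/A_L001_R1.fastq.gz	data/A_L001_R2.fastq.gz
A	L002	data/A_L002_R1.fastq.gz	data/A_L002_R2.fastq.gz
B	L001	data/B_L001_R1.fastq.gz	data/B_L001_R2.fastq.gz
```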
c. samples.tsv
This table contains the name of each sample and a flag that includes it in (diffexp = True) or excludes it from (diffexp = False) the differential expression analysis.
An example file (template_samples.tsv) is included in the repository. Rename it to samples.tsv and edit its contents.
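For example, to include samples A and B in the differential expression analysis while excluding C (sample names invented):

```
sample	diffexp
A	True
B	True
C	False
```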
d. designmatrix.tsv
This is the table with the experimental conditions of each sample. The control condition must be preceded by an asterisk. Extra columns can be added as batch variables to correct for batch effects.
An example file (template_designmatrix.tsv) is included in the repository. Rename it to designmatrix.tsv and edit its contents before performing the differential expression.
4. Create the Conda environments.
To run the pipeline, the user first needs to create the conda environments, which takes a few minutes. This step is done automatically with the following command:
snakemake --use-conda --conda-create-envs-only --conda-frontend mamba
5. Run the pipeline.
Once the pipeline is configured and conda environments are created, the user just needs to run cluster_rnaseq.
snakemake --use-conda -j 32
The mandatory arguments are:

- --use-conda: to install and use the conda environments.
- -j: the number of threads/jobs provided to Snakemake.
Pipeline steps
Here you can find a general description of the main steps of the pipeline.
1. MD5 check and FASTQ quality control.
MD5 check
If MD5 hashes are included in units.tsv, the pipeline generates the hashes for your files and compares them with the ones in units.tsv, raising an error if they do not match.
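If a check fails, you can recompute a hash manually with the standard md5sum utility and compare it against the value declared in units.tsv (the path below is only an example):

```bash
md5sum data/A_L001_R1.fastq.gz
```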
General QC
cluster_rnaseq implements FastQC to check the overall quality of the input FASTQ files.
Contamination
FastQ Screen can optionally be enabled in order to check the input FASTQ files for contaminants.
MultiQC report
cluster_rnaseq creates a quality control HTML report using MultiQC. This report includes information from FastQC, both for the original files and for the complete samples after concatenating them, as well as from the rest of the tools used in the execution.
2. Concatenation, downsampling & trimming.
Before starting the analysis, the sample files are concatenated by lane to obtain a single complete file per sample.
After this, there is the possibility of standardizing the number of reads per file by downsampling the samples. The downsampling parameter of config.yaml must be True, and the desired number of reads per sample must be indicated, as well as a seed that controls the randomness of the analysis and allows it to be replicated.
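As a sketch, the relevant configuration entries might look like this; the exact key names are assumptions, so check template_config.yaml for the real ones:

```yaml
downsampling: True   # enable read-count standardization
reads: 20000000      # desired number of reads per sample (invented value)
seed: 123            # seed controlling the random subsampling, for reproducibility
```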
Finally, trimming is performed to remove the adapters used during sequencing, using the bbduk tool from the BBTools package. By default, the adapters.fa file, which bundles different adapter sequences, is passed to the tool.
3. Alignment.
There are three different methods to align the reads: STAR, Salmon or HISAT2. You can choose the desired method in config.yaml.
STAR and HISAT2 are aligners and require one parameter, defined in the configuration file:

- The corresponding index path, whether the index already exists or not. If it does not exist, it can be generated through the pipeline (explained later) with:
  - An annotation file containing the features of interest (in GTF format; it must match the target genome)
  - A FASTA file with the genome
STAR output is sorted by coordinate internally, which is necessary for quantification. HISAT2 output is not, so it is processed with samtools sort to obtain an equivalent output.
Salmon is a pseudo-aligner that directly returns the counts for each sample, so none of the quantification tools in the next step are used. It also requires one parameter, defined in the configuration file:

- The corresponding index path, whether the index already exists or not. If it does not exist, it can be generated through the pipeline (explained later) with:
  - A FASTA file containing the reference transcriptome
  - The genome assembly
4. Quantification.
Two tools are available to quantify STAR and HISAT2 results: htseq-count and featureCounts. As with alignment, you can choose the desired method in config.yaml.
Before quantification, the BAM files with the alignments are indexed using samtools index, obtaining new .bam.bai files.
Both quantifiers use the BAM files generated by the aligners together with the .bam.bai indices generated before. In addition, they need the same GTF annotation file used by the aligner. Other parameters, such as strandedness, can be configured in the config.yaml file.
These quantification methods return a counts file for each sample. These files, like Salmon's, are transformed to generate the counts matrix, counts.tsv, from which the differential expression is performed.
5. Differential expression.
Differential expression is performed with DESeq2 and consists of two steps. The first step creates the DESeq object dds.rds that will be used to perform the differential expression, as well as a file with the normalized counts and another with the variance-stabilized values object, which will be used for plotting. It requires four inputs:
- The counts matrix generated after quantification.
- The design matrix file.
- The design of the experiment, specified in config.yaml (sketched below).
- The samples.tsv file, to know which samples should be taken into account.
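The design itself is an R-style formula built from the design matrix columns. As a hedged sketch (column names invented, key name assumed):

```yaml
design: ~ batch + condition
```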
Once the DESeq object is created, the second step carries out the differential expression from the dds.rds object generated above. For each contrast it returns four files.
Two equivalent files with the default DESeq2 results:

- (contrast)_diffexp.tsv, a plain-text file with the data separated by tabs.
- (contrast)_diffexp.xlsx, a file in Excel format with the data formatted in a similar way to Nextpresso output.
And two equivalent files with the DESeq2 results where the log2 fold change estimates are shrunken using the lfcShrink() function from DESeq2. This shrinkage of the log2 fold change estimates toward zero is performed when the information for a gene is low (low counts and high dispersion values). These two files contain the same columns as the default results, except for the 'stat' column, which is not present in the output with shrunken log2 fold change estimates:
- (contrast)_lfcShrink_diffexp.tsv, a plain-text file with the data separated by tabs.
- (contrast)_lfcShrink_diffexp.xlsx, a file in Excel format with the data formatted in a similar way to Nextpresso output.
6. Plots.
Once the DESeq object has been generated, different plots can be obtained for each of the contrasts:
- PCA: represents the distribution of the samples based on their first three principal components.
- Distance: computes the Euclidean distance between samples to check whether samples belonging to the same condition are similar.
- Expression heatmap: from the normalized counts, creates a heatmap with the 25 most upregulated and the 25 most downregulated genes.
- MA plot: visualizes the differences between measurements by transforming the data onto M (log ratio) and A (mean average) scales.
7. Optional steps.
This pipeline has two additional options that, although they may be included in the ordinary execution, are also useful on their own.
7.1 Index generation
As mentioned in point 3, performing the alignments requires an index for each aligner. This index can be created beforehand, in which case you simply indicate the path where it is located. If it does not exist, it can be generated automatically by adding to config.yaml the parameters indicated in point 3.
For STAR and HISAT2:

- An annotation file containing the features of interest (in GTF format; it must match the target genome)
- A FASTA file with the genome
- The output path where the index will be created
For Salmon:

- A FASTA file containing the reference transcriptome
- The genome assembly
- The output path where the index will be created
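Putting the two lists together, a hypothetical config.yaml fragment for index generation could be laid out as below; the key names are illustrative only, so consult template_config.yaml for the actual structure:

```yaml
star:
  index: resources/star_index        # used if it exists, generated here otherwise
  annotation: resources/genome.gtf   # must match the target genome
  fasta: resources/genome.fa
salmon:
  index: resources/salmon_index
  transcriptome: resources/transcriptome.fa
  assembly: GRCh38                   # genome assembly (example value)
```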
7.2 Complete MultiQC report
Each execution of the pipeline returns a MultiQC report with the information of the aligner and the quantifier used in that run, in addition to the reports from the rest of the tools. However, it is also possible to generate a MultiQC file that contains the results of all the executed tools, making it possible to compare, for example, the results of different aligners.
Target rules
cluster_rnaseq features a shortcut system based on specific target rules related to the pipeline's steps. Each target calls an end-point rule which terminates the pipeline execution.
To use the shortcuts, you only need to run the pipeline as usual, but specifying the target:
snakemake --use-conda -j 32 target_rule
The available targets are:

- all: executes the whole pipeline, from the original files to the creation of the plots. It is the same as not indicating any rule.
- files_qc: checks MD5 hashes and performs quality control analysis with FastQC on the original files.
- trimming: runs cluster_rnaseq up to and including the trimming step.
- alignment: runs cluster_rnaseq up to and including the alignment step.
- quantification: runs cluster_rnaseq until the counts matrix is obtained.
- diffexp: runs cluster_rnaseq up to and including the differential expression step.
- plots: runs cluster_rnaseq until the final plots are obtained.
- index: creates an index for the aligner chosen in config.yaml.
- multiqc_all: searches the results folder for any file that can be added to a MultiQC report and creates the report.
Additionally, the user can use the Snakemake rule names as targets; these are available in the config.yaml file.
Cluster Usage
This pipeline is part of the tools developed by the CNIO Bioinformatics Unit. Although it can be launched on any machine, it is designed to be executed in environments with high computational capacity, such as the centre's cluster. The cluster uses Slurm as its task manager, so to launch the pipeline properly you must add the Slurm profile created for launching Snakemake tools. The command is:
snakemake --use-conda --profile $SMK_PROFILE_SLURM -j 32
In addition, to save the user time and space, there are shared resources in the CNIO cluster that can be referenced from cluster_rnaseq, such as the aligner indices, which take time to create. The file template_config.yaml contains the paths to some of these resources, which are updated periodically.
Configuration of computation resources
The user can configure cluster_rnaseq to optimise the available computational resources and reduce computation time. The optimization is achieved thanks to Snakemake's ability to parallelize jobs and to assign specific resources to each rule in a simple way. Resource configuration is done through the configuration file, which has a field called resources where the user can define the RAM usage, the number of threads (if the rule admits multithreading) and the maximum execution time available to a specific step. If the user does not provide a value for some of these fields, the pipeline uses the default values.
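As a hedged illustration of such a resources block (rule and field names invented; the template lists the real ones):

```yaml
resources:
  default:           # used when a rule does not set its own values
    threads: 1
    mem_mb: 4096
    walltime: 60     # maximum execution time, in minutes
  star_align:        # illustrative rule name
    threads: 8
    mem_mb: 32000
```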


Code Snippets
```python
shell: 'salmon quant -i {input.salmon_index} -l {params.libtype} -r {input.se_reads} --validateMappings -o {params.quant_directory} --threads {threads}'
```

```python
shell: 'salmon quant -i {input.salmon_index} -l {params.libtype} -1 {input.r1_reads} -2 {input.r2_reads} --validateMappings -o {params.quant_directory} --threads {threads}'
```

```python
script: "../scripts/star.py"
```

```python
script: "../scripts/star.py"
```

```python
wrapper: "0.74.0/bio/hisat2/align"
```

```python
shell: 'samtools sort -@ {threads} -o {output.sortedCoord} {input.aligned}'
```

```python
script: "../scripts/deseq2_init.R"
```

```python
script: "../scripts/deseq2_diffexp.R"
```
```python
shell:
    """
    grep "^>" <(gunzip -c {input.genome}) | cut -d " " -f 1 > {output.decoys}
    sed -i -e 's/>//g' {output.decoys}
    """
```

```python
shell: 'cat {input.transcriptome} {input.genome} > {output.gentrome}'
```

```python
shell: 'salmon index -t {input.gentrome} -d {input.decoys} -p {threads} -i {output} {params.gencode}'
```

```python
wrapper: "0.74.0/bio/star/index"
```

```python
wrapper: "0.74.0/bio/hisat2/index"
```
```python
script: "../scripts/pca.R"
```

```python
script: "../scripts/MAplot.R"
```

```python
script: "../scripts/distances.R"
```

```python
script: "../scripts/topbottomDEgenes.R"
```
```python
shell: 'cat {input} > {output}'
```

```python
shell: 'cat {input} > {output}'
```

```python
wrapper: "v1.25.0/bio/bbtools/bbduk"
```

```python
wrapper: "v1.25.0/bio/bbtools/bbduk"
```

```python
wrapper: "0.74.0/bio/seqtk/subsample/se"
```

```python
wrapper: "0.74.0/bio/seqtk/subsample/pe"
```
```python
script: "../scripts/md5.py"
```

```python
wrapper: "0.74.0/bio/fastqc"
```

```python
shell:
    """
    fastq_screen --threads {threads} --get_genomes --outdir {params.outdir}/ &> {log}
    """
```

```python
wrapper: "v1.23.4/bio/fastq_screen"
```

```python
script: "../scripts/multiqc.py"
```

```python
wrapper: "0.74.0/bio/fastqc"
```

```python
wrapper: "v1.23.4/bio/fastq_screen"
```

```python
script: "../scripts/multiqc.py"
```

```python
script: "../scripts/multiqc.py"
```
```python
shell: 'samtools index -@ {threads} {input.aligned}'
```

```python
shell: 'htseq-count {params.extra} {params.mode} {params.strandedness} {input.bam_file} {params.annotation} > {output.quant} 2> {log}'
```

```python
script: "../scripts/htseq_count_matrix.R"
```

```python
shell: 'featureCounts {params.args} -T {threads} {params.extra} {params.strandedness} -a {params.annotation} -o {output.quant} {input.bam_file} &> {log}'
```

```python
script: "../scripts/fcounts_count_matrix.py"
```

```python
script: '../scripts/salmon_matrix_from_quant.R'
```
```r
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

suppressMessages(library("DESeq2"))
suppressMessages(library("openxlsx"))
suppressMessages(library("org.Hs.eg.db"))
suppressMessages(library("AnnotationDbi"))

## PARALLELIZATION ##
parallel <- FALSE
if (snakemake@threads > 1) {
    suppressMessages(library("BiocParallel"))
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

## SNAKEMAKE I/O ##
dds <- snakemake@input[["dds"]]

## SNAKEMAKE PARAMS ##
condition <- snakemake@params[["condition"]]
levels <- snakemake@params[["levels"]]

## CODE ##
# Get dds
dds <- readRDS(dds)

# Get results
res <- results(dds, contrast = c(condition, levels), alpha = 0.05, parallel = parallel)

# Annotate the GeneSymbol and complete GeneName from the ENSEMBL Gene ID.
ensg_res <- rownames(res)
ensemblGene_DEA <- gsub("\\.[0-9]*$", "", ensg_res)
ensg_symbol <- as.data.frame(mapIds(org.Hs.eg.db, keys = ensemblGene_DEA,
                                    column = "SYMBOL", keytype = "ENSEMBL"))
colnames(ensg_symbol) <- "GeneSymbol"
ensg_genename <- as.data.frame(mapIds(org.Hs.eg.db, keys = ensemblGene_DEA,
                                      column = "GENENAME", keytype = "ENSEMBL"))
colnames(ensg_genename) <- "GeneName"
annot <- merge(ensg_symbol, ensg_genename, by = "row.names", all = TRUE)
rownames(res) <- ensemblGene_DEA
res <- merge(as.data.frame(res), annot, by.x = "row.names", by.y = 1, all = TRUE)
ENSG_res_list <- as.list(ensg_res)
names(ENSG_res_list) <- ensemblGene_DEA
rownames(res) <- ENSG_res_list[res$Row.names]
col_order <- c("GeneSymbol", "GeneName", "baseMean", "log2FoldChange",
               "lfcSE", "stat", "pvalue", "padj")
res <- res[, col_order]

# Sort by adjusted p-value
res <- res[order(res$padj, decreasing = FALSE), ]

# Save tsv
write.table(res, file = snakemake@output[["tsv"]], sep = "\t", quote = FALSE, col.names = NA)

# Add Name column
res$EnsemblGeneID <- rownames(res)
res <- res[c("EnsemblGeneID", colnames(res)[1:(ncol(res)-1)])]

# Green and red styles for formatting excel
redStyle <- createStyle(fontColour = "#FF1F00", bgFill = "#F6F600")
redPlainStyle <- createStyle(fontColour = "#FF1F00")
greenStyle <- createStyle(fontColour = "#008000", bgFill = "#F6F600")
greenPlainStyle <- createStyle(fontColour = "#008000")
boldStyle <- createStyle(textDecoration = c("BOLD"))

# Excel file
wb <- createWorkbook()
sheet <- "Sheet1"
addWorksheet(wb, sheet)

# Legend
legend <- t(data.frame(paste("Upregulated in", levels[1]),
                       paste("Downregulated in", levels[1]),
                       "FDR=0.05"))
writeData(wb, sheet, legend[, 1])
addStyle(wb, sheet, cols = 1, rows = 1, style = createStyle(fontColour = "#FF1F00", fgFill = "#F6F600"))
addStyle(wb, sheet, cols = 1, rows = 2, style = createStyle(fontColour = "#008000", fgFill = "#F6F600"))
addStyle(wb, sheet, cols = 1, rows = 3, style = boldStyle)
invisible(sapply(1:3, function(i) mergeCells(wb, sheet, cols = 1:3, rows = i)))

# Reorder genes according to adjusted p-value
writeData(wb, sheet, res, startRow = 6)
addStyle(wb, sheet, cols = 1:ncol(res), rows = 6, style = boldStyle, gridExpand = TRUE)
conditionalFormatting(wb, sheet, cols = 1:ncol(res), rows = 7:(nrow(res)+6),
                      rule = "AND($E7>0, $I7<0.05, NOT(ISBLANK($I7)))", style = redStyle)
conditionalFormatting(wb, sheet, cols = 1:ncol(res), rows = 7:(nrow(res)+6),
                      rule = "AND($E7>0, OR($I7>0.05, ISBLANK($I7)))", style = redPlainStyle)
conditionalFormatting(wb, sheet, cols = 1:ncol(res), rows = 7:(nrow(res)+6),
                      rule = "AND($E7<0, $I7<0.05, NOT(ISBLANK($I7)))", style = greenStyle)
conditionalFormatting(wb, sheet, cols = 1:ncol(res), rows = 7:(nrow(res)+6),
                      rule = "AND($E7<0, OR($I7>0.05, ISBLANK($I7)))", style = greenPlainStyle)
setColWidths(wb, sheet, 1:ncol(res), widths = 13)

# Save excel
saveWorkbook(wb, file = snakemake@output[["xlsx"]], overwrite = TRUE)

# GET SHRUNKEN LOG FOLD CHANGES.
coef <- paste0(c(condition, levels[1], "vs", levels[2]), collapse = "_")
res_shrink <- lfcShrink(dds, coef = coef, type = "apeglm")

# Annotate the GeneSymbol and complete GeneName from the ENSEMBL Gene ID.
ensg_res <- rownames(res_shrink)
ensemblGene_DEA <- gsub("\\.[0-9]*$", "", ensg_res)
ensg_symbol <- as.data.frame(mapIds(org.Hs.eg.db, keys = ensemblGene_DEA,
                                    column = "SYMBOL", keytype = "ENSEMBL"))
colnames(ensg_symbol) <- "GeneSymbol"
ensg_genename <- as.data.frame(mapIds(org.Hs.eg.db, keys = ensemblGene_DEA,
                                      column = "GENENAME", keytype = "ENSEMBL"))
colnames(ensg_genename) <- "GeneName"
annot <- merge(ensg_symbol, ensg_genename, by = "row.names", all = TRUE)
rownames(res_shrink) <- ensemblGene_DEA
res_shrink <- merge(as.data.frame(res_shrink), annot, by.x = "row.names", by.y = 1, all = TRUE)
ENSG_res_list <- as.list(ensg_res)
names(ENSG_res_list) <- ensemblGene_DEA
rownames(res_shrink) <- ENSG_res_list[res_shrink$Row.names]
col_order <- c("GeneSymbol", "GeneName", "baseMean", "log2FoldChange",
               "lfcSE", "pvalue", "padj")
res_shrink <- res_shrink[, col_order]

# Sort by adjusted p-value
res_shrink <- res_shrink[order(res_shrink$padj, decreasing = FALSE), ]

# Save tsv
write.table(res_shrink, file = snakemake@output[["tsv_lfcShrink"]], sep = "\t",
            quote = FALSE, col.names = NA)

# Add Name column
res_shrink$EnsemblGeneID <- rownames(res_shrink)
res_shrink <- res_shrink[c("EnsemblGeneID", colnames(res_shrink)[1:(ncol(res_shrink)-1)])]

# Green and red styles for formatting excel
redStyle <- createStyle(fontColour = "#FF1F00", bgFill = "#F6F600")
redPlainStyle <- createStyle(fontColour = "#FF1F00")
greenStyle <- createStyle(fontColour = "#008000", bgFill = "#F6F600")
greenPlainStyle <- createStyle(fontColour = "#008000")
boldStyle <- createStyle(textDecoration = c("BOLD"))

# Excel file
wb <- createWorkbook()
sheet <- "Sheet1"
addWorksheet(wb, sheet)

# Legend
legend <- t(data.frame(paste("Upregulated in", levels[1]),
                       paste("Downregulated in", levels[1]),
                       "FDR=0.05"))
writeData(wb, sheet, legend[, 1])
addStyle(wb, sheet, cols = 1, rows = 1, style = createStyle(fontColour = "#FF1F00", fgFill = "#F6F600"))
addStyle(wb, sheet, cols = 1, rows = 2, style = createStyle(fontColour = "#008000", fgFill = "#F6F600"))
addStyle(wb, sheet, cols = 1, rows = 3, style = boldStyle)
invisible(sapply(1:3, function(i) mergeCells(wb, sheet, cols = 1:3, rows = i)))

# Reorder genes according to adjusted p-value
writeData(wb, sheet, res_shrink, startRow = 6)
addStyle(wb, sheet, cols = 1:ncol(res_shrink), rows = 6, style = boldStyle, gridExpand = TRUE)
conditionalFormatting(wb, sheet, cols = 1:ncol(res_shrink), rows = 7:(nrow(res_shrink)+6),
                      rule = "AND($E7>0, $H7<0.05, NOT(ISBLANK($H7)))", style = redStyle)
conditionalFormatting(wb, sheet, cols = 1:ncol(res_shrink), rows = 7:(nrow(res_shrink)+6),
                      rule = "AND($E7>0, OR($H7>0.05, ISBLANK($H7)))", style = redPlainStyle)
conditionalFormatting(wb, sheet, cols = 1:ncol(res_shrink), rows = 7:(nrow(res_shrink)+6),
                      rule = "AND($E7<0, $H7<0.05, NOT(ISBLANK($H7)))", style = greenStyle)
conditionalFormatting(wb, sheet, cols = 1:ncol(res_shrink), rows = 7:(nrow(res_shrink)+6),
                      rule = "AND($E7<0, OR($H7>0.05, ISBLANK($H7)))", style = greenPlainStyle)
setColWidths(wb, sheet, 1:ncol(res_shrink), widths = 13)

# Save excel
saveWorkbook(wb, file = snakemake@output[["xlsx_lfcShrink"]], overwrite = TRUE)
```
```r
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

suppressMessages(library("DESeq2"))
suppressMessages(library("tidyr"))
suppressMessages(library("dplyr"))
suppressMessages(library("limma"))

## PARALLELIZATION ##
parallel <- FALSE
if (snakemake@threads > 1) {
    suppressMessages(library("BiocParallel"))
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

## SNAKEMAKE I/O ##
counts <- snakemake@input[["counts"]]

## SNAKEMAKE PARAMS ##
samples <- snakemake@params[["samples"]]
designmatrix <- snakemake@params[["designmatrix"]]
ref <- snakemake@params[["ref_levels"]]
design <- snakemake@params[["design"]]
batch_variables <- snakemake@params[["batch"]]

## FUNCTION ##
factor_relevel <- function(x, reference) {
    relev <- relevel(as.factor(gsub("^\\*", "", as.character(x))), ref = reference[cur_column()])
    return(relev)
}

## CODE ##
# Get counts
counts <- read.table(counts, header = TRUE, row.names = 1)

# Get samples for DEA
samples <- read.table(samples, header = TRUE, row.names = 1)
DEAsamples <- rownames(samples)[samples$diffexp == "True"]

# Get design matrix
designmatrix <- read.table(designmatrix, header = TRUE, row.names = 1)

# Subset the count matrix keeping only samples to be tested
counts <- counts %>% select(all_of(DEAsamples))

# Subset the design matrix keeping only samples to be tested
designmatrix <- designmatrix[colnames(counts), , drop = FALSE]

# Set names to reference levels for each column in design matrix
ref <- setNames(ref, colnames(designmatrix))

# Remove '*' prefix from design matrix cells, convert all columns to factors
# and relevel using ref
designmatrix <- designmatrix %>% mutate(across(everything(), factor_relevel, reference = ref))

# DESeq2 from htseqCount output
dds <- DESeqDataSetFromMatrix(counts, colData = designmatrix, design = as.formula(design))

# Pre-filtering
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

# DEA
dds <- DESeq(dds, parallel = parallel)

# Normalized counts
norm_counts <- counts(dds, normalized = TRUE)

# Data transformation
vsd <- vst(dds, blind = FALSE)

# Save objects
saveRDS(dds, file = snakemake@output[["dds"]])
write.table(norm_counts, file = snakemake@output[["normalized_counts"]],
            sep = "\t", quote = FALSE, col.names = NA)
saveRDS(vsd, file = snakemake@output[["vst"]][1])

# If there is a batch, perform batch correction
if (!is.null(batch_variables)) {
    ### colData
    coldata <- as.data.frame(colData(vsd))
    # Batch correct the vst
    batch <- coldata %>% unite("combined", all_of(batch_variables), sep = ":") %>%
        select(combined) %>% pull
    assay(vsd) <- removeBatchEffect(assay(vsd), batch = batch)
    saveRDS(vsd, file = snakemake@output[["vst"]][2])
}
```
```r
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

suppressMessages(library("DESeq2"))
suppressMessages(library("dplyr"))
suppressMessages(library("scales"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("ComplexHeatmap"))
source("scripts/levelFunctions.R")

## SNAKEMAKE I/O ##
vsd <- snakemake@input[["vst"]]

## SNAKEMAKE PARAMS ##
designmatrix <- snakemake@params[["designmatrix"]]
levels <- snakemake@params[["levels"]]
ref <- snakemake@params[["ref_levels"]]

## CODE ##
# Get vst
vsd <- readRDS(vsd)

# Get design matrix
designmatrix <- read.table(designmatrix, header = TRUE, row.names = 1)

# colData
coldata <- as.data.frame(colData(vsd)) %>% select(colnames(designmatrix))

# Set names to reference levels for each column in coldata
ref <- setNames(ref, colnames(coldata))

# Convert all coldata columns to factors and relevel
coldata <- coldata %>% mutate(across(everything(), factor_relevel, reference = ref))

# Assign a color to each level of each variable
levels_colors <- color_levels(coldata)

# Subset the sample names according to the specified levels
coldata <- coldata %>% filter(condition %in% levels)
samples <- rownames(coldata)
levels_colors <- setNames(lapply(names(levels_colors), function(y) {
    levels_colors[[y]][names(levels_colors[[y]]) %in% unique(coldata[[y]])]
}), names(levels_colors))

# Euclidean sample distances
sampleDist <- as.matrix(dist(t(assay(vsd[, samples])), method = "euclidean"))

# Colors
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)

# Heatmaps
pdf(snakemake@output[["pdf"]], width = 9, height = 7)
pheatmap(sampleDist, name = "Euclidean Distance", color = colors,
         annotation_col = coldata, annotation_colors = levels_colors)
dev.off()

png(snakemake@output[["png"]], width = 9, height = 7, units = "in", res = 600)
pheatmap(sampleDist, name = "Euclidean Distance", color = colors,
         annotation_col = coldata, annotation_colors = levels_colors)
dev.off()
```
```python
import pandas as pd
import numpy as np

samples = pd.read_table(snakemake.params.samples).set_index("sample", drop=False)

counts = [pd.read_table(f, index_col=0, usecols=[0, 6], header=None, skiprows=2)
          for f in snakemake.input]

for t, sample in zip(counts, samples.index):
    t.columns = [sample]

count_matrix = pd.concat(counts, axis=1)
count_matrix.index.name = "geneID"
# collapse technical replicates
count_matrix = count_matrix.groupby(count_matrix.columns, axis=1).sum()
count_matrix = count_matrix.apply(np.floor)
count_matrix.to_csv(snakemake.output[0], sep="\t")
```
```r
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

## SNAKEMAKE I/O ##
htseqcount <- snakemake@input[["quant"]]

## CODE ##
# Get counts
counts <- lapply(htseqcount, read.table, header = FALSE, row.names = 1)

# Merge counts
counts <- do.call("cbind", counts)

# Remove last rows
counts <- counts[!startsWith(rownames(counts), "__"), ]

# Add colnames
samples <- sapply(htseqcount, function(x) {
    sub(pattern = "(.*?)\\..*$", replacement = "\\1", basename(x))
})
colnames(counts) <- samples

# Save object
write.table(counts, file = snakemake@output[["counts"]], sep = "\t",
            quote = FALSE, col.names = NA, row.names = TRUE)
```
```r
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

suppressMessages(library("DESeq2"))
suppressMessages(library("ggpubr"))
suppressMessages(library("ggplot2"))
suppressMessages(library("patchwork"))

## PARALLELIZATION ##
parallel <- FALSE
if (snakemake@threads > 1) {
    suppressMessages(library("BiocParallel"))
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

## SNAKEMAKE I/O ##
dds <- snakemake@input[["dds"]]

## SNAKEMAKE PARAMS ##
condition <- snakemake@params[["condition"]]
levels <- snakemake@params[["levels"]]

## CODE ##
# Get dds
dds <- readRDS(dds)

# Get results
res <- results(dds, contrast = c(condition, levels), alpha = 0.05, parallel = parallel)

# Log Fold Change shrinkage
coef <- paste0(c(condition, levels[1], "vs", levels[2]), collapse = "_")
res_apeglm <- lfcShrink(dds, coef = coef, type = "apeglm")
res_normal <- lfcShrink(dds, coef = coef, type = "normal")
res_ashr <- lfcShrink(dds, coef = coef, type = "ashr")

# MA plots
MA_apeglm <- ggmaplot(res_apeglm, fdr = 0.05, fc = 1.5, size = 0.7, top = 10,
                      main = "apeglm", legend = "top", font.main = "bold",
                      font.legend = "bold", font.label = c("bold", 11),
                      label.rectangle = TRUE, ggtheme = theme_minimal()) +
    theme(plot.title = element_text(hjust = 0.5))
MA_normal <- ggmaplot(res_normal, fdr = 0.05, fc = 1.5, size = 0.7, top = 10,
                      main = "normal", legend = "top", font.main = "bold",
                      font.legend = "bold", font.label = c("bold", 11),
                      label.rectangle = TRUE, ggtheme = theme_minimal()) +
    theme(plot.title = element_text(hjust = 0.5))
MA_ashr <- ggmaplot(res_ashr, fdr = 0.05, fc = 1.5, size = 0.7, top = 10,
                    main = "ashr", legend = "top", font.main = "bold",
                    font.legend = "bold", font.label = c("bold", 11),
                    label.rectangle = TRUE, ggtheme = theme_minimal()) +
    theme(plot.title = element_text(hjust = 0.5))

# Make patchwork with the three plots
MA <- MA_apeglm + MA_normal + MA_ashr + plot_layout(widths = rep(5, 3))

# Save MA plot
ggsave(filename = snakemake@output[["pdf"]], plot = MA, width = 20)
ggsave(filename = snakemake@output[["png"]], plot = MA, dpi = 600, width = 20)
```
```python
import hashlib
import os
import sys

import pandas as pd
import numpy as np


def get_md5(read):
    cur_dir = os.getcwd()
    read_path = os.path.join(cur_dir, read)
    if os.path.isfile(read_path):
        with open(read_path, "rb") as file_name:
            content = file_name.read()
        aux_var = hashlib.md5()
        aux_var.update(content)
        return aux_var.hexdigest()
    else:
        return 'missing_file'


## SNAKEMAKE I/O ##
units = snakemake.input['units']

## SNAKEMAKE PARAMS ##
logdir = snakemake.params['logdir']

units = pd.read_csv(units, sep='\t')

md5s = units.set_index('fq1')['md5_fq1'].to_dict()
md5s.update(units.set_index('fq2')['md5_fq2'].to_dict())

md5_check = pd.DataFrame.from_dict(md5s, orient='index', columns=['md5'])
md5_check['fastq'] = md5_check.index
md5_check = md5_check.reindex(columns=['fastq', 'md5'])
md5_check = md5_check.dropna(subset=["fastq", "md5"])

md5_check['md5_fq_check'] = md5_check[md5_check['fastq'].notnull()]['fastq'].apply(get_md5)

if not md5_check.empty:
    md5_check['md5_fq_check'] = np.where(
        md5_check['md5_fq_check'].str.lower() == md5_check['md5'].str.lower(),
        'Passed', 'Failed')

md5_check[md5_check['md5'].notnull()].to_csv(f"{logdir}/md5.done.tsv", sep='\t', index=False)

if (md5_check[md5_check['md5'].notnull()]['md5_fq_check'] == 'Failed').any():
    sys.exit(f"\nSome files did not pass the checksum verification.\n"
             f"Check the report at {logdir}/md5.done.tsv\nExiting...\n")
else:
    sys.exit(0)
```
```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path

from snakemake.shell import shell

input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
```
```r
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

suppressMessages(library("DESeq2"))
suppressMessages(library("stringr"))
suppressMessages(library("dplyr"))
suppressMessages(library("tidyr"))
suppressMessages(library("scales"))
suppressMessages(library("ggplot2"))
suppressMessages(library("ggrepel"))
suppressMessages(library("patchwork"))
source("scripts/plotPCA.3.R")
source("scripts/levelFunctions.R")

## SNAKEMAKE I/O ##
vsd <- snakemake@input[["vst"]]

## SNAKEMAKE PARAMS ##
designmatrix <- snakemake@params[["designmatrix"]]
levels <- snakemake@params[["levels"]]
design <- snakemake@params[["design"]]
ref <- snakemake@params[["ref_levels"]]

## CODE ##
# Get vst
vsd <- readRDS(vsd)

# Get design matrix
designmatrix <- read.table(designmatrix, header = TRUE, row.names = 1)

# Get variables to plot by from design
variables <- unlist(str_split(design, pattern = "\\*|:|\\+"))
variables <- rev(str_trim(str_remove(variables, "~")))

# colData
coldata <- as.data.frame(colData(vsd)) %>% select(colnames(designmatrix))

# Set names to reference levels for each column in coldata
ref <- setNames(ref, colnames(coldata))

# Convert all coldata columns to factors and relevel
coldata <- coldata %>% mutate(across(everything(), factor_relevel, reference = ref))

# Get variables' all combined levels
all_levels <- expand.grid(rev(lapply(coldata, levels))) %>%
    unite("combined", all_of(variables), sep = ":") %>%
    select(combined) %>% pull

# Assign a color to each level
colors <- setNames(hue_pal()(length(all_levels)), all_levels)

# Subset the sample names according to the specified levels
coldata <- coldata %>% filter(get(variables[1]) %in% levels)
samples <- rownames(coldata)

# Keep only specified levels
levels <- coldata %>%
    unite("combined", all_of(variables), sep = ":") %>%
    select(combined) %>% pull %>% unique

# Subset the colors according to the specified levels
colors <- colors[levels]

# Subset samples
vsd <- vsd[, samples]

# PCA plots for the first 3 principal components
pca12 <- plotPCA.3(vsd, intgroup = variables) +
    scale_color_manual(values = colors, breaks = all_levels)
pca13 <- plotPCA.3(vsd, intgroup = variables, pc = c(1, 3)) +
    scale_color_manual(values = colors, breaks = all_levels)
pca23 <- plotPCA.3(vsd, intgroup = variables, pc = c(2, 3)) +
    scale_color_manual(values = colors, breaks = all_levels)

# Make patchwork with the three plots
pca <- pca12 + pca13 + pca23 +
    plot_layout(guides = "collect", widths = rep(1, 3)) &
    theme(legend.position = "bottom")

# Save PCA plot
ggsave(filename = snakemake@output[["pdf"]], plot = pca, width = 20)
ggsave(filename = snakemake@output[["png"]], plot = pca, dpi = 600, width = 20)
```
```r
library('DESeq2')
library('readr')
library('tximeta')

## SNAKEMAKE I/O ##
metadata_cache <- snakemake@output[['metadata_cache']]
tximeta_transcripts <- snakemake@output[['transcript_estimates']]
gene_level_matrix <- snakemake@output[['gene_level_matrix']]

# NOTE: We don't really fetch the expanded list of quant files from the
# snakemake object since we don't really need it. It's there so that
# this rule does not get executed until all samples have been quantified.

## SNAKEMAKE PARAMS ##
salmon_quant_directory <- snakemake@params[['salmon_quant_directory']]
samples_files <- snakemake@params[['samples']]

## SNAKEMAKE LOG ##
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

# SET metadata cache folder
setTximetaBFC(metadata_cache)

samples <- read.csv(samples_files, sep = "\t", header = TRUE)
rownames(samples) <- samples$sample

quant_files <- file.path(salmon_quant_directory, samples$sample, "quant.sf")

# tximeta looks for two columns: names for samples and files for paths
samples$files <- quant_files
samples$names <- samples$sample

se <- tximeta(coldata = samples, type = 'salmon')
gse <- summarizeToGene(se)

## This may seem dumb but it let us use the DESeqDataSet wrapper
## to transform gene-level estimates to integer counts taking into
## account the average transcript lengths from tximeta
## An alternative is to just round them up, as with tximport.
dds <- DESeqDataSet(se = gse, design = ~1) # blind design
raw_counts <- counts(dds, normalized = FALSE)

saveRDS(se, tximeta_transcripts)
write.table(raw_counts, file = gene_level_matrix, sep = "\t",
            quote = FALSE, col.names = NA, row.names = TRUE)
```
```python
import os

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = (
    [snakemake.input.fq1]
    if isinstance(snakemake.input.fq1, str)
    else snakemake.input.fq1
)
fq2 = snakemake.input.get("fq2")
if fq2:
    fq2 = (
        [snakemake.input.fq2]
        if isinstance(snakemake.input.fq2, str)
        else snakemake.input.fq2
    )
    assert len(fq1) == len(
        fq2
    ), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand gunzip -c"
else:
    readcmd = ""

outprefix = os.path.dirname(snakemake.output[0]) + "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.input.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}"
)
```
```r
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

suppressMessages(library("plyr"))
suppressMessages(library("dplyr"))
suppressMessages(library("tibble"))
suppressMessages(library("scales"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("ComplexHeatmap"))
source("scripts/levelFunctions.R")

## SNAKEMAKE I/O ##
norm_counts <- snakemake@input[["normalized_counts"]]
diffexp <- snakemake@input[["diffexp"]]

## SNAKEMAKE PARAMS ##
levels <- snakemake@params[["levels"]]
designmatrix <- snakemake@params[["designmatrix"]]
ref <- snakemake@params[["ref_levels"]]

## CODE ##
# Get normalized counts
norm_counts <- read.table(norm_counts)

# Get differential expression results
diffexp <- read.table(diffexp, header = TRUE, row.names = 1, sep = '\t',
                      blank.lines.skip = FALSE, quote = "")
diffexp <- diffexp[c("baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", "padj")]

# Get design matrix
designmatrix <- read.table(designmatrix, header = TRUE, row.names = 1)

# Keep only significantly expressed genes
diffexp <- diffexp %>% rownames_to_column %>% filter(padj < 0.05)

# Subset the design matrix keeping only samples to be tested
designmatrix <- designmatrix[colnames(norm_counts), , drop = FALSE]

# Set names to reference levels for each column in design matrix
ref <- setNames(ref, colnames(designmatrix))

# Remove '*' prefix from design matrix cells, convert all columns to factors
# and relevel
designmatrix <- designmatrix %>% mutate(across(everything(), factor_relevel, reference = ref))

# Assign a color to each level of each variable
levels_colors <- color_levels(designmatrix)

# Subset the sample names according to the specified levels
designmatrix <- designmatrix %>% filter(condition %in% levels) %>% select(condition)
samples <- rownames(designmatrix)
levels_colors <- list(condition = levels_colors[["condition"]]
                      [names(levels_colors[["condition"]]) %in% levels])

# Get top 25 and bottom 25 DE genes
top <- diffexp %>% arrange(desc(log2FoldChange)) %>%
    top_n(n = 25, wt = log2FoldChange) %>% select(rowname) %>% pull
bottom <- diffexp %>% arrange(log2FoldChange) %>%
    top_n(n = -25, wt = log2FoldChange) %>% select(rowname) %>% pull

# Subset the normalized count matrix and scale by row
norm_counts <- t(apply(norm_counts[c(top, bottom), samples], 1, scale))
colnames(norm_counts) <- samples

# Specify the colors and breaks of the plot
colors <- colorRampPalette(c("blue", "white", "red"))(100)
break_limit <- round_any(max(max(norm_counts), abs(min(norm_counts))), 0.5, ceiling)
breaks <- seq(-break_limit, break_limit, length.out = 101)

# Heatmaps
pdf(snakemake@output[["pdf"]], width = 9, height = 7)
pheatmap(norm_counts, name = "Gene Expression", color = colors, breaks = breaks,
         annotation_col = designmatrix, annotation_colors = levels_colors)
dev.off()

png(snakemake@output[["png"]], width = 9, height = 7, units = "in", res = 600)
pheatmap(norm_counts, name = "Gene Expression", color = colors, breaks = breaks,
         annotation_col = designmatrix, annotation_colors = levels_colors)
dev.off()
```
```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
```
```python
__author__ = "Wibowo Arindrarto"
__copyright__ = "Copyright 2016, Wibowo Arindrarto"
__email__ = "bow@bow.web.id"
__license__ = "BSD"

from snakemake.shell import shell

# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Run log
log = snakemake.log_fmt_shell()

# Input file wrangling
reads = snakemake.input.get("reads")
if isinstance(reads, str):
    input_flags = "-U {0}".format(reads)
elif len(reads) == 1:
    input_flags = "-U {0}".format(reads[0])
elif len(reads) == 2:
    input_flags = "-1 {0} -2 {1}".format(*reads)
else:
    raise RuntimeError(
        "Reads parameter must contain at least 1 and at most 2"
        " input files."
    )

# Executed shell command
shell(
    "(hisat2 {extra} "
    "--threads {snakemake.threads} "
    " -x {snakemake.params.idx} {input_flags} "
    " | samtools view -Sbh -o {snakemake.output[0]} -) "
    " {log}"
)
```
```python
__author__ = "Joël Simoneau"
__copyright__ = "Copyright 2019, Joël Simoneau"
__email__ = "simoneaujoel@gmail.com"
__license__ = "MIT"

import os
from snakemake.shell import shell

# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")

# Allowing for multiple FASTA files
fasta = snakemake.input.get("fasta")
assert fasta is not None, "input-> a FASTA-file or a sequence is required"
input_seq = ""
if not "." in fasta:
    input_seq += "-c "
input_seq += ",".join(fasta) if isinstance(fasta, list) else fasta

hisat_dir = snakemake.params.get("prefix", "")
if hisat_dir:
    os.makedirs(hisat_dir)

shell(
    "hisat2-build {extra} "
    "-p {snakemake.threads} "
    "{input_seq} "
    "{snakemake.params.prefix} "
    "{log}"
)
```
```python
__author__ = "Fabian Kilpert"
__copyright__ = "Copyright 2020, Fabian Kilpert"
__email__ = "fkilpert@gmail.com"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell()

shell(
    "( "
    "seqtk sample "
    "-s {snakemake.params.seed} "
    "{snakemake.input.f1} "
    "{snakemake.params.n} "
    "| pigz -9 -p {snakemake.threads} "
    "> {snakemake.output.f1} "
    "&& "
    "seqtk sample "
    "-s {snakemake.params.seed} "
    "{snakemake.input.f2} "
    "{snakemake.params.n} "
    "| pigz -9 -p {snakemake.threads} "
    "> {snakemake.output.f2} "
    ") {log} "
)
```
```python
__author__ = "Fabian Kilpert"
__copyright__ = "Copyright 2020, Fabian Kilpert"
__email__ = "fkilpert@gmail.com"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell()

shell(
    "( "
    "seqtk sample "
    "-s {snakemake.params.seed} "
    "{snakemake.input} "
    "{snakemake.params.n} "
    "| pigz -9 -p {snakemake.threads} "
    "> {snakemake.output} "
    ") {log} "
)
```
```python
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake.utils import makedirs

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

extra = snakemake.params.get("extra", "")
sjdb_overhang = snakemake.params.get("sjdbOverhang", "100")

gtf = snakemake.input.get("gtf")
if gtf is not None:
    gtf = "--sjdbGTFfile " + gtf
    sjdb_overhang = "--sjdbOverhang " + sjdb_overhang
else:
    gtf = sjdb_overhang = ""

makedirs(snakemake.output)

shell(
    "STAR "  # Tool
    "--runMode genomeGenerate "  # Indexation mode
    "{extra} "  # Optional parameters
    "--runThreadN {snakemake.threads} "  # Number of threads
    "--genomeDir {snakemake.output} "  # Path to output
    "--genomeFastaFiles {snakemake.input.fasta} "  # Path to fasta files
    "{sjdb_overhang} "  # Read-len - 1
    "{gtf} "  # Highly recommended GTF
    "{log}"  # Logging
)
```
```python
import os
import re
from snakemake.shell import shell
import tempfile

__author__ = "Ryan Dale"
__copyright__ = "Copyright 2016, Ryan Dale"
__email__ = "dalerr@niddk.nih.gov"
__license__ = "MIT"

_config = snakemake.params["fastq_screen_config"]

subset = snakemake.params.get("subset", 100000)
aligner = snakemake.params.get("aligner", "bowtie2")
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell()

# snakemake.params.fastq_screen_config can be either a dict or a string. If
# string, interpret as a filename pointing to the fastq_screen config file.
# Otherwise, create a new tempfile out of the contents of the dict:
if isinstance(_config, dict):
    tmp = tempfile.NamedTemporaryFile(delete=False).name
    with open(tmp, "w") as fout:
        for label, indexes in _config["database"].items():
            for aligner, index in indexes.items():
                fout.write(
                    "\t".join(["DATABASE", label, index, aligner.upper()]) + "\n"
                )
        for aligner, path in _config["aligner_paths"].items():
            fout.write("\t".join([aligner.upper(), path]) + "\n")
    config_file = tmp
else:
    config_file = _config

# fastq_screen hard-codes filenames according to this prefix. We will send
# hard-coded output to a temp dir, and then move them later.
prefix = re.split(".fastq|.fq|.txt|.seq", os.path.basename(snakemake.input[0]))[0]

tempdir = tempfile.mkdtemp()

shell(
    "fastq_screen --outdir {tempdir} "
    "--force "
    "--aligner {aligner} "
    "--conf {config_file} "
    "--subset {subset} "
    "--threads {snakemake.threads} "
    "{extra} "
    "{snakemake.input[0]} "
    "{log}"
)

# Move output to the filenames specified by the rule
shell("mv {tempdir}/{prefix}_screen.txt {snakemake.output.txt}")
shell("mv {tempdir}/{prefix}_screen.png {snakemake.output.png}")

# Clean up temp
shell("rm -r {tempdir}")
if isinstance(_config, dict):
    shell("rm {tmp}")
```
```python
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2021, Filipe G. Vieira"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

java_opts = get_java_opts(snakemake)
extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "in={}".format(snakemake.input.sample)
    trimmed = "out={}".format(snakemake.output.trimmed)
else:
    reads = "in={} in2={}".format(*snakemake.input.sample)
    trimmed = "out={} out2={}".format(*snakemake.output.trimmed)

singleton = snakemake.output.get("singleton", "")
if singleton:
    singleton = f"outs={singleton}"

discarded = snakemake.output.get("discarded", "")
if discarded:
    discarded = f"outm={discarded}"

stats = snakemake.output.get("stats", "")
if stats:
    stats = f"stats={stats}"

shell(
    "bbduk.sh {java_opts} t={snakemake.threads} "
    "{reads} "
    "{adapters} "
    "{extra} "
    "{trimmed} {singleton} {discarded} "
    "{stats} "
    "{log}"
)
```
Support
- Future updates