Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This workflow performs a differential expression analysis with Kallisto and Sleuth .
Authors
-
Johannes Köster (https://koesterlab.github.io)
-
David Lähnemann
Usage
In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
Step 1: Obtain a copy of this workflow
-
Create a new github repository using this workflow as a template .
-
Clone the newly created repository to your local system, into the place where you want to perform the data analysis.
Step 2: Configure workflow
Configure the workflow according to your needs via editing the file
config/config.yaml
, as well as the sample sheet
config/samples.tsv
, and the unit sheet
config/units.tsv
.
Step 3: Execute workflow
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using
$N
cores or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
If you not only want to fix the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above. See the Snakemake documentation for further details.
Step 4: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:
snakemake --report report.html
This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here .
Step 5: Commit changes
Whenever you change something, don't forget to commit the changes back to your github copy of the repository:
git commit -a
git push
Step 6: Obtain updates from upstream
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.
-
Once, register the upstream repository in your local copy:
git remote add -f upstream git@github.com:snakemake-workflows/dna-seq-varlociraptor.git
orgit remote add -f upstream https://github.com/snakemake-workflows/dna-seq-varlociraptor.git
if you do not have setup ssh keys. -
Update the upstream version:
git fetch upstream
. -
Create a diff with the current version:
git diff HEAD upstream/master workflow > upstream-changes.diff
. -
Investigate the changes:
vim upstream-changes.diff
. -
Apply the modified diff via:
git apply upstream-changes.diff
. -
Carefully check whether you need to update the config files:
git diff HEAD upstream/master config
. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
Step 7: Contribute back
In case you have also changed or added steps, please consider contributing them back to the original repository:
-
Fork the original repo to a personal or lab account.
-
Clone the fork to your local system, to a different place than where you ran your analysis.
-
Copy the modified files from your analysis to the clone of your fork, e.g.,
cp -r workflow path/to/fork
. Make sure to not accidentally copy config file contents or sample sheets. -
Adapt the example config files if needed.
-
Commit and push your changes to your fork.
-
Create a pull request against the original repository.
Testing
Test cases are in the subfolder
.test
and are automtically executed via continuous integration with
Github Actions
. As
creating a workflow-repository from the template
does not include the submodule with the testing data, getting the GitHub Actions tests to pass requires adding the submodule:
cd .test
git submodule add -f https://github.com/snakemake-workflows/ngs-test-data.git
git commit -m "add ngs-test-data submodule to get GitHub Actions tests to pass"
Code Snippets
13 14 15 16 17 18 19 | run: samples_ = units[["sample", "unit"]].merge(samples, on="sample") samples_["sample"] = samples_.apply( lambda row: "{}-{}".format(row["sample"], row["unit"]), axis=1) samples_["path"] = kallisto_output del samples_["unit"] samples_.to_csv(output[0], sep="\t") |
44 45 | script: "../scripts/sleuth-init.R" |
84 85 | script: "../scripts/sleuth-diffexp.R" |
101 102 | script: "../scripts/ihw-fdr-control.R" |
124 125 | script: "../scripts/plot-bootstrap.R" |
139 140 | script: "../scripts/plot-pca.R" |
155 156 | script: "../scripts/plot-diffexp-heatmap.R" |
170 171 | script: "../scripts/plot-diffexp-pval-hist.R" |
183 184 | script: "../scripts/sleuth-to-matrix.R" |
195 196 | script: "../scripts/plot-group-density.R" |
209 210 | script: "../scripts/plot-scatter.R" |
221 222 | script: "../scripts/plot-fld.R" |
236 237 | script: "../scripts/plot-variances.R" |
12 13 | shell: "conda create --quiet --yes -p {params.path} --channel bioconda bioconductor-{wildcards.package}={params.version}" |
40 41 | script: "../scripts/spia.R" |
83 84 | script: "../scripts/fgsea.R" |
106 107 | script: "../scripts/plot-fgsea-gene-sets.R" |
122 123 | script: "../scripts/ens_gene_to_go.R" |
135 136 | shell: "( curl --silent -o {output} {params.download} ) 2> {log}" |
165 166 | script: "../scripts/goatools-go-enrichment-analysis.py" |
10 11 | shell: "kallisto index -i {output} {input} 2> {log}" |
39 40 41 | shell: "kallisto quant -i {input.idx} -o {output} " "{params.extra} {input.fq} 2> {log}" |
12 13 | wrapper: "0.31.1/bio/cutadapt/pe" |
26 27 | wrapper: "0.31.1/bio/cutadapt/se" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") # provides `tidyverse` and load_bioconductor_package() source( file.path(snakemake@scriptdir, 'common.R') ) pkg <- snakemake@params[["bioc_pkg"]] load_bioconductor_package(snakemake@input[["species_anno"]], pkg) ens_gene_to_go <- AnnotationDbi::select( get(pkg), keys=keys(get(pkg), keytype="ENSEMBL"), columns=c("GO"), keytype="ENSEMBL" ) %>% # rows with empty ENSEMBL or GO IDs are useless and will lead to trouble below drop_na(ENSEMBL,GO) %>% dplyr::rename(ensembl_gene_id = ENSEMBL) %>% group_by(ensembl_gene_id) %>% summarise(go_ids = str_c(GO, collapse = ";")) write_tsv(ens_gene_to_go, path = snakemake@output[[1]], col_names = FALSE) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("fgsea") # provides library("tidyverse") and functions load_bioconductor_package() and # get_prefix_col(), the latter requires snakemake@output[["samples"]] and # snakemake@params[["covariate"]] source( file.path(snakemake@scriptdir, 'common.R') ) pkg <- snakemake@params[["bioc_pkg"]] load_bioconductor_package(snakemake@input[["species_anno"]], pkg) gene_sets <- gmtPathways(snakemake@input[["gene_sets"]]) diffexp <- read_tsv(snakemake@input[["diffexp"]]) %>% drop_na(ext_gene) %>% mutate(ext_gene = str_to_upper(ext_gene)) %>% # resolve multiple occurences of the same ext_gene, usually # meaning that several ENSEMBLE genes map to the same gene # symbol -- so we may loose resolution here, as long as gene # symbols are used group_by(ext_gene) %>% filter( qval == min(qval, na.rm = TRUE) ) %>% filter( pval == min(pval, na.rm = TRUE) ) %>% # for the case of min(qval) == 1 and min(pval) == 1, we # need something else to select a unique entry per gene filter( mean_obs == max(mean_obs, na.rm = TRUE) ) %>% mutate(target_id = str_c(target_id, collapse=",")) %>% mutate(ens_gene = str_c(ens_gene, collapse=",")) %>% distinct() signed_pi <- get_prefix_col("signed_pi_value", colnames(diffexp)) ranked_genes <- diffexp %>% dplyr::select(ext_gene, !!signed_pi) %>% deframe() # get and write out rank values that are tied -- a way to check up on respecitve warnings rank_ties <- enframe(ranked_genes) %>% mutate(dup = duplicated(value) | duplicated(value, fromLast = TRUE) ) %>% filter(dup == TRUE) %>% dplyr::select(ext_gene = name, !!signed_pi := value) write_tsv(rank_ties, snakemake@output[["rank_ties"]]) fgsea_res <- fgsea(pathways = gene_sets, stats = ranked_genes, minSize=10, maxSize=700, nproc=snakemake@threads, nperm=snakemake@params[["nperm"]] ) %>% as_tibble() # Annotation is impossible without any entries, so then just write out empty files if ( (fgsea_res %>% count() %>% pull(n)) == 0 ) { write_tsv(fgsea_res, path = snakemake@output[["enrichment"]]) write_tsv(fgsea_res, path = snakemake@output[["significant"]]) } else { # add further annotation annotated <- fgsea_res %>% unnest(leadingEdge) %>% mutate( leading_edge_symbol = str_to_title(leadingEdge), leading_edge_entrez_id = mapIds(leading_edge_symbol, x=get(pkg), keytype="SYMBOL", column="ENTREZID"), leading_edge_ens_gene = mapIds(leading_edge_symbol, x=get(pkg), keytype="SYMBOL", column="ENSEMBL") ) %>% group_by(pathway) %>% summarise( leadingEdge = str_c(leadingEdge, collapse = ','), leading_edge_symbol = str_c(leading_edge_symbol, collapse = ','), leading_edge_entrez_id = str_c(leading_edge_entrez_id, collapse = ','), leading_edge_ens_gene = str_c(leading_edge_ens_gene, collapse = ',') ) %>% inner_join(fgsea_res %>% dplyr::select(-leadingEdge), by = "pathway") %>% dplyr::select(-leadingEdge, -leading_edge_symbol, -leading_edge_entrez_id, -leading_edge_ens_gene, leading_edge_symbol, leading_edge_ens_gene, leading_edge_entrez_id, leadingEdge) # write out fgsea results for all gene sets write_tsv(annotated, path = snakemake@output[["enrichment"]]) # select significant pathways sig_gene_sets <- annotated %>% filter( padj < snakemake@params[["gene_set_fdr"]] ) # write out fgsea results for gene sets found to be significant write_tsv(sig_gene_sets, path = snakemake@output[["significant"]]) } height = .7 * (length(gene_sets) + 2) # table plot of all gene sets tg <- plotGseaTable( pathway = gene_sets, stats = ranked_genes, fgseaRes = fgsea_res, gseaParam = 1, render = FALSE ) ggsave(filename = snakemake@output[["plot"]], plot = tg, width = 12, height = height, limitsize=FALSE) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 | import sys sys.stderr = open(snakemake.log[0], "w") import pandas as pd import matplotlib.pyplot as plt from goatools.obo_parser import GODag from goatools.anno.idtogos_reader import IdToGosReader from goatools.goea.go_enrichment_ns import GOEnrichmentStudyNS from goatools.godag_plot import plot_results#, plot_goid2goobj, plot_gos # read in directed acyclic graph of GO terms / IDs obodag = GODag(snakemake.input.obo) # read in mapping gene ids from input to GO terms / IDs objanno = IdToGosReader(snakemake.input.ens_gene_to_go, godag = obodag) # extract namespace(?) -> id2gos mapping ns2assoc = objanno.get_ns2assc() for nspc, id2gos in ns2assoc.items(): print("{NS} {N:,} annotated genes".format(NS=nspc, N=len(id2gos))) # read gene diffexp table all_genes = pd.read_table(snakemake.input.diffexp) # select genes significantly differentially expressed according to BH FDR of sleuth fdr_level_gene = float(snakemake.params.gene_fdr) sig_genes = all_genes[all_genes['qval']<fdr_level_gene] # initialize GOEA object fdr_level_go_term = float(snakemake.params.go_term_fdr) goeaobj = GOEnrichmentStudyNS( # list of 'population' of genes looked at in total pop = all_genes['ens_gene'].tolist(), # geneid -> GO ID mapping ns2assoc = ns2assoc, # ontology DAG godag = obodag, propagate_counts = False, # multiple testing correction method (fdr_bh is false discovery rate control with Benjamini-Hochberg) methods = ['fdr_bh'], # significance cutoff for method named above alpha = fdr_level_go_term ) goea_results_all = goeaobj.run_study(sig_genes['ens_gene'].tolist()) # write results to text file goeaobj.wr_tsv(snakemake.output.enrichment, goea_results_all) # plot results ensembl_id_to_symbol = dict(zip(all_genes['ens_gene'], all_genes['ext_gene'])) # from first plot output file name, create generic file name to trigger # separate plots for each of the gene ontology name spaces outplot_generic = snakemake.output.plot[0].replace('_BP.','_{NS}.', 1).replace('_CC.','_{NS}.', 1).replace('_MF.', '_{NS}.', 1) goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < fdr_level_go_term] plot_results( outplot_generic, # use pvals for coloring goea_results=goea_results_sig, # print general gene symbol instead of Ensembl ID id2symbol=ensembl_id_to_symbol, # number of genes to print, or True for printing all (including count of genes) study_items=True, # number of genes to print per line items_p_line=6, # p-values to use for coloring of GO term nodes (argument name determined from code and testing against value "p_uncorrected") pval_name="p_fdr_bh" ) # for all name spaces for ns in ns2assoc.keys(): # check if no GO terms were found to be significant if len([r for r in goea_results_sig if r.NS == ns]) == 0: fig = plt.figure(figsize=(12, 2)) text = fig.text(0.5, 0.5, "No plot generated, because no GO terms were found significant\n" "for name space {} and significance levels: genes ({}), GO terms ({}).\n" "You might want to check those levels and/or your intermediate data.".format( ns, fdr_level_gene, fdr_level_go_term), ha='center', va='center', size=20) fig.savefig( outplot_generic.replace('_{NS}.', "_{}.".format(ns)) ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("tidyverse") library("ggpubr") library("IHW") number_of_groups <- 7 gene_data <- read_tsv(snakemake@input[[1]]) level_name <- snakemake@wildcards[["level"]] # calculate covariate mean_obs for gene.aggregated data if(level_name == "genes-aggregated") { gene_data <- gene_data %>% mutate(mean_obs = if_else(num_aggregated_transcripts > 0, sum_mean_obs_counts / num_aggregated_transcripts, 0), ens_gene = target_id) } # determine the appropriate number of groups for grouping in ihw calculation tested_number_of_groups <- number_of_groups ihw_test_grouping <- function(x){ tryCatch( expr = { ihw(pval ~ mean_obs, data = gene_data, alpha = 0.1, nbins = x) return(TRUE) }, error = function(e) { print(str_c("Number of groups was set too hight, trying grouping by ", tested_number_of_groups - 1, " groups.")) return(FALSE) }) } while(!ihw_test_grouping(tested_number_of_groups) && tested_number_of_groups > 0){ tested_number_of_groups <- tested_number_of_groups -1 ihw_test_grouping(tested_number_of_groups) } if (tested_number_of_groups < number_of_groups) { print(str_c("The chosen number of groups for ", level_name, " dataset was too large, instead IHW was calculated on ", tested_number_of_groups, " groups.")) number_of_groups <- tested_number_of_groups } gene_data <- gene_data %>% drop_na(pval, mean_obs) %>% select(ens_gene, ext_gene, pval, mean_obs) %>% mutate(grouping = groups_by_filter(mean_obs, number_of_groups)) ### diagnostic plots for covariate and grouping #dispersion plot dispersion <- ggplot(gene_data, aes(x = percent_rank(mean_obs), y = -log10(pval))) + geom_point() + ggtitle("IHW: scatter plot of p-values vs. mean of counts") + theme(plot.title = element_text(size = 12, face = "bold", hjust = 0.5)) + xlab("percent rank of mean_obs") + ylab(expression(-log[10](p-value))) ggsave(snakemake@output[["dispersion"]], dispersion) #histograms hist_overall <- ggplot(gene_data, aes(x = pval)) + geom_histogram(binwidth = 0.025, boundary = 0) + xlab("p-values without grouping") + ylab("density") hist_groups <- ggplot(gene_data, aes(x = pval)) + geom_histogram(binwidth = 0.025, boundary = 0) + xlab("p-values of the individual groups") + ylab("density") + facet_wrap(~ grouping, nrow = ceiling(number_of_groups / 4)) plots_agg_mean_obs <- ggarrange(hist_overall, hist_groups, nrow = 1) histograms <- annotate_figure(plots_agg_mean_obs, top = text_grob("IHW: histograms for p-values of mean of counts", color = "black", face = "bold", size = 12)) ggexport(histograms, filename = snakemake@output[["histograms"]], width=14) # ihw calculation ihw_results_mean <- ihw(pval ~ mean_obs, data = gene_data, alpha = 0.1, nbins = tested_number_of_groups) # merging ens_gene-IDs and ext_gene-names ihw_mean <- as.data.frame(ihw_results_mean) %>% # TODO remove ugly hack if ihw in future allows annotation columns (feature requested here: https://support.bioconductor.org/p/129972/) right_join(gene_data, by = c(pvalue = "pval", covariate = "mean_obs", group = "grouping")) %>% unique() %>% select(ens_gene, ext_gene, everything()) write_tsv(ihw_mean, snakemake@output[["results"]]) ### diagnostic plots for ihw calculation # plot of trends of the covariate plot_trend_mean <- plot(ihw_results_mean) + ggtitle("IHW: Plot of trends of mean of counts") + theme(plot.title = element_text(size=12)) ggsave(snakemake@output[["trends"]], plot_trend_mean) # plots of decision boundaries for unweighted p-values as a function of the covariate plot_decision_mean <- plot(ihw_results_mean, what = "decisionboundary") + ggtitle("IHW: Decision boundaries for unweighted p-values vs. mean of counts") + theme(plot.title = element_text(size=12)) ggsave(snakemake@output[["decision"]], plot_decision_mean) # p-values vs. adusted p-values plot_ihw_pval_mean <- ggplot(ihw_mean, aes(x = pvalue, y = adj_pvalue, col = group)) + geom_point(size = 0.25) + ggtitle("IHW: raw p-values vs. adusted p-values from ihw-analysis") + theme(plot.title = element_text(size=12)) + scale_colour_hue(l = 70, c = 150, drop = FALSE) ggsave(snakemake@output[["adj_pvals"]], plot_ihw_pval_mean) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("tidyverse") library("sleuth") so <- sleuth_load(snakemake@input[["so"]]) top_n <- -strtoi(snakemake@params["top_n"]) results <- read_tsv(snakemake@input[["transcripts"]]) top_genes <- results %>% filter(qval <= snakemake@params[["fdr"]]) %>% top_n(top_n, qval) %>% dplyr::select(ext_gene) %>% drop_na() %>% distinct(ext_gene) genes_of_interest <- tibble( ext_gene = snakemake@params[["genes"]]) %>% distinct(ext_gene) genes <- full_join(top_genes, genes_of_interest, by = 'ext_gene') %>% add_row(ext_gene = "Custom") %>% pull(ext_gene) dir.create( snakemake@output[[1]] ) for (gene in genes) { transcripts <- results %>% filter(ext_gene == gene) %>% drop_na() %>% pull(target_id) if ( length( transcripts > 0 ) ) { for (transcript in transcripts) { plot_bootstrap(so, transcript, color_by = snakemake@params[["color_by"]], units = "tpm") ggsave(file = str_c(snakemake@output[[1]], "/", gene, ".", transcript, ".", snakemake@wildcards[["model"]] , ".bootstrap.pdf")) } } } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") so <- sleuth_load(snakemake@input[["so"]]) diffexp <- read.table(snakemake@input[["diffexp"]], sep = "\t", header = TRUE) pdf(file = snakemake@output[[1]], width = 14) plot_transcript_heatmap(so, transcripts = diffexp$target_id[1:20], main="Top 20 differentially expressed transcripts (TPM)") # TODO, once pachterlab/sleuth#214 is merged, add this to get gene names # , labels_row = diffexp$ext_gene[1:20]) dev.off() |
1 2 3 4 5 6 7 8 9 10 11 12 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") library("ggplot2") library("tidyr") diffexp <- sleuth_load(snakemake@input[["diffexp_rds"]]) %>% drop_na(pval) ggplot(diffexp) + geom_histogram(aes(pval), bins = 100) ggsave(file = snakemake@output[[1]], width = 14) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("fgsea") # provides library("tidyverse") and function get_prefix_col() # the latter requires snakemake@output[["samples"]] and # snakemake@params[["covariate"]] source( file.path(snakemake@scriptdir, 'common.R') ) gene_sets <- gmtPathways(snakemake@input[["gene_sets"]]) sig_gene_sets <- read_tsv(snakemake@input[["sig_gene_sets"]]) diffexp <- read_tsv(snakemake@input[["diffexp"]]) %>% drop_na(ext_gene) %>% mutate(ext_gene = str_to_upper(ext_gene)) %>% group_by(ext_gene) %>% filter( qval == min(qval, na.rm = TRUE) ) %>% mutate(target_id = str_c(target_id, collapse=",")) %>% mutate(ens_gene = str_c(ens_gene, collapse=",")) %>% distinct() signed_pi <- get_prefix_col("signed_pi_value", colnames(diffexp)) ranked_genes <- diffexp %>% dplyr::select(ext_gene, !!signed_pi) %>% deframe() dir.create( snakemake@output[[1]] ) for ( set in names(gene_sets) ) { # plot gene set enrichment p <- plotEnrichment(gene_sets[[set]], ranked_genes) + ggtitle(str_c("gene set: ", set), subtitle = str_c( sig_gene_sets %>% filter(pathway == set) %>% pull(size)," genes; ", "p-value (BH-adjusted): ", sig_gene_sets %>% filter(pathway == set) %>% pull(padj), "\n", "normalized enrichment score (NES):", sig_gene_sets %>% filter(pathway == set) %>% pull(NES) ) ) + xlab("gene rank") + theme_bw( base_size = 16 ) ggsave( file = str_c( snakemake@output[[1]], "/", snakemake@wildcards[["model"]], ".", set, ".gene-set-plot.pdf"), width = 10, height = 7 ) } |
1 2 3 4 5 6 7 8 9 10 11 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") so <- sleuth_load(snakemake@input[[1]]) pdf(file = snakemake@output[[1]]) plot_fld(so, paste0(snakemake@wildcards[["sample"]], "-", snakemake@wildcards[["unit"]])) dev.off() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") library("tidyverse") so <- sleuth_load(snakemake@input[[1]]) plot_group_density(so, use_filtered = TRUE, units = "est_counts", trans = "log", grouping = setdiff(colnames(so$sample_to_covariates), "sample"), offset = 1) ggsave(snakemake@output[[1]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") library("ggpubr") so <- sleuth_load(snakemake@input[[1]]) #principal components pc <- 4 # plot pca so <- sleuth_load(snakemake@input[[1]]) pc12 <- plot_pca(so, color_by = snakemake@wildcards[["covariate"]], units = "est_counts", text_labels = TRUE) pc34 <- plot_pca(so, pc_x = 3, pc_y = 4, color_by = snakemake@wildcards[["covariate"]], units = "est_counts", text_labels = TRUE) ggarrange(pc12, pc34, ncol = 2, common.legend = TRUE) ggsave(snakemake@output[["pca"]], width=14) # plot pc variance plot_pc_variance(so, use_filtered = TRUE, units = "est_counts", pca_number = pc) ggsave(snakemake@output[["pc_var"]], width=14) # plot loadings pc_loading_plots <- list() for(i in 1:pc) { pc_loading_plots[[paste0("pc", i, "_loading")]] <- plot_loadings(so, scale = TRUE, pc_input = i, pc_count = 10, units = "est_counts") + ggtitle(paste0(snakemake@wildcards[["covariate"]], ": plot loadings for principal component ", i)) + theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5)) } plots_loading <- ggarrange(plotlist = pc_loading_plots, ncol = 1, nrow = 1, common.legend = TRUE) ggexport(plots_loading, filename = snakemake@output[["loadings"]], width = 14) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") library("tidyverse") library("ggpubr") so <- sleuth_load(snakemake@input[[1]]) plot_list <- list() # combinations between samples without repetition # sample_matrix <- t(combn(so$sample_to_covariates$sample, 2)) # combinations <- as.numeric(row(sample_matrix)) # # for(i in combinations) { # sample1 <- sample_matrix[i, 1] # sample2 <- sample_matrix[i, 2] # # plot_list[[i]] <- plot_scatter(so, # sample_x = sample1, # sample_y = sample2, # use_filtered = TRUE, # units = "est_counts", # point_alpha = 0.2, # xy_line = TRUE, # xy_line_color = "red") + # labs(x = sample1, y = sample2) # } # all combinations between samples (with repitition) samples <- so$sample_to_covariates$sample sample_matrix <- crossing(x = samples, y = samples) # creates all combinations between samples combinations <- as.numeric(row(sample_matrix)) for(i in combinations) { sample1 <- sample_matrix[i, 1] sample2 <- sample_matrix[i, 2] if(sample1 != sample2) { plot_list[[i]] <- plot_scatter(so, sample_x = sample1, sample_y = sample2, use_filtered = TRUE, units = "est_counts", point_alpha = 0.2, xy_line = TRUE, xy_line_color = "red") + labs(x = sample1, y = sample2) } else { x_val <- y_val <- c(0,0) test <- tibble(x_val, y_val) plot_list[[i]] <- ggplot(test, aes(x = x_val, y = y_val)) + theme_void() + geom_text(label = "") + annotate("text", label = sample1, x=0, y=0, colour = "blue", fontface = "bold") } } all_plots <- ggarrange(plotlist = plot_list) plot_figure <- annotate_figure(all_plots, top = text_grob("log transformed scatter plots of different samples", color = "black", face = "bold", size = 14)) ggsave(snakemake@output[[1]], plot_figure) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") sl <- snakemake@params[['sig_level']] so <- sleuth_load(snakemake@input[[1]]) # colors from colorblind-safe Brewer palette "Dark2": # http://colorbrewer2.org/#type=qualitative&scheme=Dark2&n=3 cb_safe_green <- "#1B9E77" cb_safe_red <- "#D95F02" p <- plot_vars(so, test = NULL, test_type = "lrt", which_model = "full", sig_level = sl, point_alpha = 0.2, sig_color = cb_safe_red, xy_line = TRUE, xy_line_color = cb_safe_red, highlight = NULL, highlight_color = cb_safe_green ) ggsave(filename = snakemake@output[[1]], width = 7, height = 7) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") library("tidyverse") library("fs") library("grid") library("gridExtra") model <- snakemake@params[["model"]] sleuth_object <- sleuth_load(snakemake@input[[1]]) sleuth_object <- sleuth_fit(sleuth_object, as.formula(model[["full"]]), 'full') sleuth_object <- sleuth_fit(sleuth_object, as.formula(model[["reduced"]]), 'reduced') sleuth_object <- sleuth_lrt(sleuth_object, "reduced", "full") # plot mean-variance mean_var_plot <- plot_mean_var(sleuth_object, which_model = "full", point_alpha = 0.4, point_size = 2, point_colors = c("black", "dodgerblue"), smooth_alpha = 1, smooth_size = 0.75, smooth_color = "red") ggsave(snakemake@output[["mean_var_plot"]], mean_var_plot) write_results <- function(so, mode, output, output_all) { so$pval_aggregate <- FALSE if(mode == "aggregate") { # workaround the following bug-request: # https://github.com/pachterlab/sleuth/pull/240 # TODO renaming can be removed when fixed g_col <- so$gene_column so$gene_column <- NULL so$pval_aggregate <- TRUE so$gene_column <- g_col } plot_model <- snakemake@wildcards[["model"]] # list for qq-plots to make a multipage pdf-file as output qq_list <- list() print("Performing likelihood ratio test") all <- sleuth_results(so, "reduced:full", "lrt", show_all = TRUE, pval_aggregate = so$pval_aggregate) %>% arrange(pval) covariates <- colnames(design_matrix(so, "full")) covariates <- covariates[covariates != "(Intercept)"] # iterate over all covariates and perform wald test in order to obtain beta estimates if(!so$pval_aggregate) { # lists for volcano and ma-plots to make a multipage pdf-file as output volcano_list <- list() ma_list <- list() for(covariate in covariates) { print(str_c("Performing wald test for ", covariate)) so <- sleuth_wt(so, covariate, "full") volc_plot_title <- str_c(plot_model, ": volcano plot for ", covariate) ma_plot_title <- str_c(plot_model, ": ma-plot for ", covariate) qq_plot_title <- str_c(plot_model, ": qq-plot from wald test for ", covariate) # volcano plot print(str_c("Performing volcano plot for ", covariate)) volcano <- plot_volcano(so, covariate, "wt", "full", sig_level = snakemake@params[["sig_level_volcano"]], point_alpha = 0.2, sig_color = "red", highlight = NULL) + ggtitle(volc_plot_title) + theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5)) volcano_list[[volc_plot_title]] <- volcano # ma-plot print(str_c("Performing ma-plot for ", covariate)) ma_plot <- plot_ma(so, covariate, "wt", "full", sig_level = snakemake@params[["sig_level_ma"]], point_alpha = 0.2, sig_color = "red", highlight = NULL, highlight_color = "green") + ggtitle(ma_plot_title) + theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5)) ma_list[[ma_plot_title]] <- ma_plot # qq-plots from wald test print(str_c("Performing qq-plot from wald test for ", covariate)) qq_plot <- plot_qq(so, covariate, "wt", "full", sig_level = snakemake@params[["sig_level_qq"]], point_alpha = 0.2, sig_color = "red", highlight = NULL, highlight_color = "green", line_color = "blue") + ggtitle(qq_plot_title) + theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5)) qq_list[[qq_plot_title]] <- qq_plot beta_col_name <- str_c("b", covariate, sep = "_") beta_se_col_name <- str_c(beta_col_name, "se", sep = "_") all_wald <- sleuth_results(so, covariate, "wt", show_all = TRUE, pval_aggregate = FALSE) %>% dplyr::select( target_id = target_id, !!beta_col_name := b, !!beta_se_col_name := se_b) signed_pi_col_name <- str_c("signed_pi_value", covariate, sep = "_") all <- inner_join(all, all_wald, by = "target_id") %>% # calculate a signed version of the pi-value from: # https://dx.doi.org/10.1093/bioinformatics/btr671 # e.g. useful for GSEA ranking mutate( !!signed_pi_col_name := -log10(pval) * !!sym(beta_col_name) ) } # saving volcano plots marrange_volcano <- marrangeGrob(grobs=volcano_list, nrow=1, ncol=1, top = NULL) ggsave(snakemake@output[["volcano_plots"]], plot = marrange_volcano, width = 14) # saving ma-plots marrange_ma <- marrangeGrob(grobs=ma_list, nrow=1, ncol=1, top = NULL) ggsave(snakemake@output[["ma_plots"]], plot = marrange_ma, width = 14) } if(mode == "mostsignificant") { # for each gene, select the most significant transcript (this is equivalent to sleuth_gene_table, but with bug fixes) all <- all %>% drop_na(ens_gene) %>% group_by(ens_gene) %>% filter( qval == min(qval, na.rm = TRUE) ) %>% # ties in qval (e.g. min(qval) == 1) can lead to multiple entries per gene filter( pval == min(pval, na.rm = TRUE) ) %>% # for min(qval) == 1, then usually also min(pval) == 1, and # we need something else to select a unique entry per gene filter( mean_obs == max(mean_obs, na.rm = TRUE) ) %>% # sometimes we still get two transcript variants with exactly # the same values, i.e. they have exactly the same sequence # but (slightly) different annotations -- then we retain a string # with a comma-separated list of them mutate(target_id = str_c(target_id, collapse=",")) %>% distinct() %>% # useful sort for scrolling through output by increasing q-values arrange(qval) # qq-plot from likelihood ratio test print(str_c("Performing qq-plot from likelihood ratio test")) qq_plot_title_trans <- str_c(plot_model, ": qq-plot from likelihood ratio test") qq_plot_trans <- plot_qq(so, test = 'reduced:full', test_type = 'lrt', sig_level = snakemake@params[["sig_level_qq"]], point_alpha = 0.2, sig_color = "red", highlight = NULL, highlight_color = "green", line_color = "blue") + ggtitle(qq_plot_title_trans) + theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5)) qq_list[[qq_plot_title_trans]] <- qq_plot_trans } # saving qq-plots marrange_qq <- marrangeGrob(grobs=qq_list, nrow=1, ncol=1, top = NULL) ggsave(snakemake@output[["qq_plots"]], plot = marrange_qq, width = 14) write_tsv(all, path = output, quote_escape = "none") write_rds(all, path = output_all, compress = "none") } write_results(sleuth_object, "transcripts", snakemake@output[["transcripts"]], snakemake@output[["transcripts_rds"]]) write_results(sleuth_object, "aggregate", snakemake@output[["genes_aggregated"]], snakemake@output[["genes_aggregated_rds"]]) write_results(sleuth_object, "mostsignificant", snakemake@output[["genes_mostsigtrans"]], snakemake@output[["genes_mostsigtrans_rds"]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") library("biomaRt") library("tidyverse") model <- snakemake@params[["model"]] # this variable holds a mirror name until # useEnsembl succeeds ("www" is last, because # of very frequent "Internal Server Error"s) mart <- "useast" rounds <- 0 while ( class(mart)[[1]] != "Mart" ) { mart <- tryCatch( { # done here, because error function does not # modify outer scope variables, I tried if (mart == "www") rounds <- rounds + 1 # equivalent to useMart, but you can choose # the mirror instead of specifying a host biomaRt::useEnsembl( biomart = "ENSEMBL_MART_ENSEMBL", dataset = str_c(snakemake@params[["species"]], "_gene_ensembl"), mirror = mart ) }, error = function(e) { # change or make configurable if you want more or # less rounds of tries of all the mirrors if (rounds >= 3) { stop( str_c( "Have tried all 4 available Ensembl biomaRt mirrors ", rounds, " times. You might have a connection problem, or no mirror is responsive." ) ) } # hop to next mirror mart <- switch(mart, useast = "uswest", uswest = "asia", asia = "www", www = { # wait before starting another round through the mirrors, # hoping that intermittent problems disappear Sys.sleep(30) "useast" } ) } ) } t2g <- biomaRt::getBM( attributes = c( "ensembl_transcript_id", "ensembl_gene_id", "external_gene_name"), mart = mart, useCache = FALSE ) %>% rename( target_id = ensembl_transcript_id, ens_gene = ensembl_gene_id, ext_gene = external_gene_name ) samples <- read_tsv(snakemake@input[["samples"]], na = "", col_names = TRUE) %>% # make everything except the index, sample name and path string a factor mutate_at( vars(-X1, -sample, -path), list(~factor(.)) ) if(!is.null(snakemake@params[["exclude"]])) { samples <- samples %>% filter( !sample %in% snakemake@params[["exclude"]] ) } if(!is.null(model)) { # retrieve the model formula formula <- as.formula(model) # extract variables from the formula and unnest any nested variables variables <- labels(terms(formula)) %>% strsplit('[:*]') %>% unlist() # remove samples with an NA value in any of the columns # relevant for sleuth under the current model samples <- samples %>% drop_na( c( sample, path, all_of(variables) ) ) } so <- sleuth_prep( samples, extra_bootstrap_summary = TRUE, target_mapping = t2g, aggregation_column = "ens_gene", read_bootstrap_tpm = TRUE, transform_fun_counts = function(x) log2(x + 0.5), num_cores = snakemake@threads ) custom_transcripts <- so$obs_raw %>% # find transcripts not in the target_mapping filter(!target_id %in% so$target_mapping$target_id) %>% # make it a succinct list with no repititions distinct(target_id) %>% # pull it out into a vector pull(target_id) if(!length(custom_transcripts) == 0) { so$target_mapping <- so$target_mapping %>% # add those custom transcripts as rows to the target mapping add_row(ens_gene = NA, ext_gene = "Custom", target_id = custom_transcripts) } sleuth_save(so, snakemake@output[[1]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") so <- sleuth_load(snakemake@input[[1]]) tpm <- sleuth_to_matrix(so, "obs_norm", "tpm") target_mapping <- so$target_mapping rownames(target_mapping) <- target_mapping$target_id tpm <- cbind(ext_gene = target_mapping[rownames(tpm), "ext_gene"], tpm) write.table(tpm, file = snakemake@output[[1]], col.names = NA, row.names = TRUE) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("SPIA") library("graphite") # provides library("tidyverse") and functions load_bioconductor_package() and # get_prefix_col(), the latter requires snakemake@output[["samples"]] and # snakemake@params[["covariate"]] source( file.path(snakemake@scriptdir, 'common.R') ) pkg <- snakemake@params[["bioc_pkg"]] load_bioconductor_package(snakemake@input[["species_anno"]], pkg) pw_db <- snakemake@params[["pathway_db"]] db <- pathways(snakemake@params[["species"]], pw_db) db <- convertIdentifiers(db, "ENSEMBL") options(Ncpus = snakemake@threads) diffexp <- read_tsv(snakemake@input[["diffexp"]]) %>% drop_na(ens_gene) %>% mutate(ens_gene = str_c("ENSEMBL:", ens_gene)) universe <- diffexp %>% pull(var = ens_gene) sig_genes <- diffexp %>% filter(qval <= 0.05) # get logFC equivalent (the sum of beta scores of covariates of interest) beta_col <- get_prefix_col("b", colnames(sig_genes)) beta <- sig_genes %>% dplyr::select(ens_gene, !!beta_col) %>% deframe() t <- tempdir(check=TRUE) olddir <- getwd() setwd(t) prepareSPIA(db, pw_db) res <- runSPIA(de = beta, all = universe, pw_db, plots = TRUE, verbose = TRUE) setwd(olddir) file.copy(file.path(t, "SPIAPerturbationPlots.pdf"), snakemake@output[["plots"]]) write_tsv(res, snakemake@output[["table"]]) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell n = len(snakemake.input) assert n == 2, "Input must contain 2 (paired-end) elements." log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell( "cutadapt" " {snakemake.params}" " -o {snakemake.output.fastq1}" " -p {snakemake.output.fastq2}" " {snakemake.input}" " > {snakemake.output.qc} {log}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell( "cutadapt" " {snakemake.params}" " -o {snakemake.output.fastq}" " {snakemake.input[0]}" " > {snakemake.output.qc} {log}") |
Support
- Future updates
Related Workflows





