Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This workflow performs a differential expression analysis with Kallisto and Sleuth .
Authors
-
Johannes Köster (https://koesterlab.github.io)
-
David Lähnemann
Usage
In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
Step 1: Obtain a copy of this workflow
-
Create a new github repository using this workflow as a template .
-
Clone the newly created repository to your local system, into the place where you want to perform the data analysis.
Step 2: Configure workflow
Configure the workflow according to your needs via editing the file
config/config.yaml
, as well as the sample sheet
config/samples.tsv
, and the unit sheet
config/units.tsv
.
Step 3: Execute workflow
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using
$N
cores or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
If you not only want to fix the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above. See the Snakemake documentation for further details.
Step 4: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:
snakemake --report report.html
This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here .
Step 5: Commit changes
Whenever you change something, don't forget to commit the changes back to your github copy of the repository:
git commit -a
git push
Step 6: Contribute back
In case you have also changed or added steps, please consider contributing them back to the original repository:
-
Fork the original repo to a personal or lab account.
-
Clone the fork to your local system, to a different place than where you ran your analysis.
-
Copy the modified files from your analysis to the clone of your fork, e.g.,
cp -r workflow path/to/fork
. Make sure to not accidentally copy config file contents or sample sheets. -
Commit and push your changes to your fork.
-
Create a pull request against the original repository.
Testing
Test cases are in the subfolder
.test
. They are automtically executed via continuous integration with
Github Actions
.
Code Snippets
13 14 15 16 17 18 19 | run: samples_ = units[["sample", "unit"]].merge(samples, on="sample") samples_["sample"] = samples_.apply( lambda row: "{}-{}".format(row["sample"], row["unit"]), axis=1) samples_["path"] = kallisto_output del samples_["unit"] samples_.to_csv(output[0], sep="\t") |
43 44 | script: "../scripts/sleuth-init.R" |
69 70 | script: "../scripts/sleuth-diffexp.R" |
84 85 | script: "../scripts/plot-bootstrap.R" |
97 98 | script: "../scripts/plot-pca.R" |
113 114 | script: "../scripts/plot-diffexp-heatmap.R" |
128 129 | script: "../scripts/plot-diffexp-pval-hist.R" |
141 142 | script: "../scripts/sleuth-to-matrix.R" |
154 155 | script: "../scripts/plot-fld.R" |
169 170 | script: "../scripts/plot-variances.R" |
16 17 | script: "../scripts/spia.R" |
58 59 | script: "../scripts/fgsea.R" |
80 81 | script: "../scripts/fgsea_plot_gene_set.R" |
94 95 | script: "../scripts/biomart-ens_gene_to_go.R" |
107 108 | shell: "( curl --silent -o {output} {params.download} ) 2> {log}" |
137 138 | script: "../scripts/goatools-go-enrichment-analysis.py" |
10 11 | shell: "kallisto index -i {output} {input} 2> {log}" |
39 40 41 | shell: "kallisto quant -i {input.idx} -o {output} " "{params.extra} {input.fq} 2> {log}" |
12 13 | wrapper: "0.31.1/bio/cutadapt/pe" |
26 27 | wrapper: "0.31.1/bio/cutadapt/se" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("biomaRt") library("tidyverse") # create an ensembl biomart for the species specified in the params field mart <- biomaRt::useMart( biomart = "ENSEMBL_MART_ENSEMBL", dataset = str_c(snakemake@params[["species"]], "_gene_ensembl"), host = 'ensembl.org') # get all ensembl_gene_id - go_id pairs and collapse into key->values mapping: # ensembl_gene_id -> go_id;go_id;go_id ens_gene_to_go <- biomaRt::getBM(attributes = c("ensembl_gene_id", "go_id"), mart = mart) %>% # rows with empty go_id are useless and will lead to extra semicolons below filter(go_id != "") %>% group_by(ensembl_gene_id) %>% summarise(go_ids = str_c(go_id, collapse = ";")) write_tsv(ens_gene_to_go, path = snakemake@output[[1]], col_names = FALSE) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("fgsea") # provides library("tidyverse") and function get_prefix_col() # the latter requires snakemake@output[["samples"]] and # snakemake@params[["covariate"]] source( file.path(snakemake@scriptdir, 'common.R') ) gene_sets <- gmtPathways(snakemake@input[["gene_sets"]]) sig_gene_sets <- read_tsv(snakemake@input[["sig_gene_sets"]]) diffexp <- read_tsv(snakemake@input[["diffexp"]]) %>% drop_na(ext_gene) %>% mutate(ext_gene = str_to_upper(ext_gene)) %>% group_by(ext_gene) %>% filter( qval == min(qval, na.rm = TRUE) ) %>% mutate(target_id = str_c(target_id, collapse=",")) %>% mutate(ens_gene = str_c(ens_gene, collapse=",")) %>% distinct() signed_pi <- get_prefix_col("signed_pi_value", colnames(diffexp)) ranked_genes <- diffexp %>% dplyr::select(ext_gene, !!signed_pi) %>% deframe() set <- snakemake@wildcards[["gene_set"]] # plot gene set enrichment p <- plotEnrichment(gene_sets[[set]], ranked_genes) + ggtitle(str_c("gene set: ", set), subtitle = str_c( sig_gene_sets %>% filter(pathway == set) %>% pull(size), " genes; ", "p-value (BH-adjusted): ", sig_gene_sets %>% filter(pathway == set) %>% pull(padj), "\n", "normalized enrichment score (NES):", sig_gene_sets %>% filter(pathway == set) %>% pull(NES) ) ) + xlab("gene rank") + theme_bw(base_size=16) ggsave(snakemake@output[[1]], width=10, height=7) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("fgsea") library("AnnotationDbi") # provides library("tidyverse") and function get_prefix_col() # the latter requires snakemake@output[["samples"]] and # snakemake@params[["covariate"]] source( file.path(snakemake@scriptdir, 'common.R') ) # get the species for adding annotations below species <- str_to_title( str_sub(snakemake@params[["species"]], 1, 2) ) # construct the respective package name pkg <- str_c("org.", species, ".eg.db") # check if the package is installed in the fgsea environment and load, # give a useful error message otherwise; installed per default are: # * org.Mm.eg.db # * org.Hs.eg.db if ( pkg %in% rownames( installed.packages() ) ) { library(pkg, character.only = TRUE) } else { stop( str_c( "\n", "Package not installed: ", pkg, "\n", "Package name inferred from species name: ", species, "\n", "Check if package 'bioconductor-", pkg, "' exists and add to envs/fgsea.yaml\n", "\n" ) ) } gene_sets <- gmtPathways(snakemake@input[["gene_sets"]]) diffexp <- read_tsv(snakemake@input[["diffexp"]]) %>% drop_na(ext_gene) %>% mutate(ext_gene = str_to_upper(ext_gene)) %>% group_by(ext_gene) %>% filter( qval == min(qval, na.rm = TRUE) ) %>% mutate(target_id = str_c(target_id, collapse=",")) %>% mutate(ens_gene = str_c(ens_gene, collapse=",")) %>% distinct() signed_pi <- get_prefix_col("signed_pi_value", colnames(diffexp)) ranked_genes <- diffexp %>% dplyr::select(ext_gene, !!signed_pi) %>% deframe() # get and write out rank values that are tied -- a way to check up on respecitve warnings rank_ties <- enframe(ranked_genes) %>% mutate(dup = duplicated(value) | duplicated(value, fromLast = TRUE) ) %>% filter(dup == TRUE) %>% dplyr::select(ext_gene = name, !!signed_pi := value) write_tsv(rank_ties, snakemake@output[["rank_ties"]]) fgsea_res <- fgsea(pathways = gene_sets, stats = ranked_genes, minSize=10, maxSize=700, nproc=snakemake@threads, nperm=snakemake@params[["nperm"]] ) %>% as_tibble() # Annotation is impossible without any entries, so then just write out empty files if ( (fgsea_res %>% count() %>% pull(n)) == 0 ) { write_tsv(fgsea_res, path = snakemake@output[["enrichment"]]) write_tsv(fgsea_res, path = snakemake@output[["significant"]]) } else { # add further annotation annotated <- fgsea_res %>% unnest(leadingEdge) %>% mutate( leading_edge_symbol = str_to_title(leadingEdge), leading_edge_entrez_id = mapIds(leading_edge_symbol, x=get(pkg), keytype="SYMBOL", column="ENTREZID"), leading_edge_ens_gene = mapIds(leading_edge_symbol, x=get(pkg), keytype="SYMBOL", column="ENSEMBL") ) %>% group_by(pathway) %>% summarise( leadingEdge = str_c(leadingEdge, collapse = ','), leading_edge_symbol = str_c(leading_edge_symbol, collapse = ','), leading_edge_entrez_id = str_c(leading_edge_entrez_id, collapse = ','), leading_edge_ens_gene = str_c(leading_edge_ens_gene, collapse = ',') ) %>% inner_join(fgsea_res %>% select(-leadingEdge), by = "pathway") %>% select(-leadingEdge, -leading_edge_symbol, -leading_edge_entrez_id, -leading_edge_ens_gene, leading_edge_symbol, leading_edge_ens_gene, leading_edge_entrez_id, leadingEdge) # write out fgsea results for all gene sets write_tsv(annotated, path = snakemake@output[["enrichment"]]) # select significant pathways sig_gene_sets <- annotated %>% filter( padj < snakemake@params[["gene_set_fdr"]] ) # write out fgsea results for gene sets found to be significant write_tsv(sig_gene_sets, path = snakemake@output[["significant"]]) } height = .7 * (length(gene_sets) + 2) # table plot of all gene sets tg <- plotGseaTable( pathway = gene_sets, stats = ranked_genes, fgseaRes = fgsea_res, gseaParam = 1, render = FALSE ) ggsave(filename = snakemake@output[["plot"]], plot = tg, width = 12, height = height, limitsize=FALSE) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 | import sys sys.stderr = open(snakemake.log[0], "w") import pandas as pd import matplotlib.pyplot as plt from goatools.obo_parser import GODag from goatools.anno.idtogos_reader import IdToGosReader from goatools.goea.go_enrichment_ns import GOEnrichmentStudyNS from goatools.godag_plot import plot_results#, plot_goid2goobj, plot_gos # read in directed acyclic graph of GO terms / IDs obodag = GODag(snakemake.input.obo) # read in mapping gene ids from input to GO terms / IDs objanno = IdToGosReader(snakemake.input.ens_gene_to_go, godag = obodag) # extract namespace(?) -> id2gos mapping ns2assoc = objanno.get_ns2assc() for nspc, id2gos in ns2assoc.items(): print("{NS} {N:,} annotated mouse genes".format(NS=nspc, N=len(id2gos))) # read gene diffexp table all_genes = pd.read_table(snakemake.input.diffexp) # select genes significantly differentially expressed according to BH FDR of sleuth fdr_level_gene = float(snakemake.params.gene_fdr) sig_genes = all_genes[all_genes['qval']<fdr_level_gene] # initialize GOEA object fdr_level_go_term = float(snakemake.params.go_term_fdr) goeaobj = GOEnrichmentStudyNS( # list of 'population' of genes looked at in total pop = all_genes['ens_gene'].tolist(), # geneid -> GO ID mapping ns2assoc = ns2assoc, # ontology DAG godag = obodag, propagate_counts = False, # multiple testing correction method (fdr_bh is false discovery rate control with Benjamini-Hochberg) methods = ['fdr_bh'], # significance cutoff for method named above alpha = fdr_level_go_term ) goea_results_all = goeaobj.run_study(sig_genes['ens_gene'].tolist()) # write results to text file goeaobj.wr_tsv(snakemake.output.enrichment, goea_results_all) # plot results ensembl_id_to_symbol = dict(zip(all_genes['ens_gene'], all_genes['ext_gene'])) # from first plot output file name, create generic file name to trigger # separate plots for each of the gene ontology name spaces outplot_generic = snakemake.output.plot[0].replace('_BP.','_{NS}.', 1).replace('_CC.','_{NS}.', 1).replace('_MF.', '_{NS}.', 1) goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < fdr_level_go_term] plot_results( outplot_generic, # use pvals for coloring goea_results=goea_results_sig, # print general gene symbol instead of Ensembl ID id2symbol=ensembl_id_to_symbol, # number of genes to print, or True for printing all (including count of genes) study_items=True, # number of genes to print per line items_p_line=6, # p-values to use for coloring of GO term nodes (argument name determined from code and testing against value "p_uncorrected") pval_name="p_fdr_bh" ) # for all name spaces for ns in ns2assoc.keys(): # check if no GO terms were found to be significant if len([r for r in goea_results_sig if r.NS == ns]) == 0: fig = plt.figure(figsize=(12, 2)) text = fig.text(0.5, 0.5, "No plot generated, because no GO terms were found significant\n" "for name space {} and significance levels: genes ({}), GO terms ({}).\n" "You might want to check those levels and/or your intermediate data.".format( ns, fdr_level_gene, fdr_level_go_term), ha='center', va='center', size=20) fig.savefig( outplot_generic.replace('_{NS}.', "_{}.".format(ns)) ) |
1 2 3 4 5 6 7 8 9 10 11 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") so <- sleuth_load(snakemake@input[[1]]) pdf(file = snakemake@output[[1]]) plot_bootstrap(so, snakemake@wildcards[["transcript"]], color_by = snakemake@params[["color_by"]], units = "tpm") dev.off() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") so <- sleuth_load(snakemake@input[["so"]]) diffexp <- read.table(snakemake@input[["diffexp"]], sep = "\t", header = TRUE) pdf(file = snakemake@output[[1]], width = 14) plot_transcript_heatmap(so, transcripts = diffexp$target_id[1:20]) # TODO, once pachterlab/sleuth#214 is merged, add this to get gene names # , labels_row = diffexp$ext_gene[1:20]) dev.off() |
1 2 3 4 5 6 7 8 9 10 11 12 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") library("ggplot2") library("tidyr") diffexp <- sleuth_load(snakemake@input[["diffexp_rds"]]) %>% drop_na(pval) ggplot(diffexp) + geom_histogram(aes(pval), bins = 100) ggsave(file = snakemake@output[[1]], width = 14) |
1 2 3 4 5 6 7 8 9 10 11 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") so <- sleuth_load(snakemake@input[[1]]) pdf(file = snakemake@output[[1]]) plot_fld(so, paste0(snakemake@wildcards[["sample"]], "-", snakemake@wildcards[["unit"]])) dev.off() |
1 2 3 4 5 6 7 8 9 10 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") so <- sleuth_load(snakemake@input[[1]]) pdf(file = snakemake@output[[1]]) plot_pca(so, color_by = snakemake@wildcards[["covariate"]], units = "tpm", text_labels = TRUE) dev.off() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") sl <- snakemake@params[['sig_level']] so <- sleuth_load(snakemake@input[[1]]) # colors from colorblind-safe Brewer palette "Dark2": # http://colorbrewer2.org/#type=qualitative&scheme=Dark2&n=3 cb_safe_green <- "#1B9E77" cb_safe_red <- "#D95F02" p <- plot_vars(so, test = NULL, test_type = "lrt", which_model = "full", sig_level = sl, point_alpha = 0.2, sig_color = cb_safe_red, xy_line = TRUE, xy_line_color = cb_safe_red, highlight = NULL, highlight_color = cb_safe_green ) ggsave(filename = snakemake@output[[1]], width = 7, height = 7) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") library("tidyverse") model <- snakemake@params[["model"]] so <- sleuth_load(snakemake@input[[1]]) so <- sleuth_fit(so, as.formula(model[["full"]]), 'full') so <- sleuth_fit(so, as.formula(model[["reduced"]]), 'reduced') so <- sleuth_lrt(so, "reduced", "full") write_results <- function(mode, output, output_all) { aggregate <- FALSE if(mode == "aggregate") { aggregate <- TRUE } print("Performing likelihood ratio test") all <- sleuth_results(so, "reduced:full", "lrt", show_all = TRUE, pval_aggregate = aggregate) %>% arrange(pval) covariates <- colnames(design_matrix(so, "full")) covariates <- covariates[covariates != "(Intercept)"] # iterate over all covariates and perform wald test in order to obtain beta estimates if(!aggregate) { for(covariate in covariates) { print(str_c("Performing wald test for ", covariate)) so <- sleuth_wt(so, covariate, "full") beta_col_name <- str_c("b", covariate, sep = "_") beta_se_col_name <- str_c(beta_col_name, "se", sep = "_") all_wald <- sleuth_results(so, covariate, "wt", show_all = TRUE, pval_aggregate = FALSE) %>% select( target_id = target_id, !!beta_col_name := b, !!beta_se_col_name := se_b) signed_pi_col_name <- str_c("signed_pi_value", covariate, sep = "_") all <- inner_join(all, all_wald, by = "target_id") %>% # calculate a signed version of the pi-value from: # https://dx.doi.org/10.1093/bioinformatics/btr671 # e.g. useful for GSEA ranking mutate( !!signed_pi_col_name := -log10(pval) * !!sym(beta_col_name) ) } } if(mode == "mostsignificant") { # for each gene, select the most significant transcript (this is equivalent to sleuth_gene_table, but with bug fixes) all <- all %>% drop_na(ens_gene) %>% group_by(ens_gene) %>% filter( qval == min(qval, na.rm = TRUE) ) %>% # ties in qval (e.g. min(qval) == 1) can lead to multiple entries per gene filter( pval == min(pval, na.rm = TRUE) ) %>% # for min(qval) == 1, then usually also min(pval) == 1, and # we need something else to select a unique entry per gene filter( mean_obs == max(mean_obs, na.rm = TRUE) ) %>% # sometimes we still get two transcript variants with exactly # the same values, i.e. they have exactly the same sequence # but (slightly) different annotations -- then we retain a string # with a comma-separated list of them mutate(target_id = str_c(target_id, collapse=",")) %>% distinct() %>% # useful sort for scrolling through output by increasing q-values arrange(qval) } write_tsv(all, path = output, quote_escape = "none") write_rds(all, path = output_all, compress = "none") } write_results("transcripts", snakemake@output[["transcripts"]], snakemake@output[["transcripts_rds"]]) write_results("aggregate", snakemake@output[["genes_aggregated"]], snakemake@output[["genes_aggregated_rds"]]) write_results("mostsignificant", snakemake@output[["genes_mostsigtrans"]], snakemake@output[["genes_mostsigtrans_rds"]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") library("biomaRt") library("tidyverse") model <- snakemake@params[["model"]] mart <- biomaRt::useMart( biomart = "ENSEMBL_MART_ENSEMBL", dataset = str_c(snakemake@params[["species"]], "_gene_ensembl"), host = 'ensembl.org' ) t2g <- biomaRt::getBM( attributes = c( "ensembl_transcript_id", "ensembl_gene_id", "external_gene_name"), mart = mart ) %>% rename( target_id = ensembl_transcript_id, ens_gene = ensembl_gene_id, ext_gene = external_gene_name ) samples <- read_tsv(snakemake@input[["samples"]], na = "", col_names = TRUE) %>% # make everything except the index, sample name and path string a factor mutate_at( vars(-X1, -sample, -path), list(~factor(.)) ) if(!is.null(snakemake@params[["exclude"]])) { samples <- samples %>% filter( !sample %in% snakemake@params[["exclude"]] ) } if(!is.null(model)) { # retrieve the model formula formula <- as.formula(snakemake@params[["model"]]) # extract variables from the formula and unnest any nested variables variables <- labels(terms(formula)) %>% strsplit('[:*]') %>% unlist() # select the columns required by sleuth and filter to all samples where # none of the given variables are NA samples <- samples %>% select(sample, path, condition, variables) %>% drop_na() } so <- sleuth_prep( samples, extra_bootstrap_summary = TRUE, target_mapping = t2g, aggregation_column = "ens_gene", read_bootstrap_tpm = TRUE, transform_fun_counts = function(x) log2(x + 0.5) ) custom_transcripts <- so$obs_raw %>% # find transcripts not in the target_mapping filter(!target_id %in% so$target_mapping$target_id) %>% # make it a succinct list with no repititions distinct(target_id) %>% # pull it out into a vector pull(target_id) so$target_mapping <- so$target_mapping %>% # add those custom transcripts as rows to the target mapping add_row(ens_gene = NA, ext_gene = "Custom", target_id = custom_transcripts) sleuth_save(so, snakemake@output[[1]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("sleuth") so <- sleuth_load(snakemake@input[[1]]) tpm <- sleuth_to_matrix(so, "obs_norm", "tpm") target_mapping <- so$target_mapping rownames(target_mapping) <- target_mapping$target_id tpm <- cbind(ext_gene = target_mapping[rownames(tpm), "ext_gene"], tpm) write.table(tpm, file = snakemake@output[[1]], col.names = NA, row.names = TRUE) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("SPIA") library("graphite") library("AnnotationDbi") # provides library("tidyverse") and function get_prefix_col() # the latter requires snakemake@output[["samples"]] and # snakemake@params[["covariate"]] source( file.path(snakemake@scriptdir, 'common.R') ) options(Ncpus = snakemake@threads) pw_db <- snakemake@params[["pathway_db"]] db <- pathways(snakemake@params[["species"]], pw_db) db <- convertIdentifiers(db, "ENSEMBL") prepareSPIA(db, pw_db) diffexp <- read_tsv(snakemake@input[["diffexp"]]) %>% drop_na(ens_gene) %>% mutate(ens_gene = str_c("ENSEMBL", ens_gene)) universe <- diffexp %>% pull(var = ens_gene) sig_genes <- diffexp %>% filter(qval <= 0.05) # get logFC equivalent (the sum of beta scores of covariates of interest) beta_col <- get_prefix_col("b", colnames(sig_genes)) beta <- sig_genes %>% select(ens_gene, !!beta_col) %>% deframe() res <- runSPIA(de = beta, all = universe, pw_db, plots = TRUE) write_tsv(res, snakemake@output[[1]]) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell n = len(snakemake.input) assert n == 2, "Input must contain 2 (paired-end) elements." log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell( "cutadapt" " {snakemake.params}" " -o {snakemake.output.fastq1}" " -p {snakemake.output.fastq2}" " {snakemake.input}" " > {snakemake.output.qc} {log}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell( "cutadapt" " {snakemake.params}" " -o {snakemake.output.fastq}" " {snakemake.input[0]}" " > {snakemake.output.qc} {log}") |
Support
- Future updates
Related Workflows





