# Snakemake template for building reusable and scalable machine learning pipelines with mikropml
Snakemake is a workflow manager that enables massively parallel and reproducible analyses. Snakemake is a good fit when you can break a workflow down into discrete steps, each with its own input and output files.
mikropml is an R package for supervised machine learning pipelines. We provide this example workflow as a template for running mikropml with Snakemake; we hope you will then customize the code to meet the needs of your particular ML task.
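At its core, each training job in the workflow wraps a call like the following (a minimal sketch based on mikropml's introductory documentation; `otu_small` is an example dataset bundled with mikropml whose outcome column is `dx`):

```r
library(mikropml)

# Train and evaluate one model: run_ml() splits the data, tunes
# hyperparameters with cross-validation, and reports performance
# on the held-out test set.
results <- run_ml(otu_small,
  method = "glmnet",
  outcome_colname = "dx",
  seed = 2019
)
results$performance # AUROC and other metrics for this seed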
For more details on these tools, see the Snakemake tutorial and read the mikropml docs.
## The Workflow
The `Snakefile` contains rules which define the output files we want and how to make them. Snakemake automatically builds a directed acyclic graph (DAG) of jobs to figure out each rule's dependencies and the order in which to run them.
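For example, a rule in this style might look like the following sketch (the file names are hypothetical, not copied from this workflow's `Snakefile`):

```python
# Hypothetical rule for illustration: when results/processed.Rds is requested,
# Snakemake runs the R script, exposing `input`, `output`, and any `params`
# to it via the `snakemake` S4 object.
rule preprocess_data:
    input:
        csv="data/example_dataset.csv",
    output:
        rds="results/processed.Rds",
    script:
        "scripts/preproc.R"
```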
This workflow preprocesses the example dataset, calls `mikropml::run_ml()` for each seed and ML method set in the config file, combines the results files, plots performance results (cross-validation and test AUROCs, hyperparameter AUROCs from cross-validation, and benchmark performance), and renders a simple R Markdown report as a GitHub-flavored markdown file (see example here).
The DAG shows how calls to `run_ml` can run in parallel if Snakemake is allowed to run more than one job at a time. If we use 100 seeds and 4 ML methods, Snakemake would call `run_ml` 400 times.
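That fan-out is typically written with Snakemake's `expand()` helper, along the lines of this sketch (the target file pattern and the method/seed values are illustrative, not necessarily the ones this workflow uses):

```python
# Hypothetical target rule: expand() generates one run_ml output file per
# (method, seed) combination, so 4 methods x 100 seeds = 400 parallel jobs.
rule targets:
    input:
        expand(
            "results/runs/{method}_{seed}_performance.csv",
            method=["glmnet", "rf", "svmRadial", "xgbTree"],  # example methods
            seed=range(100),  # example: seeds 0-99
        ),
```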
Here's a small example DAG if we were to use only 2 seeds and 1 ML method: [example DAG figure]
## Usage
Full usage instructions recommended by Snakemake are available in the snakemake workflow catalog.
The Snakemake developers recommend `snakedeploy` for using this workflow as a module in your own project.
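A deployment command might look like the following sketch (the repository URL, destination directory, and branch are assumptions; check the workflow catalog entry for the exact command and a current release tag):

```sh
# Illustrative snakedeploy invocation: pull this workflow into a new project
# directory as a module. The URL and branch here are assumptions.
snakedeploy deploy-workflow \
    https://github.com/SchlossLab/mikropml-snakemake-workflow \
    my-mikropml-project --branch main
```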
Alternatively, you can download this repo and modify the code directly to suit your needs. See instructions here.
## Help & Contributing
If you come across a bug, open an issue and include a minimal reproducible example.
If you have questions, create a new post in Discussions.
If you’d like to contribute, see our guidelines here.
## Code of Conduct
Please note that the mikropml-snakemake-workflow is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
## Code Snippets
Excerpts from the workflow's rule files; each rule delegates to an R script, the R Markdown report, or a shell command:

```python
script: "../scripts/combine_results.R"

script: "../scripts/combine_hp_perf.R"

script: "../scripts/mutate_benchmark.R"

shell:
    """
    for f in {input.figs}; do
        cp $f {params.outdir}
    done
    """

script: "../scripts/report.Rmd"

script: "../scripts/preproc.R"

script: "../scripts/train_ml.R"

script: "../scripts/find_feature_importance.R"

script: "../scripts/calc_model_sensspec.R"

script: "../scripts/plot_performance.R"

script: "../scripts/plot_feature_importance.R"

script: "../scripts/make_blank_plot.R"

script: "../scripts/plot_hp_perf.R"

script: "../scripts/plot_benchmarks.R"

script: "../scripts/plot_roc_curves.R"

shell:
    """
    snakemake --{wildcards.cmd} --configfile {params.config_path} 2> {log} > {output.dot}
    """

shell:
    """
    cat {input.dot} | dot -T png 2> {log} > {output.png}
    """
```
`scripts/calc_model_sensspec.R`:

```r
schtools::log_snakemake()
library(tidyverse)

model <- read_rds(snakemake@input[["model"]])
test_dat <- read_csv(snakemake@input[["test"]])
outcome_colname <- snakemake@params[["outcome_colname"]]

# calculate sensitivity & specificity for the trained model on the test set
mikropml::calc_model_sensspec(
  model,
  test_dat,
  outcome_colname
) %>%
  bind_cols(schtools::get_wildcards_tbl()) %>%
  write_csv(snakemake@output[["csv"]])
```
`scripts/combine_hp_perf.R`:

```r
schtools::log_snakemake()

# combine hyperparameter performance across all seeds for one ML method
models <- lapply(snakemake@input[["rds"]], function(x) readRDS(x))
hp_perf <- mikropml::combine_hp_performance(models)
hp_perf$method <- snakemake@wildcards[["method"]]
saveRDS(hp_perf, file = snakemake@output[["rds"]])
```
`scripts/combine_results.R`:

```r
schtools::log_snakemake()
library(dplyr)

# concatenate the per-run CSV files into one
snakemake@input[["csv"]] %>%
  purrr::map_dfr(readr::read_csv) %>%
  readr::write_csv(snakemake@output[["csv"]])
```
`scripts/find_feature_importance.R`:

```r
schtools::log_snakemake()
library(mikropml)
library(dplyr)
library(readr)

# parallelize over multiple cores if available
doFuture::registerDoFuture()
future::plan(future::multicore, workers = snakemake@threads)
message(paste("# workers: ", foreach::getDoParWorkers()))

model <- readRDS(snakemake@input[["model"]])
outcome_colname <- snakemake@params[["outcome_colname"]]

# recover the training data from the caret model object
train_dat <- model$trainingData
names(train_dat)[names(train_dat) == ".outcome"] <- outcome_colname

test_dat <- read_csv(snakemake@input[["test"]])
method <- snakemake@params[["method"]]
seed <- as.numeric(snakemake@params[["seed"]])

# choose the performance metric based on the outcome type
outcome_type <- get_outcome_type(c(
  train_dat %>% pull(outcome_colname),
  test_dat %>% pull(outcome_colname)
))
class_probs <- outcome_type != "continuous"
perf_metric_function <- get_perf_metric_fn(outcome_type)
perf_metric_name <- get_perf_metric_name(outcome_type)

if (!is.na(seed)) {
  set.seed(seed)
}

feat_imp <- mikropml::get_feature_importance(
  trained_model = model,
  test_data = test_dat,
  outcome_colname = outcome_colname,
  perf_metric_function = perf_metric_function,
  perf_metric_name = perf_metric_name,
  class_probs = class_probs,
  method = method,
  seed = seed
)

# attach the wildcard values (method, seed) and write out
wildcards <- schtools::get_wildcards_tbl()
readr::write_csv(
  feat_imp %>% inner_join(wildcards, by = c("method", "seed")),
  snakemake@output[["feat"]]
)
```
`scripts/make_blank_plot.R`:

```r
schtools::log_snakemake()
library(ggplot2)

# create a tiny empty plot as a placeholder output file
message("making a blank plot")
ggsave(
  filename = snakemake@output[["plot"]],
  plot = ggplot() + theme_void(),
  height = 0.1,
  width = 0.1,
  device = "png"
)
```
`scripts/mutate_benchmark.R`:

```r
schtools::log_snakemake()
library(tidyverse)

# convert a snakemake benchmark TSV to CSV, tagged with the wildcard values
wildcards <- schtools::get_wildcards_tbl()
read_tsv(snakemake@input[["tsv"]]) %>%
  bind_cols(wildcards) %>%
  write_csv(snakemake@output[["csv"]])
```
`scripts/plot_benchmarks.R`:

```r
schtools::log_snakemake()
library(tidyverse)

# read the combined benchmark results, declaring column types up front
dat <- read_csv(snakemake@input[["csv"]],
  col_types = cols(
    s = col_double(),
    `h:m:s` = col_time(format = "%H:%M:%S"),
    max_rss = col_double(),
    max_vms = col_double(),
    max_uss = col_double(),
    max_pss = col_double(),
    io_in = col_double(),
    io_out = col_double(),
    mean_load = col_double(),
    cpu_time = col_double(),
    method = col_character(),
    seed = col_double()
  )
) %>%
  mutate(
    runtime_mins = s / 60,
    memory_gb = max_rss / 1024
  ) %>%
  select(method, runtime_mins, memory_gb) %>%
  pivot_longer(-method, names_to = "metric") %>%
  mutate(value = round(value, 2)) %>%
  group_by(method)

# boxplots of runtime and memory use per ML method
bench_plot <- dat %>%
  ggplot(aes(method, value)) +
  geom_boxplot() +
  facet_wrap(metric ~ ., scales = "free") +
  theme_classic() +
  labs(y = "", x = "") +
  coord_flip()

ggsave(snakemake@output[["plot"]], plot = bench_plot)
```
`scripts/plot_feature_importance.R`:

```r
schtools::log_snakemake()
library(dplyr)
library(ggplot2)
library(schtools)

feat_df <- readr::read_csv(snakemake@input[["csv"]])
top_n <- as.numeric(snakemake@params[["top_n"]])

# find the top features by median performance difference for each method
top_feats <- feat_df %>%
  group_by(method, names) %>%
  summarize(median_diff = median(perf_metric_diff)) %>%
  slice_max(order_by = median_diff, n = top_n)

feat_plot <- feat_df %>%
  right_join(top_feats, by = c("method", "names")) %>%
  mutate(features = factor(names, levels = unique(top_feats$names))) %>%
  ggplot(aes(x = perf_metric_diff, y = features, color = method)) +
  geom_boxplot() +
  facet_wrap(~method) +
  theme_sovacool()

ggsave(
  filename = snakemake@output[["plot"]],
  plot = feat_plot,
  device = "png"
)
```
`scripts/plot_hp_perf.R` (note: `ggsave()` is given the grid explicitly so the combined plot, not the last sub-plot, is saved):

```r
schtools::log_snakemake()

hp_perf <- readRDS(snakemake@input[["rds"]])

# one performance plot per hyperparameter, combined into a grid
hp_plot_list <- lapply(hp_perf$params, function(param) {
  mikropml::plot_hp_performance(
    hp_perf$dat,
    !!rlang::sym(param),
    !!rlang::sym(hp_perf$metric)
  ) +
    ggplot2::theme_classic() +
    ggplot2::scale_color_brewer(palette = "Dark2") +
    ggplot2::labs(title = unique(hp_perf$method))
})
hp_plot <- cowplot::plot_grid(plotlist = hp_plot_list)
ggplot2::ggsave(snakemake@output[["plot"]], plot = hp_plot)
```
`scripts/plot_performance.R`:

```r
schtools::log_snakemake()
library(tidyverse)

# plot cross-validation and test performance for each ML method
perf_plot <- snakemake@input[["csv"]] %>%
  read_csv() %>%
  mikropml::plot_model_performance() +
  theme_classic() +
  scale_color_brewer(palette = "Dark2") +
  coord_flip()

ggsave(snakemake@output[["plot"]], plot = perf_plot)
```
`scripts/plot_roc_curves.R`:

```r
schtools::log_snakemake()
library(patchwork)
library(tidyverse)

dat <- read_csv(snakemake@input[["csv"]])

# summarize a performance metric (mean +/- sd) across runs,
# optionally grouped by an additional variable such as method
calc_mean_perf <- function(sensspec_dat, group_var = specificity,
                           sum_var = sensitivity, custom_group_vars = NULL) {
  specificity <- sensitivity <- sd <- NULL
  dat_round <- sensspec_dat %>%
    dplyr::mutate({{ group_var }} := round({{ group_var }}, 2))
  if (!is.null(custom_group_vars)) {
    dat_grouped <- dat_round %>%
      dplyr::group_by({{ group_var }}, !!rlang::sym(custom_group_vars))
  } else {
    dat_grouped <- dat_round %>%
      dplyr::group_by({{ group_var }})
  }
  return(
    dat_grouped %>%
      dplyr::summarise(
        mean = mean({{ sum_var }}),
        sd = stats::sd({{ sum_var }})
      ) %>%
      dplyr::mutate(
        upper = mean + sd,
        lower = mean - sd,
        upper = dplyr::case_when(upper > 1 ~ 1, TRUE ~ upper),
        lower = dplyr::case_when(lower < 0 ~ 0, TRUE ~ lower)
      ) %>%
      dplyr::rename(
        "mean_{{ sum_var }}" := mean,
        "sd_{{ sum_var }}" := sd
      )
  )
}

calc_mean_roc <- function(sensspec_dat, custom_group_vars = NULL) {
  specificity <- sensitivity <- NULL
  return(calc_mean_perf(sensspec_dat,
    group_var = specificity, sum_var = sensitivity,
    custom_group_vars = custom_group_vars
  ))
}

calc_mean_prc <- function(sensspec_dat, custom_group_vars = NULL) {
  sensitivity <- recall <- precision <- NULL
  return(calc_mean_perf(
    sensspec_dat %>% dplyr::rename(recall = sensitivity),
    group_var = recall, sum_var = precision,
    custom_group_vars = custom_group_vars
  ))
}

# ggplot layers shared by the ROC and PRC plots
shared_ggprotos <- function(colorvar) {
  return(list(
    ggplot2::geom_ribbon(aes(fill = {{ colorvar }}), alpha = 0.5),
    ggplot2::geom_line(aes(color = {{ colorvar }})),
    ggplot2::coord_equal(),
    ggplot2::scale_y_continuous(expand = c(0, 0), limits = c(-0.01, 1.01)),
    ggplot2::theme_bw(),
    ggplot2::theme(legend.title = ggplot2::element_blank())
  ))
}

plot_mean_roc <- function(dat) {
  specificity <- mean_sensitivity <- lower <- upper <- NULL
  dat %>%
    ggplot2::ggplot(ggplot2::aes(
      x = specificity, y = mean_sensitivity, ymin = lower, ymax = upper
    )) +
    shared_ggprotos(colorvar = method) +
    ggplot2::geom_abline(
      intercept = 1, slope = 1,
      linetype = "dashed", color = "grey50"
    ) +
    ggplot2::scale_x_reverse(expand = c(0, 0), limits = c(1.01, -0.01)) +
    ggplot2::labs(x = "Specificity", y = "Mean Sensitivity")
}

plot_mean_prc <- function(dat, baseline_precision = NULL) {
  recall <- mean_precision <- lower <- upper <- NULL
  prc_plot <- dat %>%
    ggplot2::ggplot(ggplot2::aes(
      x = recall, y = mean_precision, ymin = lower, ymax = upper
    )) +
    shared_ggprotos(colorvar = method) +
    ggplot2::scale_x_continuous(expand = c(0, 0), limits = c(-0.01, 1.01)) +
    ggplot2::labs(x = "Recall", y = "Mean Precision")
  if (!is.null(baseline_precision)) {
    prc_plot <- prc_plot +
      ggplot2::geom_hline(
        yintercept = baseline_precision,
        linetype = "dashed", color = "grey50"
      )
  }
  return(prc_plot)
}

# combine the mean ROC and PRC plots side-by-side with patchwork
p <- (dat %>% calc_mean_roc(custom_group_vars = "method") %>% plot_mean_roc()) +
  (dat %>% calc_mean_prc(custom_group_vars = "method") %>% plot_mean_prc() +
    theme(legend.position = "none"))

ggsave(
  filename = snakemake@output[["plot"]],
  plot = p, device = "png", height = 4, width = 6
)
```
`scripts/preproc.R`:

```r
schtools::log_snakemake()
library(mikropml)

# parallelize preprocessing if multiple threads are available
doFuture::registerDoFuture()
future::plan(future::multicore, workers = snakemake@threads)

data_raw <- readr::read_csv(snakemake@input[["csv"]])
data_processed <- preprocess_data(data_raw,
  outcome_colname = snakemake@params[["outcome_colname"]]
)
saveRDS(data_processed, file = snakemake@output[["rds"]])
```
`scripts/report.Rmd` (code chunks):

```r
schtools::set_knitr_opts()
library(knitr)

include_graphics(snakemake@input[['rulegraph']])
include_graphics(snakemake@input[['perf_plot']])
include_graphics(snakemake@input[['roc_plot']])
include_graphics(snakemake@input[['hp_plot']])

# only include the feature importance section if it was computed
if (isTRUE(snakemake@params[['find_feature_importance']])) {
  cat("## Feature Importance")
}
include_graphics(snakemake@input[['feat_plot']])

include_graphics(snakemake@input[['bench_plot']])
```
`scripts/train_ml.R`:

```r
schtools::log_snakemake()
library(dplyr)

# parallelize model training if multiple threads are available
doFuture::registerDoFuture()
future::plan(future::multicore, workers = snakemake@threads)

method <- snakemake@params[["method"]]
seed <- as.numeric(snakemake@params[["seed"]])
hyperparams <- snakemake@params[["hyperparams"]][[method]]
data_processed <- readRDS(snakemake@input[["rds"]])$dat_transformed

# train & evaluate one model for this method/seed combination
ml_results <- mikropml::run_ml(
  dataset = data_processed,
  method = method,
  outcome_colname = snakemake@params[["outcome_colname"]],
  find_feature_importance = FALSE,
  kfold = as.numeric(snakemake@params[["kfold"]]),
  seed = seed,
  hyperparameters = hyperparams
)

# write the performance results, test data, and trained model
wildcards <- schtools::get_wildcards_tbl()
readr::write_csv(
  ml_results$performance %>% inner_join(wildcards, by = c("method", "seed")),
  snakemake@output[["perf"]]
)
readr::write_csv(ml_results$test_data, snakemake@output[["test"]])
saveRDS(ml_results$trained_model, file = snakemake@output[["model"]])
```
From the `Snakefile`:

```python
script: "scripts/report.Rmd"

shell:
    """
    zip -r {output} {input} 2> {log}
    """
```