Snakemake workflow for orchestration of the LEGEND simulation chain
End-to-end Snakemake workflow to run Monte Carlo simulations of signal and background signatures in the LEGEND experiment and produce probability-density functions (pdfs). Configuration metadata (e.g. rules for generating simulation macros or post-processing settings) is stored at legend-simflow-config .
Key concepts

- Simulations are labeled by a unique identifier (e.g. l200a-hpge-bulk-2vbb), often referred to as simid (simID). The identifiers are defined in legend-simflow-config through simconfig.json files in the raw and ver tier directories.
- Each simulation is defined by a template macro (also stored as metadata) and by a set of rules (in simconfig.json) needed to generate the actual macros (template variable substitutions, number of primaries, number of jobs, etc.). An illustrative entry is sketched after this list.
- The production is organized in tiers. The state of a simulation in a certain tier is labeled as <tier>.<simid>. Snakemake understands this syntax.
- The generated pdfs refer to a user-defined selection of LEGEND data-taking runs. The list of runs is specified through the configuration file.
- The production can be restricted to a subset of simulations by passing a list of identifiers to Snakemake.
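For illustration only, a simconfig.json entry for the identifier above could look like the following sketch. The keys number_of_primaries, number_of_jobs and vertices are the ones consumed by the macro-generation script (see Code Snippets at the bottom of this page); the vertices value and the confinement_volume key are invented here, the latter standing for an arbitrary template-variable substitution (it would replace CONFINEMENT_VOLUME in the template macro):

{
  "l200a-hpge-bulk-2vbb": {
    "number_of_primaries": 100000000,
    "number_of_jobs": 100,
    "vertices": "l200a-hpge-bulk-vertices",
    "confinement_volume": "hpge_bulk"
  }
}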
Workflow steps (tiers)
Figure: representation of an example workflow for simID l200a-hpge-bulk-2vbb, 5 parallel jobs and runIDs l200-p04-r00{2,3}-phy. The first word in each box is the name of the corresponding Snakemake rule.
- Macro files are generated and written to disk according to rules defined in the metadata. These macros are in one-to-one correspondence with simulation jobs in the next tiers.
- Tier ver building: run simulations that generate the Monte Carlo event vertices needed by some simulations in the next tier. Simulations that do not need special event vertices start directly from tier raw.
- Tier raw building: run full event simulations.
- Tier hit building: run the first (hit-oriented) step of simulation post-processing. Here, "hit" stands for Geant4 "step". In this tier, step-wise operations like optical map application or step clustering are typically applied.
- Tier evt building: multiple operations are performed in order to build actual events and incorporate information about the data-taking runs for which the user wants to build pdfs:
  - Partition the hit event statistics into fractions corresponding to the total livetime fraction spanned by each selected run. This information is extracted from legend-metadata/dataprod/runinfo.json (see the sketch after this list).
  - Apply HPGe energy resolution functions for each selected run, found in the auto-generated data-production metadata.
  - Apply HPGe status flags (available in legend-metadata/hardware/config).
- Tier pdf building: summarize the evt-tier output into histograms (the pdfs).
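The livetime-based partitioning boils down to a proportional split. Here is a minimal sketch with invented run IDs, livetimes and event counts, mirroring the logic of scripts/make_run_partition_file.py (shown at the bottom of this page):

# sketch of the evt-tier event partitioning; all numbers are invented
livetimes = {  # run livetimes in seconds, as stored in runinfo.json
    "l200-p04-r002-phy": 595_000,
    "l200-p04-r003-phy": 420_000,
}
tot_events = 10_000_000  # total MC events available in the hit tier

tot_livetime = sum(livetimes.values())
# each run receives a share of events proportional to its livetime
weights = {run: int(tot_events * t / tot_livetime) for run, t in livetimes.items()}
# -> {'l200-p04-r002-phy': 5862068, 'l200-p04-r003-phy': 4137931}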
Setup

- Set up legend-prodenv to collect software dependencies and to instantiate and manage multiple production environments. Snakemake can be installed by following the instructions in legend-prodenv/README.md.
- Instantiate a new production cycle:

  > git clone git@github.com:legend-exp/legend-prodenv
  > cd legend-prodenv && source setup.sh
  > simprod-init-cycle <path-to-cycle-directory>

- Customize <path-to-cycle-directory>/config.json (see next section).
The configuration file

The config.json file in the production directory allows customizing the workflow in great detail. Here's a basic description of its fields:

- experiment: labels the experiment to be simulated. The same name is used in the metadata to label the corresponding configuration files.
- simlist: list of simulation identifiers (see below) to be processed by Snakemake. Can be a list of strings or a path to a text file. If * or all, all simulations defined in the metadata will be processed.
- runlist: list of LEGEND data-taking runs to build pdfs for, in the standard format <experiment>-<period>-<run>-<type> (e.g. l200-p03-r000-phy).
- benchmark: section used to configure a benchmarking run:
  - enabled: boolean flag to enable/disable the feature
  - n_primaries: number of primary events to be simulated in the lower tiers ver and raw.
- paths: customize paths to input or output files.
- filetypes: customize the file extensions to be used for input and output files generated by Snakemake.
- runcmd: defines, for each tier, the command to be executed in order to produce output files (with Snakemake wildcards, see the corresponding rule for more details). The scripts/MaGe.sh wrapper, which detects problems in the simulation output and exits with a proper code, should be used (and kept up to date) instead of the bare MaGe command. This allows Snakemake to better detect job failures (a minimal wrapper sketch follows the note below).
- execenv: defines the software environment (container) where all jobs should be executed (see below).
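To make the field list concrete, here is a purely illustrative config.json skeleton; all values are invented, and the exact structure of the paths, filetypes and runcmd sections is defined by the workflow and not spelled out in this README:

{
  "experiment": "l200a",
  "simlist": "all",
  "runlist": ["l200-p03-r000-phy"],
  "benchmark": {
    "enabled": false,
    "n_primaries": {"ver": 10000, "raw": 10000}
  },
  "paths": {"...": "..."},
  "filetypes": {"...": "..."},
  "runcmd": {"...": "..."},
  "execenv": ["..."]
}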
Note: all these configuration parameters can be overridden at runtime through Snakemake's --config option.
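As an aside, the failure-detection idea behind scripts/MaGe.sh can be sketched as follows. This is a hypothetical stand-in, not the actual wrapper: it runs the simulation, scans the log for the Geant4 exception banner (the same marker grepped by scripts/inspect_MaGe_logs.sh, shown at the bottom of this page) and exits non-zero so that Snakemake registers the job as failed:

#!/usr/bin/env bash
# hypothetical MaGe wrapper sketch: "$@" holds the arguments for MaGe
log="mage-job.log"
MaGe "$@" &> "$log" || exit 1
# fail even if MaGe exited cleanly but Geant4 reported an exception
if grep -q -- "G4Exception-START" "$log"; then
    exit 1
fi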
Production
Run a production by using one of the provided site-specific profiles (recommended):
> cd <path-to-cycle-directory>
> snakemake --profile workflow/profiles/<profile-name>
If no system-specific profiles are provided, the
--profile
option can be omitted.
Snakemake will use the
default
profile.
> snakemake
On a system providing
Scratch space
,
like NERSC, the
--shadow-prefix
option should be set to point to it:
> snakemake --shadow-prefix <path-to-scratch-area> [...]
The
--config
command line option is very useful to override configuration values.
It can be used, for example, to restrict the production to a subset of simulations:
> snakemake --config simlist="mylist.txt" [...]
where
mylist.txt
is a text file in the format:
raw.l200a-fibers-Ra224-to-Pb208
hit.l200a-hpge-bulk-2vbb
...
Alternatively, a comma-separated list can be passed directly:
> snakemake --config simlist="raw.l200a-fibers-Ra224-to-Pb208,hit.l200a-hpge-bulk-2vbb"
Once the production is over, the
print_stats
rule can be used to display a table with runtime statistics:
> snakemake -q all print_stats
wall time [s] wall time [s]
simid (cumulative) jobs (per job) primaries
----- ------------- ---- ------------- ---------
hit.l200a-fibers-Ra224-to-Pb208 83:20:00 100 0:50:00 1.00E+08
raw.l200a-fibers-Ra224-to-Pb208 58:20:35 100 0:35:00 1.00E+08
raw.l200a-fibers-Rn222-to-Po214 33:20:00 100 0:20:00 1.00E+08
... ... ... ... ...
The inspect_simjob_logs rule allows inspecting the simulation log files in search of warnings:
> snakemake inspect_simjob_logs
This can generate a lot of output; consider piping it to a file.
Find some useful Snakemake command-line options at the bottom of this page.
Warning
Geant4 macro files are marked as ancient as a workaround for the fact that they might have been re-generated (i.e. they have a more recent creation time) but with the same content as before (i.e. there is no need to re-run the simulations). Since Snakemake removes output files before re-generating them, there is no simple way to avoid overwriting files with unchanged content. The workaround lets users increase simulation statistics by running new jobs (identical to those already on disk). However, if the macro content is updated in the metadata, users will need to manually remove the output simulation files or force execution.
Benchmarking runs
This workflow can perform special "benchmarking" runs in order to evaluate the speed of simulations and tune the number of events to simulate for each simulation run.
- Create a new, dedicated production cycle (see above).
- Enable benchmarking in the configuration file and customize further settings.
- Start the production as usual.
Snakemake will spawn a single job (with the number of primary events specified in the
configuration file) for each simulation. Once the production is over, the results can
be summarized via the
print_benchmark_stats
rule:
> snakemake -q all print_benchmark_stats
simid CPU time [ms/ev] evts / 1h jobs (1h) / 10^8 evts
----- ---------------- --------- ---------------------
raw.l200a-birds-nest-K40 (13s) 2.79 1288475 77
raw.l200a-birds-nest-Ra224-to-Pb208 (191s) 38.33 93916 1064
raw.l200a-fiber-support-copper-Co60 (223s) 44.69 80558 1241
... ... ... ...
Note: the CPU time is a good measure of the actual simulation time, since other tasks (e.g. application loading) are typically not CPU intensive.
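As a worked example of how the table columns relate (pure arithmetic on the second row above; the small discrepancy comes from the printed CPU time being rounded):

cpu_ms_per_ev = 38.33                                    # CPU time per event
evts_per_1h_job = int(60 * 60 / (cpu_ms_per_ev / 1000))  # 93920, table: 93916
jobs_for_1e8 = int(1e8 / evts_per_1h_job)                # 1064 one-hour jobs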
NERSC-specific instructions
Setup
As an alternative to installing Snakemake through legend-prodenv's tools, NERSC's Mamba can be used:
> module load python
> export SWPREFIX="/global/common/software/m2676/<your user>"
> mamba create --prefix $SWPREFIX/.conda/snakemake -c conda-forge -c bioconda snakemake uproot panoptes-ui root
> mamba activate $SWPREFIX/.conda/snakemake
Note
To make proper use of LEGEND's Shifter containers, special permissions must be set on the input/simprod/config/MaGe directory and all its parents (see docs):

> setfacl -m u:nobody:x <path-to-cycle-directory>/input/simprod/MaGe
> setfacl -m u:nobody:x <path-to-cycle-directory>/input/simprod
> setfacl -m u:nobody:x <path-to-cycle-directory>/input
> setfacl -m u:nobody:x <path-to-cycle-directory>
...and further back if needed...

This is not needed if hosting the production below $PSCRATCH or if using Podman-HPC.
Production
Start the production on the interactive node (the default profile works fine):
> snakemake --shadow-prefix "$PSCRATCH"
Start the production on the batch nodes (via SLURM):
> snakemake --profile workflow/profiles/nersc-batch --shadow-prefix "$PSCRATCH"
Warning: this profile does not work as expected at the moment, see https://github.com/legend-exp/legend-simflow/issues/8. The temporary nersc-submit.sh script can be used instead. Note that the maximum runtime at NERSC is 12 hours, so jobs might be killed.

nersc-submit.sh usage:

Send a SLURM job for each simulation ID:
> cd <production dir>
> ./workflow/profiles/nersc-batch/nersc-submit.sh parallel

Dry run, to check what would be submitted (recommended):
> DRY_RUN=1 ./workflow/profiles/nersc-batch/nersc-submit.sh

Send a Snakemake execution as a single job:
> ./workflow/profiles/nersc-batch/nersc-submit.sh [SNAKEMAKE ARGS]
Useful Snakemake CLI options
usage: snakemake [OPTIONS] -- [TARGET ...]
--dry-run, --dryrun, -n
Do not execute anything, and display what would be done. If you have a very large workflow,
use --dry-run --quiet to just print a summary of the DAG of jobs.
--profile PROFILE Name of profile to use for configuring Snakemake.
--cores [N], -c [N] Use at most N CPU cores/jobs in parallel. If N is omitted or 'all', the limit is set to the
number of available CPU cores.
--config [KEY=VALUE ...], -C [KEY=VALUE ...]
Set or overwrite values in the workflow config object.
--configfile FILE [FILE ...], --configfiles FILE [FILE ...]
Specify or overwrite the config file of the workflow.
--touch, -t Touch output files (mark them up to date without really changing them) instead of running
their commands.
--forceall, -F Force the execution of the selected (or the first) rule and all rules it is dependent on
regardless of already created output.
--shadow-prefix DIR Specify a directory in which the 'shadow' directory is created. If not supplied, the value
is set to the '.snakemake' directory relative to the working directory.
--list, -l Show available rules in given Snakefile. (default: False)
--list-target-rules, --lt
Show available target rules in given Snakefile.
--summary, -S Print a summary of all files created by the workflow.
--printshellcmds, -p Print out the shell commands that will be executed. (default: False)
--quiet [{progress,rules,all} ...], -q [{progress,rules,all} ...]
Do not output certain information. If used without arguments, do not output any progress or
rule information. Defining 'all' results in no information being printed at all.
Running jobs in the LEGEND software container
with container-env
In
config.json
:
"execenv": [
"MESHFILESPATH=$_/inputs/simprod/MaGe/data/legendgeometry/stl_files",
"MAGERESULTS=$_/inputs/simprod/MaGe/data/legendgeometry",
"cenv", "legendsw"
]
where the
legendsw
environment has been previously created with
cenv --create legendsw <path-to-container>
and the environment variables are needed for MaGe to work.
with NERSC Shifter
In
config.json
:
"execenv": [
"shifter",
"--image legendexp/legend-software:latest",
"--volume $_/inputs/simprod/MaGe:/private",
"--env MESHFILESPATH=/private/data/legendgeometry/stl_files",
"--env MAGERESULTS=/private/data/legendgeometry"
]
with NERSC Podman-HPC
"execenv": [
"podman-hpc run",
"--volume $_/inputs/simprod/MaGe:/private", // mount private MaGe resources
"--env MESHFILESPATH=/private/data/legendgeometry/stl_files",
"--env MAGERESULTS=/private/data/legendgeometry",
"--volume $_:$_", // make production folder available in the container
"--volume $PSCRATCH:$PSCRATCH", // make scratch area visible too
"--workdir $$PWD", // podman-hpc does not automatically cd into cwd, unfortunately. NOTE: double $$
"docker.io/legendexp/legend-software:latest",
"--"
]
Code Snippets
script: "../scripts/print_simprod_stats.py"

script: "../scripts/print_benchmark_stats.py"

script: "../scripts/inspect_MaGe_logs.sh"
import json
import re
from pathlib import Path

import snakemake as smk
from utils import aggregate, patterns, utils

with Path(snakemake.input.cfgfile).open() as f:
    config = json.load(f)[snakemake.params.simid]

n_prim = config.pop("number_of_primaries")

n_macros = None
outver_list = None
is_benchmark = False

# determine whether external vertices are required
if "vertices" in config:
    # get list of ver output files
    outver_list = aggregate.gen_list_of_simid_outputs(
        snakemake.config, "ver", config.pop("vertices")
    )

    # if number of jobs is not specified, use number of vertices files
    if "number_of_jobs" not in config:
        n_macros = len(outver_list)

# if n_macros could not be determined, there MUST be an explicit reference to
# it in the config
if not n_macros:
    n_macros = config.pop("number_of_jobs")

if "benchmark" in snakemake.config:
    is_benchmark = snakemake.config["benchmark"].get("enabled", False)

if is_benchmark:
    n_macros = 1
    n_prim = snakemake.config["benchmark"]["n_primaries"][snakemake.params.tier]

# prepare global substitution rules
substitutions = {"NUMBER_OF_PRIMARIES": int(n_prim / n_macros)}

# now that we processed the "special" configuration fields, we interpret all
# the others as variables to be expanded in the template macro
for k, v in config.items():
    if isinstance(v, str):
        substitutions.update({k.upper(): v})
    elif isinstance(v, list) and all(isinstance(s, str) for s in v):
        substitutions.update({k.upper(): "\n".join(v)})
    else:
        print(f"WARNING: key '{k}' does not define a valid substitution rule")

# first substitute global variables
with Path(snakemake.input.template).open() as f:
    text = utils.subst_vars(f.read().strip(), substitutions, ignore_missing=True)

# if benchmark run, write all events to disk, not only those that deposit
# energy in the detectors. TODO: hardcoding application-specific commands
# is not nice here
if is_benchmark:
    text = re.sub(
        r"\n */MG/io/MCRun/setWriteEventsThatDepositEnergyInGe +true *\n",
        "\n",
        text,
    )

# then substitute macro-specific variables
for i in range(n_macros):
    # determine output file name for this macro
    outname = patterns.output_simjob_filename(
        snakemake.config,
        tier=snakemake.params.tier,
        simid=snakemake.params.simid,
        jobid=f"{i:>04d}",
    )
    substitutions.update({"OUTPUT_FILE": outname})

    # might also need to set a vertices file name (see above)
    if outver_list:
        substitutions.update({"VERTICES_FILE": outver_list[i]})

    # determine the macro file name for write out
    inname = Path(
        patterns.input_simjob_filename(
            snakemake.config,
            tier=snakemake.params.tier,
            simid=snakemake.params.simid,
            jobid=f"{i:>04d}",
        )
    )

    text_out = utils.subst_vars(text, substitutions).strip()

    # check if the file exists and open it
    # NOTE: unfortunately, this doesn't produce the desired effect with
    # Snakemake, as it always removes output files before regenerating them:
    # https://stackoverflow.com/questions/49178033/snakemake-avoid-removing-output-files-before-executing-the-shell-command
    if inname.is_file():
        with inname.open() as f:
            # if file content is already the expected one, do not overwrite it
            if f.read().strip() == text_out:
                continue

    # otherwise, prepare for writing and write
    smk.utils.makedirs(str(inname.parent))
    with inname.open("w") as f:
        f.write(text_out)
logs="${snakemake_params[logdir]}"

function it() {
    code=0
    grep -A 15 -i -- "-------- WWWW ------- G4Exception-START -------- WWWW -------" "$1" && code=1
    grep -P "^Warning:\w+" "$1" && code=1

    if [ "$code" != "0" ]; then
        echo
        echo -e "\033[1m^^^^^ $1 ^^^^^\033[0m"
        echo
    fi
}

find "$logs" -regextype "egrep" -regex ".*(ver|raw)/.+/.+.log" \
    | while read -r file; do it "$file"; done
import json
from pathlib import Path

import uproot
from utils import utils

with Path(snakemake.input.runinfo).open() as f:
    rinfo = json.load(f)

# retrieve run livetimes
runs = utils.get_some_list(list(set(snakemake.config["runlist"])))
spec = [r.split("-") for r in runs]
livetimes = [rinfo[p[1]][p[2]][p[3]]["livetime_in_s"] for p in spec]

# get total number of mc events from hit files
print(
    "INFO: computing total number of MC events for simulation "
    + snakemake.wildcards.simid
)
file_evts = uproot.num_entries(
    [f"{file}:simTree" for file in snakemake.input.hit_files]
)

tot_events = 0
for file in file_evts:
    tot_events += file[-1]

# write json file with weights for each run
with Path(snakemake.output[0]).open("w") as f:
    json.dump(
        dict(zip(runs, [int(tot_events * t / sum(livetimes)) for t in livetimes])),
        f,
        indent=2,
    )
import os
from pathlib import Path

# TODO: generate the files from metadata inplace here

# this script is run for each data taking run
# snakemake.input[0] points to the legend-metadata clone in legend-simflow
# snakemake.output[0] is the output filename

src = (
    Path(snakemake.input[0])
    / "simprod"
    / "config"
    / "tier"
    / "evt"
    / snakemake.config["experiment"]
    / f"{snakemake.wildcards.runid}-build_evt.json"
)

dest = snakemake.output[0]

os.system(f"cp {src} {dest}")
import csv
from pathlib import Path


def printline(*line):
    print("{:<50}{:>16}{:>11}{:>23}".format(*line))


printline("simid", "CPU time [ms/ev]", "evts / 1h", "jobs (1h) / 10^8 evts")
printline("-----", "----------------", "---------", "---------------------")

bdir = Path(snakemake.config["paths"]["benchmarks"])

for simd in sorted(bdir.glob("*/*")):
    if simd.parent.name not in ("ver", "raw"):
        continue

    data = {"cpu_time": 0}
    for jobd in simd.glob("*.tsv"):
        with jobd.open(newline="") as f:
            this_data = list(csv.DictReader(f, delimiter="\t"))[0]
            data["cpu_time"] += float(this_data["cpu_time"])

    speed = (
        data["cpu_time"]
        / snakemake.config["benchmark"]["n_primaries"][simd.parent.name]
    )
    evts_1h = int(60 * 60 / speed) if speed > 0 else "IN PROGRESS"
    njobs = int(1e8 / evts_1h) if not isinstance(evts_1h, str) else 0

    printline(
        simd.parent.name + "." + simd.name,
        "({:}s) {:.2f}".format(int(data["cpu_time"]), 1000 * speed),
        evts_1h,
        njobs,
    )
import csv
import json
from datetime import timedelta
from pathlib import Path

from utils import patterns


def printline(*line):
    print("{:<50} {:>20} {:>6} {:>14} {:>10}".format(*line))


printline(" ", "wall time [s]", " ", "wall time [s]", " ")
printline("simid", " (cumulative)", "jobs", " (per job)", "primaries")
printline("-----", "-------------", "----", "-------------", "---------")

bdir = Path(snakemake.config["paths"]["benchmarks"])

tot_wall_time = 0

for simd in sorted(bdir.glob("*/*")):
    njobs = 0
    data = {"wall_time": 0}
    for jobd in simd.glob("*.tsv"):
        njobs += 1
        with jobd.open(newline="") as f:
            this_data = list(csv.DictReader(f, delimiter="\t"))[0]
            data["wall_time"] += float(this_data["s"])

    tot_wall_time += data["wall_time"]

    if njobs == 0:
        continue

    tier = simd.parent.name if simd.parent.name in ("ver", "raw") else "raw"
    tdir = patterns.template_macro_dir(snakemake.config, tier=tier)

    with (tdir / "simconfig.json").open() as f:
        config = json.load(f)[simd.name]

    nprim = config["number_of_primaries"]

    printline(
        simd.parent.name + "." + simd.name,
        str(timedelta(seconds=int(data["wall_time"]))),
        njobs,
        str(timedelta(seconds=int(data["wall_time"] / njobs))),
        f"{nprim:.2E}",
    )

print("\nTotal wall time:", str(timedelta(seconds=int(tot_wall_time))))
script: "scripts/generate_macros.py"

script: "scripts/make_tier_evt_config_file.py"

script: "scripts/make_run_partition_file.py"