A simple and entertaining introduction to Snakemake workflows.

Snakespeare is a simple and entertaining text mining workflow designed for first-time Snakemake users and developers.

Snakespeare learning goals

This tutorial is ideal for beginners, including folks who have never used the terminal before.

After working through this tutorial, you will learn:

How to access the terminal on your computer (Windows, Mac, or Linux)
How to clone a repository from GitHub
How to build and activate a conda environment
How to run a Snakemake workflow and view the results

Diving deeper

In addition, this repository is a simple demonstration of Snakemake for workflow developers.

The moving parts of Snakespeare are identical to the Snakemake pipelines I use every day for bioinformatics work:

Snakefile – contains rules for all steps of workflow
config.yaml – parameters users can customize are listed here
environment.yaml – lists software dependencies to be installed into conda virtual environment
scripts/ – all Python and R scripts live in this directory
data/ – all input and output files live in this directory

Snakespeare results

This workflow calculates and plots how much characters speak in Shakespeare's tragedies Hamlet and Romeo & Juliet.

Example output from Snakespeare workflow.

Interesting statistics

Hamlet talks the most with over 1428 lines of iambic pentameter.
Hamlet's uncle Claudius talks the second most with over 500 lines of iambic pentameter. It must run in the family.
Besides the Chorus in Romeo and Juliet, the Ghost of King Hamlet is the most long-winded with an average speech length of 6.3 lines.
Friar Lawrence is a close second with an average speech length of 6.2 lines.
Romeo talks slightly more than Juliet (however, Juliet's lines are wittier).
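
The average speech length quoted above is a character's total line count divided by their number of speeches (the workflow's R script computes avg_speech_length as total_lines / num_speeches). A minimal sketch of that arithmetic follows; the counts are hypothetical placeholders, not real workflow output.

```python
def avg_speech_length(total_lines: int, num_speeches: int) -> float:
    """Average number of lines per speech for one character."""
    if num_speeches == 0:
        raise ValueError("character has no speeches")
    return total_lines / num_speeches


# Hypothetical counts for illustration only (not real workflow output)
example = {"total_lines": 63, "num_speeches": 10}
print(round(avg_speech_length(**example), 1))  # 6.3
```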

Usage

STEP 1: Install miniconda and git

To run Snakespeare, you will need two pieces of software: git and conda.

git is a tool for downloading the code for Snakespeare from GitHub.
conda is a tool for accessing all software dependencies (including R, Python, and Snakemake).

All software dependencies will be installed into a "virtual environment," so Snakespeare will not conflict with any Python or R software you have set up already.
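
If a Python interpreter is already available on your machine, one quick way to see whether git and conda can be found is to look them up on your PATH. This is only a sketch: the helper name find_tools is made up for illustration, and in some shells conda is loaded as a shell function rather than an executable, so a "not found" result here does not prove that conda is missing.

```python
import shutil


def find_tools(names=("git", "conda")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        status = path if path else "not found (see the install steps for your OS)"
        print(f"{tool}: {status}")
```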

Instructions for Windows

Run Snakespeare via Anaconda prompt (easiest for beginning users)

Installing Miniconda3 + Anaconda Prompt for Windows

Head over to the Anaconda website and download a Windows installer for Miniconda3.

If you are not sure which to choose, pick the highest version of Python.

You can check whether your system is 64-bit or 32-bit under Settings > System > About > Device specifications > System type.

Run the installer and follow the instructions to complete the installation. This software bundle includes Miniconda3 as well as Anaconda Prompt, which is a terminal app that you can use to run Snakespeare.

Open Anaconda Prompt

Now click the Start menu and search for "Anaconda Prompt." This is a modified version of Windows Command Prompt (cmd.exe) that is pre-loaded with the conda executable.

Installing Git in Anaconda Prompt

In Anaconda prompt, copy and paste the following to install git:

conda install -y git

That's it! Continue to STEP 2.

Run Snakespeare via Git Bash (good for beginning users)

Installing Miniconda3 for Windows

Head over to the Anaconda website and download a Windows installer for Miniconda3.

If you are not sure which to choose, pick the highest version of Python.

You can check whether your system is 64-bit or 32-bit under Settings > System > About > Device specifications > System type.

Run the installer and follow the instructions to complete your installation of Miniconda3.

Installing Git + Git Bash for Windows

Head to the git website and download an installer for Windows. Run the installer and follow the instructions to complete the installation. This software bundle includes git as well as Git Bash, which is a terminal app that you can use to run Snakespeare.

Important: While installing Git for Windows, be sure to check the box to add "Git Bash here" to the File Explorer context menu. You'll need it for the next step.

Enabling Conda in Git Bash

To enable Conda within Git Bash, you'll need to add the Conda startup script to your ~/.bashrc file, which executes every time you open Git Bash.

From the Start menu, search for "Miniconda3" and click "Open File Location." Within that folder, navigate to etc and then profile.d. You should see a file called conda.sh in this folder. Right-click inside the window and select "Git Bash here" to open a terminal window in this folder.

Run the following command to add the Conda startup script to your ~/.bashrc:

echo ". '${PWD}'/conda.sh" >> ~/.bashrc

After that, close the terminal window.

Finally, let's double-check that conda is working in your new Git Bash terminal. From the Start menu, open Git Bash. Type conda and press Enter. If a bunch of text appears (these are the usage instructions for conda), congratulations, you're all set up! Continue to STEP 2.

Run Snakespeare via Windows Subsystem for Linux (advanced users)

If you are already using Windows Subsystem for Linux, follow the instructions below to install miniconda and git in your Ubuntu terminal.

Installing Git in WSL

Head to the git website and follow the installation instructions for Ubuntu.

Installing Miniconda in WSL

Head to the Anaconda website for instructions to download and run a Miniconda installer for Linux.

After installing git and miniconda, close any terminal windows you have open and continue to STEP 2.

Instructions for Mac

Installing Git for Mac

On your Mac, open Terminal. Type git and press Enter.

If a bunch of text appears (these are the usage instructions for git), congratulations, you already have git installed! Skip to Installing Miniconda for Mac.
If you see git: command not found, then you will need to get git for Mac. The easiest method is to install Xcode, which is a suite of developer tools provided by Apple.
After installing Xcode, open a new terminal window and try typing git again. You should see the usage instructions now.

If you still see git: command not found, please let me know so I can help.

Installing Miniconda for Mac

To get Miniconda for Mac, download an installer from the Anaconda website.
If you are not sure which to choose, download the Python 3.9 Miniconda3 MacOSX 64-bit pkg.
Run the installer that just downloaded, and follow the instructions to complete your installation of Miniconda.

Done! Make sure you close any terminal windows that you have open, then continue to STEP 2.

Instructions for Linux

Linux desktop users

Installing Git for Linux

Head to the git website for instructions to install git with your distribution's package manager.

Installing Miniconda for Linux

Head to the Anaconda website for instructions to download and run a Miniconda installer.

After installing git and miniconda, close any terminal windows you have open and continue to STEP 2.

Linux server users

If you would like to run Snakespeare on a work or lab server, check with your supervisor or sysadmin to see if git and conda are installed already. If so, continue to STEP 2.

Otherwise, if you need to install software (and have permission to do so), follow the instructions below.

Installing Miniconda on a Linux Server

To install miniconda from the command line:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

The installer will ask you some questions to complete installation. Review and accept the license, accept or change home location, and answer yes to placing it in your path.

To finish configuring miniconda:

source $HOME/.bashrc

Note: If your home folder is not writable on your server, conda will crash. If you experience this issue, run these commands to tell conda to store the environment in the current folder.

conda config --add envs_dirs ./.conda/envs
conda config --add pkgs_dirs ./.conda/pkgs

Installing Git on a Linux Server

To install git:

conda install git

STEP 2: Clone the repository

Open a new terminal window and navigate to where you want to download Snakespeare.

If you are not sure, I recommend you copy and paste these commands to make a new directory called GitHub_repos, then "change directory" into the folder:

mkdir GitHub_repos
cd GitHub_repos

Copy and paste these commands to clone this repository and then "change directory" into the folder.

git clone https://github.com/lisakmalins/Snakespeare.git
cd Snakespeare

STEP 3: Build and activate the conda environment

When you build the conda environment, Conda obtains all the software listed in environment.yaml. You only need to do this step once.

conda env create -f environment.yaml

Finally, you will need to activate the environment. The environment is named "snakespeare," and the software will only be accessible while the environment is active.

conda activate snakespeare

Note: for older versions of Anaconda, you may need to use the command source activate snakespeare instead.

When you want to deactivate the environment later, you can do so with the command conda deactivate.
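
To confirm that the environment was built, you can ask conda for its list of environments in JSON form (conda env list --json) and look for one named snakespeare. The sketch below is one possible way to do that from Python; the helper names env_names and has_env are hypothetical, and the paths in the test data are made up. Note that the base environment shows up under the name of its install folder.

```python
import json
import re
import subprocess


def env_names(conda_json: str) -> list:
    """Return environment folder names from `conda env list --json` output."""
    paths = json.loads(conda_json).get("envs", [])
    # Split on both "/" and "\" so Windows paths work as well
    return [re.split(r"[\\/]", p.rstrip("/\\"))[-1] for p in paths]


def has_env(conda_json: str, name: str = "snakespeare") -> bool:
    """True if an environment folder with this name is listed."""
    return name in env_names(conda_json)


if __name__ == "__main__":
    try:
        out = subprocess.run(
            ["conda", "env", "list", "--json"],
            capture_output=True, text=True, check=True,
        ).stdout
        print("snakespeare environment found:", has_env(out))
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("Could not run conda from this interpreter.")
```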

STEP 4: Run Snakespeare

Run the snakemake workflow like this:

snakemake

That's it! The workflow should finish within a few seconds. The output plot showing all dialogue statistics will appear in the folder Snakespeare/data/plots/.
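
If you prefer to check for the results from a script rather than a file browser, a short sketch like the one below lists whatever files are in the plot folder. It assumes you run it from inside the Snakespeare directory, so the folder is data/plots; the helper name list_plots is made up for illustration.

```python
from pathlib import Path


def list_plots(plot_dir="data/plots"):
    """Return the files in the plot folder, sorted by name."""
    folder = Path(plot_dir)
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.iterdir() if p.is_file())


if __name__ == "__main__":
    plots = list_plots()
    if plots:
        for plot in plots:
            print(plot)
    else:
        print("No plots found yet. Has the workflow finished?")
```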

Code Snippets

suppressPackageStartupMessages(library(readr))
# dplyr is required for most of the dataframe manipulation.
suppressPackageStartupMessages(library(dplyr))

##-------------------- Read arguments --------------------##
args <- commandArgs(trailingOnly = TRUE)
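# All arguments except the last two are input files
input <- head(args, -2)
# The second-to-last argument is the play name; the last is the output file
play <- args[length(args) - 1]
output <- args[length(args)]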

##-------------------- Determine name of metrics --------------------##
# Determine metric from each filename by removing name of play
metrics <- sapply(input, function(f) {
  sub(paste0(".*", play, "_?"), "", sub("\\..*", "", f))
})

##-------------------- Read data --------------------##
data <- list()
for (f in input) {
  # Read data from file
  data[[f]] <-
    read_tsv(f,
             col_names=c("character", metrics[[f]]),
             col_types=cols())
}

##-------------------- Join data --------------------##
# Full outer join data together
all_data <-
  # Use full outer join to combine speeches and lines into one table
  data[[input[1]]] %>%
  full_join(data[[input[2]]], by="character") %>%
  # Determine average lines per speech
  mutate(avg_speech_length=total_lines/num_speeches) %>%
  # Sort descendingly by first metric
  arrange(desc(.[[2]])) %>%
  # Add column for name of play
  mutate(play=play)
print(all_data)

##-------------------- Write output --------------------##
write_tsv(all_data, output)

R dplyr readr From line 9 of scripts/calculate_speech_length.R
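
The two nested sub() calls in the R snippet above derive a metric name from each input file name: the inner call drops everything from the first "." onward, and the outer call drops everything up to and including the play name and an optional underscore. A rough Python equivalent is sketched below to make that easier to follow. The file names are hypothetical, and, as in the R code, the play name is inserted into the pattern without escaping.

```python
import re


def metric_from_filename(path: str, play: str) -> str:
    """Mirror the nested sub() calls from the R snippet."""
    # Drop everything from the first "." onward (the file extension)
    stem = re.sub(r"\..*", "", path, count=1)
    # Drop everything up to and including the play name and an optional "_"
    return re.sub(".*" + play + "_?", "", stem, count=1)


# Hypothetical file names, for illustration only
print(metric_from_filename("data/hamlet_total_lines.tsv", "hamlet"))   # total_lines
print(metric_from_filename("data/hamlet_num_speeches.tsv", "hamlet"))  # num_speeches
```

One consequence of this approach is that a path containing an earlier "." (for example one starting with "./") would be cut at that first dot.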

from collections import defaultdict

# Find start of play (skip title and character list)
def FindPlayStart(lines):
    for i in range (0, len(lines)):
        if lines[i][:5] == "SCENE":
            return i
    return -1

# Read in list of characters
with open(snakemake.input[1], 'r') as characters_input:
    characters = [l.strip() for l in characters_input.readlines()]

# Count each character's line blocks
speeches_by_character = defaultdict(int)

with open(snakemake.input[0], 'r') as play_input:
    lines = play_input.readlines()

# Fall back to the first line if no "SCENE" marker is found (FindPlayStart returns -1)
i = max(FindPlayStart(lines), 0)
while i < len(lines):
    # Get line and strip to compare with characters list
    line = lines[i].rstrip('\n').rstrip('.').lstrip()
    if line in characters:
        speeches_by_character[line] += 1
    i += 1

# Output to file
with open(snakemake.output[0], 'w') as output:
    for c, f in speeches_by_character.items():
        output.write(f"{c}\t{f}\n")

Python From line 6 of scripts/count_speeches.py
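
The script above depends on the snakemake object that Snakemake injects at run time, so it cannot be run on its own. The sketch below applies the same matching idea (strip the trailing period and surrounding whitespace, then compare against the character list) to a small in-memory excerpt. The function name count_speeches and the excerpt's formatting are assumptions made for illustration.

```python
from collections import defaultdict


def count_speeches(lines, characters):
    """Count how often each character name appears alone on a line."""
    counts = defaultdict(int)
    for raw in lines:
        name = raw.rstrip("\n").rstrip(".").lstrip()
        if name in characters:
            counts[name] += 1
    return dict(counts)


# Toy excerpt with assumed formatting: a name line, then the speech
toy_play = [
    "SCENE I.\n",
    "\n",
    "HAMLET.\n",
    "To be, or not to be.\n",
    "\n",
    "OPHELIA.\n",
    "Good my lord.\n",
    "\n",
    "HAMLET.\n",
    "I humbly thank you.\n",
]
print(count_speeches(toy_play, ["HAMLET", "OPHELIA"]))
# {'HAMLET': 2, 'OPHELIA': 1}
```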

from collections import defaultdict

# Find start of play (skip title and character list)
def FindPlayStart(lines):
    for i in range (0, len(lines)):
        if lines[i][:5] == "SCENE":
            return i
    return -1

# Read in list of characters
with open(snakemake.input[1], 'r') as characters_input:
    characters = [l.strip() for l in characters_input.readlines()]

# Count each character's lines
lines_by_character = defaultdict(int)
with open(snakemake.input[0], 'r') as play_input:
    lines = play_input.readlines()

# Fall back to the first line if no "SCENE" marker is found (FindPlayStart returns -1)
i = max(FindPlayStart(lines), 0)
while i < len(lines):
    # Get line and strip to compare with characters list
    line = lines[i].rstrip('\n').rstrip('.').lstrip()
    if line in characters:
        # If char name found, count how many lines follow
        while True:
            i += 1
            # Stop if i out of range (end of play)
            # or if newline encountered (end of line block)
            if i >= len(lines) or lines[i] == "\n":
                break
            # Increment num lines for this character
            lines_by_character[line] += 1
    else:
        i += 1

# Output to file
with open(snakemake.output[0], 'w') as output:
    for c, f in lines_by_character.items():
        output.write(c + "\t" + str(f) + '\n')

Python From line 6 of scripts/count_total_lines.py
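
To see the line-counting rule in isolation, here is a small sketch that runs on an in-memory excerpt instead of the snakemake input files. It follows the same rule as the script above: after a character name, every line is credited to that character until a blank line or the end of the text. The function name count_lines and the excerpt's formatting are assumptions.

```python
from collections import defaultdict


def count_lines(lines, characters):
    """Credit each line after a character name to that character."""
    counts = defaultdict(int)
    i = 0
    while i < len(lines):
        name = lines[i].rstrip("\n").rstrip(".").lstrip()
        i += 1
        if name in characters:
            # Count lines until a blank line or the end of the text
            while i < len(lines) and lines[i] != "\n":
                counts[name] += 1
                i += 1
    return dict(counts)


# Toy excerpt with assumed formatting
toy_play = [
    "HAMLET.\n",
    "To be, or not to be, that is the question:\n",
    "Whether 'tis nobler in the mind to suffer\n",
    "\n",
    "OPHELIA.\n",
    "Good my lord,\n",
    "\n",
]
print(count_lines(toy_play, ["HAMLET", "OPHELIA"]))
# {'HAMLET': 2, 'OPHELIA': 1}
```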

suppressPackageStartupMessages(library(readr))
# dplyr is required for most of the dataframe manipulation.
suppressPackageStartupMessages(library(dplyr))
# tidyr is required specifically for the pivot_longer function.
suppressPackageStartupMessages(library(tidyr))
# ggplot2 is required for plotting the data.
suppressPackageStartupMessages(library(ggplot2))
# yaml is required for reading the configuration file (config.yaml).
suppressPackageStartupMessages(library(yaml))

##-------------------- Read arguments --------------------##
args <- commandArgs(trailingOnly = TRUE)
# All arguments except the last are input files; the last is the output file
input <- head(args, -1)
output <- args[length(args)]

##-------------------- Read config.yaml --------------------##
config <- read_yaml("config.yaml")

##-------------------- Prepare list of plays and metrics --------------------##
plays <- config[["plays"]]
metrics <- config[["metrics"]]

##-------------------- Read data --------------------##
# Concatenate data for all plays into a single table
all_data_rbind <-
  do.call("rbind", lapply(input, function(f) {
    read_tsv(f, col_names=TRUE, col_types=cols()) %>%
    # Set factor levels of "character" column to the reverse of the order in
    # which they appear, so the first character ends up at the top after coord_flip()
    mutate(character = factor(character, levels=rev(character)))
  }))

# Filter out characters with few lines
if ("total_lines_cutoff" %in% names(config)) {
  total_lines_cutoff <- config[["total_lines_cutoff"]]
} else {
  total_lines_cutoff <- 25
}

print(paste("Characters with", total_lines_cutoff, "or fewer total lines will be excluded."))
print("The following rows will be dropped:")
all_data_rbind %>%
  filter(total_lines <= total_lines_cutoff)

all_data_rbind <-
  all_data_rbind %>%
  filter(total_lines > total_lines_cutoff)

# Pivot data from "wide" into "long" format to prepare for `facet_wrap`
all_data_longer <-
  all_data_rbind %>%
  pivot_longer(cols=all_of(names(metrics)),
               names_to="metric",
               values_to="value")

# Set factor levels to control order of facets in facet-wrapped plot
all_data_longer$metric <-
  factor(all_data_longer$metric, levels = names(metrics))

##-------------------- Plot data --------------------##
ggplot(all_data_longer,
       aes(x=character, y=value)) +
  geom_col() +
  # Add title but suppress x- and y- axis labels
  labs(y=NULL,
       x=NULL,
       title=paste("Dialogue statistics for Shakespeare characters")) +
  # Grid plots by play and metric
  facet_grid(rows=vars(play),
             cols=vars(metric),
             scales="free",
             # Use more descriptive labels prepared earlier
             labeller=labeller(metric = unlist(metrics),
                               play = unlist(plays))) +
  # Make the bar graph horizontal to more easily read character names
  coord_flip() +
  # Add number label floating next to bars
  geom_text(aes(label=round(value, 1)), hjust = -0.2, color="#333333") +
  # Add a bit of extra space for the labels
  # Vector is c(lower mult, lower add, upper mult, upper add)
  scale_y_discrete(expand=c(0, 0, 0.2, 0)) +
  # Tweak theme
  theme_bw() +
  theme(text=element_text(size=18)) +
  theme(plot.title=element_text(hjust=0.5)) +
  theme(plot.caption=element_text(hjust=0, color="gray30"))

# Save plot
ggsave(
  output,
  height=8.5,
  width=11,
  plot=last_plot())

R ggplot2 dplyr tidyr readr yaml From line 5 of scripts/plot_all_metrics.R
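
Before plotting, the R script above does two data-shaping steps: it drops characters whose total_lines is at or below the cutoff (25 when config.yaml does not set total_lines_cutoff), and it pivots the table from wide to long so each row holds one metric value. The plain-Python sketch below imitates those two steps on made-up rows; the function name and all values are hypothetical.

```python
def filter_and_pivot(rows, metrics, cutoff=25):
    """Drop rows at or below the cutoff, then reshape wide -> long."""
    kept = [r for r in rows if r["total_lines"] > cutoff]
    return [
        {"character": r["character"], "play": r["play"],
         "metric": m, "value": r[m]}
        for r in kept
        for m in metrics
    ]


# Hypothetical rows, for illustration only
rows = [
    {"character": "A", "play": "p", "total_lines": 100, "num_speeches": 20},
    {"character": "B", "play": "p", "total_lines": 10, "num_speeches": 5},
]
for row in filter_and_pivot(rows, ["total_lines", "num_speeches"]):
    print(row)
```

Character B is dropped because 10 is not above the cutoff, and character A contributes one long-format row per metric.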

script:
    "scripts/count_speeches.py"

SnakeMake From line 28 of master/Snakefile

script:
    "scripts/count_total_lines.py"

SnakeMake From line 38 of master/Snakefile

shell:
    "Rscript scripts/calculate_speech_length.R {input} \"{wildcards.play}\" {output}"

SnakeMake From line 50 of master/Snakefile

shell:
    "Rscript scripts/plot_all_metrics.R {input} {output}"

SnakeMake From line 59 of master/Snakefile
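
In the shell rules above, Snakemake expands {input} into a space-separated list of input files, so calculate_speech_length.R receives its arguments in the order inputs, play name, output. The sketch below shows the same positional convention in Python; the function name split_args and the file paths are hypothetical.

```python
def split_args(args):
    """Split [input..., play, output] by position."""
    if len(args) < 3:
        raise ValueError("expected at least one input, a play name, and an output")
    *inputs, play, output = args
    return inputs, play, output


# Hypothetical command-line arguments, for illustration only
argv = [
    "data/hamlet_num_speeches.tsv",
    "data/hamlet_total_lines.tsv",
    "hamlet",
    "data/hamlet_summary.tsv",
]
inputs, play, output = split_args(argv)
print(inputs)  # ['data/hamlet_num_speeches.tsv', 'data/hamlet_total_lines.tsv']
print(play)    # hamlet
print(output)  # data/hamlet_summary.tsv
```

Passing the play name as its own argument keeps the script independent of any particular file-naming scheme for recovering it.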

Created: 1yr ago

Updated: 1yr ago

Maintainers: public

URL: https://github.com/lisakmalins/Snakespeare

Name: snakespeare

Version: 1

License: None

Keywords:

Snakemake dplyr ggplot2 readr tidyr yaml
