Snakespeare is a simple and entertaining text mining workflow designed for first-time Snakemake users and developers.
Snakespeare learning goals
This tutorial is ideal for beginners, including folks who have never used the terminal before.
After working through this tutorial, you will learn:
- How to access the terminal on your computer (Windows, Mac, or Linux)
- How to clone a repository from GitHub
- How to build and activate a conda environment
- How to run a Snakemake workflow and view the results
Diving deeper
In addition, this repository is a simple demonstration of Snakemake for workflow developers.
The moving parts of Snakespeare are identical to the Snakemake pipelines I use every day for bioinformatics work:
- Snakefile – contains rules for all steps of workflow
- config.yaml – parameters users can customize are listed here
- environment.yaml – lists software dependencies to be installed into conda virtual environment
- scripts/ – all Python and R scripts live in this directory
- data/ – all input and output files live in this directory
Snakespeare results
This workflow calculates and plots how much characters speak in Shakespeare's tragedies Hamlet and Romeo & Juliet.
Interesting statistics
- Hamlet talks the most with over 1428 lines of iambic pentameter.
- Hamlet's uncle Claudius talks the second most with over 500 lines of iambic pentameter. It must run in the family.
- Besides the Chorus in Romeo and Juliet, the Ghost of King Hamlet is the most long-winded with an average speech length of 6.3 lines.
- Friar Lawrence is a close second with an average speech length of 6.2 lines.
- Romeo talks slightly more than Juliet (however, Juliet's lines are wittier).
Usage
STEP 1: Install miniconda and git
To run Snakespeare, you will need two pieces of software: git and conda.
- git is a tool for downloading the code for Snakespeare from GitHub.
- conda is a tool for accessing all software dependencies (including R, Python, and Snakemake).
All software dependencies will be installed into a "virtual environment," so Snakespeare will not conflict with any Python or R software you have set up already.
Instructions for Windows
Run Snakespeare via Anaconda prompt (easiest for beginning users)
Installing Miniconda3 + Anaconda Prompt for Windows
Head over to the Anaconda website and download a Windows installer for Miniconda3.
Run the installer and follow the instructions to complete the installation. This software bundle includes Miniconda3 as well as Anaconda Prompt, which is a terminal app that you can use to run Snakespeare.
Open Anaconda Prompt
Now click the Start menu and search for "Anaconda prompt." This is a modified version of Windows Command Prompt.
Installing Git in Anaconda Prompt
In Anaconda prompt, copy and paste the following to install git:
conda install -y git
That's it! Continue to STEP 2.
Run Snakespeare via Git Bash (good for beginning users)
Installing Miniconda3 for Windows
Head over to the Anaconda website and download a Windows installer for Miniconda3.
Run the installer and follow the instructions to complete your installation of Miniconda3.
Installing Git + Git Bash for Windows
Head to the git website and download an installer for Windows. Run the installer and follow the instructions to complete the installation. This software bundle includes git as well as Git Bash, which is a terminal app that you can use to run Snakespeare.
Important: While installing Git for Windows, be sure to check the box to add "Git Bash here" to the File Explorer context menu. You'll need it for the next step.
Enabling Conda in Git Bash
To enable Conda within Git Bash, you'll need to add the Conda startup script to your
From the Start menu, search for "Miniconda3" and click "Open File Location." Within that folder, navigate to
Run the following command to add the Conda startup script to your
echo ". '${PWD}'/conda.sh" >> ~/.bashrc
After that, close the terminal window.
Finally, let's double-check that conda is working in your new Git Bash terminal. From the Start menu, open Git Bash. Type
Run Snakespeare via Windows Subsystem for Linux (advanced users)
If you are already using Windows Subsystem for Linux, follow the instructions below for how to install miniconda and git in your Ubuntu terminal.
Installing Git in WSL
Head to the git website and follow the installation instructions for Ubuntu.
Installing Miniconda in WSL
Head to the Anaconda website for instructions to download and run a Miniconda installer for Linux.
After installing git and miniconda, close any terminal windows you have open and continue to STEP 2.
Installing Git for Mac
On your Mac, open Terminal. Type
Installing Miniconda for Mac
Done! Make sure you close any terminal windows that you have open, then continue to STEP 2.
Linux desktop users
Installing Git for Linux
Head to the git website for instructions to install git with your distribution's package manager.
Installing Miniconda for Linux
Head to the Anaconda website for instructions to download and run a Miniconda installer.
After installing git and miniconda, close any terminal windows you have open and continue to STEP 2.
Linux server users
If you would like to run Snakespeare on a work or lab server, check with your supervisor or sysadmin to see if git and conda are installed already. If so, continue to STEP 2. Otherwise, if you need to install software (and have permission to do so), follow the instructions below.
Installing Miniconda on a Linux Server
To install miniconda from the command line:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
The installer will ask you some questions to complete installation. Review and accept the license, accept or change home location, and answer yes to placing it in your path.
To finish configuring miniconda:
source $HOME/.bashrc
Installing Git on a Linux Server
To install git:
conda install git
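Before moving on, it can be handy to confirm that both tools are visible from a fresh terminal. The snippet below is a minimal sketch and is not part of Snakespeare; the helper name find_tools is made up for illustration. It uses Python's standard shutil.which to look up each command on your PATH.

```python
import shutil


def find_tools(names=("git", "conda")):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        # A missing tool usually means the terminal was not reopened
        # after installation, or the installer did not update PATH.
        status = path if path else "not found (reopen your terminal or revisit STEP 1)"
        print(f"{name}: {status}")
```

If either tool is reported as not found, closing and reopening the terminal is the first thing to try.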
STEP 2: Clone the repository
Open a new terminal window and navigate to where you want to download Snakespeare.
If you are not sure, I recommend you copy and paste these commands to make a new directory called GitHub_repos, then "change directory" into the folder:
mkdir GitHub_repos
cd GitHub_repos
Copy and paste these commands to clone this repository and then "change directory" into the folder.
git clone https://github.com/lisakmalins/Snakespeare.git
cd Snakespeare
STEP 3: Build and activate the conda environment
When you build the conda environment, Conda obtains all the software listed in environment.yaml. You only need to do this step once.
conda env create -f environment.yaml
Finally, you will need to activate the environment. The environment is named "snakespeare," and the software will only be accessible while the environment is active.
conda activate snakespeare
Note: for older versions of Anaconda, you may need to use the command
source activate snakespeare
instead.
When you want to deactivate the environment later, you can do so with this command:
conda deactivate
STEP 4: Run Snakespeare
Run the snakemake workflow like this:
snakemake
That's it! The workflow should finish within a few seconds. The output plot showing all dialogue statistics will appear in the folder Snakespeare/data/plots/.
Code Snippets
    suppressPackageStartupMessages(library(readr))
    # dplyr is required for most of the dataframe manipulation.
    suppressPackageStartupMessages(library(dplyr))

    ##-------------------- Read arguments --------------------##
    args <- commandArgs(trailingOnly = TRUE)
    input = args[0:(length(args)-2)]
    play = args[length(args)-1]
    output = args[length(args)]

    ##-------------------- Determine name of metrics --------------------##
    # Determine metric from each filename by removing name of play
    metrics <- sapply(input, function(f) {
      sub(paste0(".*", play, "_?"), "", sub("\\..*", "", f))
    })

    ##-------------------- Read data --------------------##
    data <- list()
    for (f in input) {
      # Read data from file
      data[[f]] <- read_tsv(f, col_names=c("character", metrics[[f]]), col_types=cols())
    }

    ##-------------------- Join data --------------------##
    # Full outer join data together
    all_data <-
      # Use full outer join to combine speeches and lines into one table
      data[[input[1]]] %>%
      full_join(data[[input[2]]], by="character") %>%
      # Determine average lines per speech
      mutate(avg_speech_length=total_lines/num_speeches) %>%
      # Sort descendingly by first metric
      arrange(desc(.[[2]])) %>%
      # Add column for name of play
      mutate(play=play)

    print(all_data)

    ##-------------------- Write output --------------------##
    write_tsv(all_data, output)
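The R script above does three things: it derives a metric name from each input filename, joins the two tables by character, and divides total lines by number of speeches. Here is a rough Python sketch of the same idea using only the standard library. The filenames, character names, and counts are made up for illustration, and the functions metric_name and full_join are hypothetical helpers, not part of Snakespeare.

```python
import re


def metric_name(filename, play):
    """Strip the extension, then everything up to and including the play name."""
    stem = re.sub(r"\..*", "", filename, count=1)
    # re.escape is an extra safety step; the R script uses the play name as-is
    return re.sub(".*" + re.escape(play) + "_?", "", stem, count=1)


def full_join(speeches, lines):
    """Full outer join of two {character: count} dicts, plus lines per speech."""
    rows = []
    for character in sorted(set(speeches) | set(lines)):
        n_speeches = speeches.get(character)
        n_lines = lines.get(character)
        # Only compute the average when both counts are available
        if n_speeches and n_lines is not None:
            avg = n_lines / n_speeches
        else:
            avg = None
        rows.append({
            "character": character,
            "num_speeches": n_speeches,
            "total_lines": n_lines,
            "avg_speech_length": avg,
        })
    # Sort descending by the first metric; missing values go last
    rows.sort(key=lambda r: r["num_speeches"] or 0, reverse=True)
    return rows


# Hypothetical toy counts, not real results from the plays
rows = full_join({"ALPHA": 4, "BETA": 2}, {"ALPHA": 10, "BETA": 7, "GAMMA": 3})
for row in rows:
    print(row)
```

A character that appears in only one table still gets a row, which is what makes this a full outer join rather than an inner join.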
    from collections import defaultdict

    # Find start of play (skip title and character list)
    def FindPlayStart(lines):
        for i in range(0, len(lines)):
            if lines[i][:5] == "SCENE":
                return i
        return -1

    # Read in list of characters
    with open(snakemake.input[1], 'r') as characters_input:
        characters = [l.strip() for l in characters_input.readlines()]

    # Count each character's line blocks
    speeches_by_character = defaultdict(int)
    with open(snakemake.input[0], 'r') as play_input:
        lines = play_input.readlines()
        i = FindPlayStart(lines)
        while i < len(lines):
            # Get line and strip to compare with characters list
            line = lines[i].rstrip('\n').rstrip('.').lstrip()
            if line in characters:
                speeches_by_character[line] += 1
            i += 1

    # Output to file
    with open(snakemake.output[0], 'w') as output:
        for c, f in speeches_by_character.items():
            output.write(f"{c}\t{f}\n")
    from collections import defaultdict

    # Find start of play (skip title and character list)
    def FindPlayStart(lines):
        for i in range(0, len(lines)):
            if lines[i][:5] == "SCENE":
                return i
        return -1

    # Read in list of characters
    with open(snakemake.input[1], 'r') as characters_input:
        characters = [l.strip() for l in characters_input.readlines()]

    # Count each character's lines
    lines_by_character = defaultdict(int)
    with open(snakemake.input[0], 'r') as play_input:
        lines = play_input.readlines()
        i = FindPlayStart(lines)
        while i < len(lines):
            # Get line and strip to compare with characters list
            line = lines[i].rstrip('\n').rstrip('.').lstrip()
            if line in characters:
                # If char name found, count how many lines follow
                while True:
                    i += 1
                    # Stop if i out of range (end of play)
                    # or if newline encountered (end of line block)
                    if i >= len(lines) or lines[i] == "\n":
                        break
                    # Increment num lines for this character
                    lines_by_character[line] += 1
            else:
                i += 1

    # Output to file
    with open(snakemake.output[0], 'w') as output:
        for c, f in lines_by_character.items():
            output.write(c + "\t" + str(f) + '\n')
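The two Python scripts above rely on the snakemake object, so they only run inside the workflow. To try the counting logic on its own, here is a self-contained sketch that applies the same rules to a tiny invented "play." The text, the character names ALPHA and BETA, and the function count_dialogue are all made up for illustration. One small difference from the scripts: if no line starts with "SCENE", this sketch starts from the first line.

```python
from collections import defaultdict

# A tiny invented script, laid out the way the scripts above expect:
# a character name followed by a period, then that character's lines,
# then a blank line to end the speech.
TOY_PLAY = """\
A TOY PLAY

SCENE I. A room.

ALPHA.
First line for Alpha.
Second line for Alpha.

BETA.
One line for Beta.

ALPHA.
Alpha speaks again.
"""


def count_dialogue(text, characters):
    """Return (speeches, total_lines) dicts keyed by character name."""
    lines = text.splitlines(keepends=True)
    # Skip the title block by starting at the first "SCENE" line
    start = next((i for i, l in enumerate(lines) if l.startswith("SCENE")), 0)
    speeches = defaultdict(int)
    total_lines = defaultdict(int)
    i = start
    while i < len(lines):
        name = lines[i].rstrip("\n").rstrip(".").lstrip()
        if name in characters:
            speeches[name] += 1
            i += 1
            # Count lines until a blank line or the end of the text
            while i < len(lines) and lines[i] != "\n":
                total_lines[name] += 1
                i += 1
        else:
            i += 1
    return dict(speeches), dict(total_lines)


speeches, total_lines = count_dialogue(TOY_PLAY, {"ALPHA", "BETA"})
print(speeches)      # {'ALPHA': 2, 'BETA': 1}
print(total_lines)   # {'ALPHA': 3, 'BETA': 1}
```

A speech is one block under a character name, and a line is any non-blank row inside that block, so ALPHA ends up with 2 speeches and 3 lines here.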
    suppressPackageStartupMessages(library(readr))
    # dplyr is required for most of the dataframe manipulation.
    suppressPackageStartupMessages(library(dplyr))
    # tidyr is required specifically for the pivot_longer function.
    suppressPackageStartupMessages(library(tidyr))
    # ggplot2 is required for plotting the data.
    suppressPackageStartupMessages(library(ggplot2))
    # yaml is required for reading the configuration file (config.yaml).
    suppressPackageStartupMessages(library(yaml))

    ##-------------------- Read arguments --------------------##
    args <- commandArgs(trailingOnly = TRUE)
    input = args[0:(length(args)-1)]
    output = args[length(args)]

    ##-------------------- Read config.yaml --------------------##
    config <- read_yaml("config.yaml")

    ##-------------------- Prepare list of plays and metrics --------------------##
    plays <- config[["plays"]]
    metrics <- config[["metrics"]]

    ##-------------------- Read data --------------------##
    # Concatenate data for all plays into a single table
    all_data_rbind <- do.call("rbind", lapply(input, function(f) {
      read_tsv(f, col_names=TRUE, col_types=cols()) %>%
        # Set factor levels of "character" column in the order that they appear
        mutate(character = factor(character, levels=rev(character)))
    }))

    # Filter out characters with few lines
    if ("total_lines_cutoff" %in% names(config)) {
      total_lines_cutoff <- config[["total_lines_cutoff"]]
    } else {
      total_lines_cutoff <- 25
    }
    print(paste("Characters with fewer than", total_lines_cutoff, "will be excluded."))
    print(paste("The following rows will be dropped:"))
    all_data_rbind %>%
      filter(total_lines <= total_lines_cutoff)
    all_data_rbind <- all_data_rbind %>%
      filter(total_lines > total_lines_cutoff)

    # Pivot data from "wide" into "long" format
    # to prepare for `facet_wrap`
    all_data_longer <- all_data_rbind %>%
      pivot_longer(cols=all_of(names(metrics)), names_to="metric", values_to="value")

    # Set factor levels to control order of facets in facet-wrapped plot
    all_data_longer$metric <- factor(all_data_longer$metric, levels = names(metrics))

    ##-------------------- Plot data --------------------##
    ggplot(all_data_longer, aes(x=character, y=value)) +
      geom_col() +
      # Add title but suppress x- and y- axis labels
      labs(y=NULL, x=NULL, title=paste("Dialogue statistics for Shakespeare characters")) +
      # Grid plots by play and metric
      facet_grid(rows=vars(play), cols=vars(metric), scales="free",
                 # Use more descriptive labels prepared earlier
                 labeller=labeller(metric = unlist(metrics), play = unlist(plays))) +
      # Make the bar graph horizontal to more easily read character names
      coord_flip() +
      # Add number label floating next to bars
      geom_text(aes(label=round(value, 1)), hjust = -0.2, color="#333333") +
      # Add a bit of extra space for the labels
      # Vector is c(mult[x], add[x], mult[y], add[y])
      scale_y_discrete(expand=c(0, 0, 0.2, 0)) +
      # Tweak theme
      theme_bw() +
      theme(text=element_text(size=18)) +
      theme(plot.title=element_text(hjust=0.5)) +
      theme(plot.caption=element_text(hjust=0, color="gray30"))

    # Save plot
    ggsave(
      output,
      height=8.5,
      width=11,
      plot=last_plot())
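Two steps in the plotting script are easy to gloss over: dropping characters at or below the line cutoff, and pivoting the table from "wide" to "long" so each metric gets its own panel. The sketch below mimics both steps in plain Python. The rows and the metric list are invented for illustration (the real workflow reads metric names from config.yaml), and filter_and_pivot is a hypothetical helper. The default cutoff of 25 and the strict "greater than" comparison are taken from the script above.

```python
def filter_and_pivot(rows, metrics, cutoff=25):
    """Keep rows with total_lines > cutoff, then emit one row per metric."""
    kept = [row for row in rows if row["total_lines"] > cutoff]
    long_rows = []
    for row in kept:
        for metric in metrics:
            long_rows.append({
                "character": row["character"],
                "play": row["play"],
                "metric": metric,
                "value": row[metric],
            })
    return long_rows


# Hypothetical wide-format rows, one per character
rows = [
    {"character": "ALPHA", "play": "toy", "num_speeches": 10,
     "total_lines": 40, "avg_speech_length": 4.0},
    {"character": "BETA", "play": "toy", "num_speeches": 5,
     "total_lines": 25, "avg_speech_length": 5.0},
]
metrics = ["num_speeches", "total_lines", "avg_speech_length"]

long_rows = filter_and_pivot(rows, metrics)
for row in long_rows:
    print(row)
```

BETA has exactly 25 lines, so it is dropped: a character needs more than the cutoff to be kept. ALPHA's single wide row becomes three long rows, one per metric.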
    script:
        "scripts/count_speeches.py"

    script:
        "scripts/count_total_lines.py"

    shell:
        "Rscript scripts/calculate_speech_length.R {input} \"{wildcards.play}\" {output}"

    shell:
        "Rscript scripts/plot_all_metrics.R {input} {output}"
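In the two shell rules above, Snakemake fills in {input} and {output} before running the command, and a rule with several input files gets them as a space-separated list. The R scripts then pick the arguments apart by position: everything before the end is an input file, and the last one or two arguments are the play name and the output path. Here is a small Python sketch of that convention for the three-part case. The filenames are invented and split_args is a hypothetical helper.

```python
def split_args(argv):
    """Split positional arguments into (inputs, play, output)."""
    if len(argv) < 3:
        raise ValueError("expected at least one input, a play name, and an output")
    # All but the last two are inputs; then the play name; then the output path
    return argv[:-2], argv[-2], argv[-1]


# Hypothetical arguments, in the order the shell command passes them
inputs, play, output = split_args(
    ["hamlet_num_speeches.tsv", "hamlet_total_lines.tsv", "hamlet", "hamlet.tsv"]
)
print(inputs)
print(play)
print(output)
```

Passing arguments by position keeps the shell command short, but it does mean the order in the Snakefile and the order expected by the script have to stay in sync.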