Pipeline to pull microbial reads from WGS data and perform metagenomic analysis
DOI for manuscript: https://doi.org/10.12688/wellcomeopenres.19155.1
Please follow the tutorial in my Jupyter Book, available here: https://aidanfoo96.github.io/MINUUR/, to reproduce my analysis or to apply the pipeline to your host of interest :)
MINUUR is a Snakemake pipeline I developed to extract non-host sequencing reads from mosquito whole genome sequencing (WGS) data and apply a range of metagenomic analyses to characterise potential host-associated microbes. It can also be applied to WGS data from other hosts. MINUUR aims to leverage pre-existing WGS data to recover microbial information pertaining to host-associated microbiomes; a sketch of the core read-extraction and classification steps follows the tool list below.
MINUUR utilises:
- KRAKEN2: classify taxa from unmapped read sequences
- KrakenTools: extract classified reads for downstream analysis
- BRACKEN: re-estimate taxonomic abundance from KRAKEN2 classifications
- MetaPhlAn3: classify taxa using marker genes
- MEGAHIT: assemble metagenomes from unmapped reads
- QUAST: generate assembly statistics from MEGAHIT assemblies
- MetaBat2: bin contiguous sequences from MEGAHIT assemblies
- CheckM: assess the quality of MetaBat2 bins
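For orientation, here is a minimal sketch of the host-read removal and classification steps, mirroring the commands in the pipeline's own rules (see Code Snippets below). File names and the database path are placeholders; the real pipeline wires these steps together through Snakemake rules and its configuration.

```bash
# Hypothetical file names; MINUUR derives these from its config.
# Keep read pairs where both mates failed to map to the host genome
# (-f 12) and drop secondary alignments (-F 256).
samtools view -b -f 12 -F 256 sample_host_aligned.bam > sample_unmapped.bam

# Convert the unmapped pairs back to FASTQ for classification.
bamToFastq -i sample_unmapped.bam -fq sample_1.fastq -fq2 sample_2.fastq

# Classify the non-host reads against a KRAKEN2 database.
kraken2 --db /path/to/kraken2_db --paired \
    --report sample_kraken_report.txt --output sample_kraken_out.txt \
    sample_1.fastq sample_2.fastq
```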
Installation of Snakemake
MINUUR is run using the workflow manager Snakemake. Snakemake is best installed using the package manager Mamba. Once Mamba is installed, run:

```bash
mamba create -c bioconda -c conda-forge --name snakemake snakemake
```
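Activate the environment before invoking the pipeline; a quick version check confirms the install:

```bash
mamba activate snakemake
snakemake --version
```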
Installation of MINUUR
Clone the repository and change into its workflow directory:

```bash
git clone https://github.com/aidanfoo96/MINUUR/
cd MINUUR/workflow
```

This is the reference point from which the pipeline will be run. See the Jupyter Book for a full tutorial on establishing the configuration to run this pipeline.
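Once the configuration is in place, a typical invocation from `MINUUR/workflow` looks like the sketch below. The core count is arbitrary, and `--use-conda` assumes you want Snakemake to manage each rule's software environment; check the Jupyter Book for the exact options used in the tutorial.

```bash
snakemake --cores 4 --use-conda
```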
Update 09/05/2023:
- Added GitHub Actions
- A dummy dataset is now included in workflow/data, and a tutorial for running it is included in the Jupyter Book. Use this to ensure the pipeline works on your machine (see the dry-run sketch below).
- Added the option to run BUSCO to help assess eukaryotic contamination in MAGs
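Before a full run on the dummy data, a Snakemake dry run is a cheap way to confirm the configuration resolves; this is standard Snakemake rather than anything MINUUR-specific:

```bash
# -n (dry run) prints the jobs that would execute without running them;
# -p also prints each job's shell command.
snakemake -np
```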
For any feedback or bug reports, please open an issue or contact: aidan.foo@lstmed.ac.uk
Code Snippets
The rule directives below are reproduced from the workflow, with catalogue line numbers stripped; brief comments flag the less obvious commands.

```
# Input checks, read QC and trimming
script: "../scripts/check_inputs.py"

wrapper: "v0.80.1/bio/fastqc"

wrapper: "v1.21.6/bio/cutadapt/pe"

wrapper: "v0.80.1/bio/fastqc"

# Host alignment statistics
wrapper: "0.79.0/bio/samtools/stats"

# Pull selected summary lines from the samtools stats output
shell: "sed -n '8p;14p;16p' {input.sample} > {output.txt}"

# Keep read pairs where both mates are unmapped (-f 12), dropping secondary alignments (-F 256)
shell: "samtools view -b -f 12 -F 256 {input.sample} > {output.bam} 2> {log}"

shell: "bamToFastq -i {input.sample} -fq {output.fastq1} -fq2 {output.fastq2}"

shell: "sed -n '8p;14p;16p' {input.sample} > {output.txt}"

# Keep unmapped reads (-f 4)
shell: "samtools view -b -f 4 {input} > {output.bam} 2> {log}"

shell: "bamToFastq -i {input.sample} -fq {output.fastq1} -fq2 {output.fastq2}"

# Extract reads classified under a given taxon (and its children) with KrakenTools
shell:
    """
    extract_kraken_reads.py -k {input.krak_file} \
    -s1 {input.r1} -s2 {input.r2} \
    --output {output.read1} \
    --output2 {output.read2} \
    --fastq-output \
    --taxid {params.taxa} \
    --include-children \
    --report {input.krak_reprt} 2> {log}
    """

shell: "kreport2mpa.py -r {input.krak_file} -o {output.mpa_txt} --display-header"

shell: "combine_mpa.py -i {input.mpa_files} -o {output.combined_report}"

script: "../scripts/generate_kraken_summaries.R"

script: "../scripts/combined_kraken_alignment_stat_ffq.R"

shell: "sed -n '1p;2p' {input.krak_mpa_report} > {output.classified_summaries}"

script: "../scripts/plot_classifiedVsunclassified_reads.R"

script: "../scripts/plot_kraken_heatmaps.R"

script: "../scripts/plot_kraken_spatial.R"

# MetaPhlAn post-processing
shell: "merge_metaphlan_tables.py {input.cln_out} > {output.concat_tbl}"

shell: "grep 's__\|UNKNOWN' {input.txt_out} | cut -f1,3 > {output.tbl_out}"

script: "../scripts/generate_metaphlan_summaries.R"

script: "../scripts/plot_bracken_results.R"

# Host genome indexing and alignment
wrapper: "0.80.2/bio/bwa/index"

wrapper: "v1.17.3/bio/bwa/mem"

# Bin quality plots and BUSCO assessment
script: "../scripts/plot_checkm_QA.R"

wrapper: "v1.28.0/bio/busco"
```
../scripts/check_inputs.py:

```python
import os

import numpy as np
import pandas as pd

sample_list = pd.read_csv(snakemake.params['sample_list'], sep="\t")
automatic_input = snakemake.params['automatic_input']

# Check sample names given in the fastq_samples.tsv file match the given fastq files
if automatic_input is False:
    sample_table = pd.read_csv(snakemake.params['sample_table'], sep="\t")
    assert np.isin(sample_table['sampleID'], sample_list['sampleID']).all(), \
        "samples in your sample table are not present in your sample list. Please ensure all samples match"
else:
    for sample in sample_list['sampleID']:
        for num in [1, 2]:
            fastq_sample = f"../resources/{sample}_{num}.fastq.gz"
            assert os.path.isfile(fastq_sample), \
                "your sample names given in 'samples_fastq.tsv' do not match your fastq files. Please rename your files accordingly"

print("Inputs Passed :D")
```
../scripts/combined_kraken_alignment_stat_ffq.R:

```r
library(tidyverse)
library(MetBrewer)

#### Load Data ####
concatenated_alignments <- read_tsv(snakemake@input[["alignment_stats"]],
                                    col_names = c("SN", "sequence", "num_sequences", "file_samples"))

concatenated_alignments$file_samples[concatenated_alignments$file_samples == "# excluding supplementary and secondary reads"] <- ""

#### Get List of Sample Names ####
sample_names <- concatenated_alignments %>%
  separate(file_samples, c("file", "sample"), sep = "stats_ffq/") %>%
  separate(sample, c("sample", "junk"), sep = "_align") %>%
  select(sequence, num_sequences, sample) %>%
  group_by(sample) %>%
  summarise(n()) %>%
  select(sample) %>%
  filter(sample != "NA")

sample_names <- as.vector(sample_names$sample)

#### Clean Data ####
concatenated_alignments_clean <- concatenated_alignments %>%
  select(!SN) %>%
  separate(file_samples, c("file", "sample"), sep = "stats_ffq/") %>%
  separate(sample, c("sample", "junk"), sep = "_align") %>%
  select(sequence, num_sequences, sample) %>%
  mutate(sample = replace(sample, is.na(sample), sample_names))

##### Generate Summary Tables #####
# Wider, 'human readable' format with proportions
concatenated_alignments_clean_wide <- concatenated_alignments_clean %>%
  group_by(sample) %>%
  pivot_wider(names_from = sequence, values_from = num_sequences) %>%
  mutate(proportion_mapped = `reads mapped:` / `raw total sequences:` * 100)

write.table(concatenated_alignments_clean_wide,
            file = snakemake@output[["human_read_tbl"]],
            sep = "\t", row.names = FALSE)

write.table(concatenated_alignments_clean,
            file = snakemake@output[["long_read_tbl"]],
            sep = "\t", row.names = FALSE)

#### Plot Summaries ####
alignment_statistics_plot <- concatenated_alignments_clean %>%
  ggplot() +
  aes(x = sample, y = num_sequences, fill = sequence) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_bw(base_size = 15) +
  scale_fill_manual(values = met.brewer("Redon", n = 3, type = "discrete"),
                    labels = c("Raw Read Number", "Reads Mapped", "Reads Unmapped")) +
  xlab("Sample") +
  ylab("Read Number") +
  theme(legend.title = element_blank(),
        axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
        legend.position = "top")

ggsave(snakemake@output[["alignment_stat_plot"]], alignment_statistics_plot)
```
../scripts/generate_kraken_summaries.R:

```r
library(tidyverse)

genus_delim <- function(){
  if(snakemake@params[["kraken_db_type"]]){
    return("g__")
  }else{
    return("g__g__")
  }
}

spp_delim <- function(){
  if(snakemake@params[["kraken_db_type"]]){
    return("s__")
  }else{
    return("s__s__")
  }
}

genus <- genus_delim()
spp <- spp_delim()

data <- read_tsv(snakemake@input[["combined_report"]])

##### Get Domain #####
kingdom_data <- data %>%
  filter(!grepl("p__|g__|s__|c__|o__|f__", `#Classification`)) %>%
  separate(`#Classification`, c("junk", "kingdom"), sep = "^k__") %>%
  select(!junk)

kingdom_data_long <- kingdom_data %>%
  pivot_longer(cols = !kingdom, names_to = "study", values_to = "read_number")

##### Get genus #####
genus_data <- data %>%
  filter(grepl("g__", `#Classification`)) %>%
  separate(`#Classification`, c("junk", "genus"), sep = genus) %>%
  select(!junk) %>%
  filter(!grepl("s__", genus))

genus_data_long <- genus_data %>%
  pivot_longer(cols = !genus, names_to = "study", values_to = "read_number")

##### Get species #####
species_data <- data %>%
  filter(grepl("s__", `#Classification`)) %>%
  separate(`#Classification`, c("junk", "species"), sep = spp) %>%
  select(!junk)

species_data_long <- species_data %>%
  pivot_longer(cols = !species, names_to = "study", values_to = "read_number")

##### How many reads per study? #####
total_reads_per_study <- species_data_long %>%
  group_by(study) %>%
  summarise(total_reads = sum(read_number))

classified_reads_plot <- total_reads_per_study %>%
  ggplot() +
  aes(x = study, y = total_reads) +
  geom_col() +
  theme_classic() +
  coord_flip()

##### Species classification passed a specific threshold (unfinished) #####
species_data_filtered <- species_data_long %>%
  filter(read_number > 1000)

genus_data_filtered <- genus_data_long %>%
  filter(read_number > 1000)

#### Export Dataframes to TSV files ####
write.table(kingdom_data_long, file = snakemake@output[["kingdom_table"]], sep = "\t", row.names = FALSE)
write.table(genus_data_long, file = snakemake@output[["genus_table"]], sep = "\t", row.names = FALSE)
write.table(species_data_long, file = snakemake@output[["spp_table"]], sep = "\t", row.names = FALSE)
write.table(total_reads_per_study, file = snakemake@output[["classified_reads"]], sep = "\t", row.names = FALSE)

#### Export Classified Reads Plot ####
pdf(snakemake@output[["classified_reads_plot"]])
print(classified_reads_plot)
dev.off()
```
../scripts/generate_metaphlan_summaries.R:

```r
library(tidyverse)

metaphlan_tbl <- read_tsv(snakemake@input[["concat_tbl"]], skip = 1) %>%
  select(!NCBI_tax_id)

#### Get Kingdom ####
metaphlan_tbl_king <- metaphlan_tbl %>%
  filter(!grepl("p__|g__|s__|c__|o__|f__", clade_name)) %>%
  mutate_at("clade_name", str_replace, "k__", "")

metaphlan_tbl_king_long <- metaphlan_tbl_king %>%
  pivot_longer(cols = !clade_name, names_to = "samples", values_to = "rel_abund")

#### Get Genus ####
metaphlan_tbl_genus <- metaphlan_tbl %>%
  filter(grepl("g__", clade_name)) %>%
  separate(clade_name, c("junk", "genus"), sep = "g__") %>%
  select(!junk) %>%
  filter(!grepl("s__", genus))

metaphlan_tbl_genus_long <- metaphlan_tbl_genus %>%
  pivot_longer(cols = !genus, names_to = "samples", values_to = "rel_abund")

#### Get Species ####
metaphlan_tbl_spp <- metaphlan_tbl %>%
  filter(grepl("s__", clade_name)) %>%
  separate(clade_name, c("junk", "species"), sep = "s__") %>%
  select(!junk)

metaphlan_tbl_spp_long <- metaphlan_tbl_spp %>%
  pivot_longer(cols = !species, names_to = "samples", values_to = "rel_abund")

#### Export Dataframes to TSV files ####
write.table(metaphlan_tbl_king_long, file = snakemake@output[["kingdom_table"]], sep = "\t", row.names = FALSE)
write.table(metaphlan_tbl_genus_long, file = snakemake@output[["genus_table"]], sep = "\t", row.names = FALSE)
write.table(metaphlan_tbl_spp_long, file = snakemake@output[["spp_table"]], sep = "\t", row.names = FALSE)
```
../scripts/plot_bracken_results.R:

```r
library(tidyverse)

# Get colour palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# Set Read Filter
filter <- as.character(snakemake@params[["bracken_threshold"]])
filter <- as.numeric(filter)

#### Load Data ####
conc_data <- read_tsv(snakemake@input[["conc_input"]],
                      col_names = c("species", "taxa_id", "taxa_lvl",
                                    "kraken_assigned_reads", "added_reads",
                                    "new_est_reads", "fraction_total", "file_path")) %>%
  filter(species != "name")

conc_data_clean <- conc_data %>%
  separate(file_path, c("file", "sample"), sep = "out/") %>%
  separate(sample, c("sample", "junk"), sep = "_bracken") %>%
  select(sample, species, taxa_id, taxa_lvl, kraken_assigned_reads,
         added_reads, new_est_reads, fraction_total) %>%
  mutate_at(c("taxa_id", "kraken_assigned_reads", "added_reads",
              "new_est_reads", "fraction_total"), as.numeric)

#### Added Reads Across Samples ####
sample_summary <- conc_data_clean %>%
  group_by(sample) %>%
  summarise(kraken_reads = sum(kraken_assigned_reads),
            added_reads = sum(added_reads),
            total_new_reads = sum(new_est_reads))

sample_summary_long <- sample_summary %>%
  pivot_longer(cols = !sample, names_to = "classification_type", values_to = "num_reads")

sample_summary_plot <- sample_summary %>%
  pivot_longer(cols = !sample, names_to = "classification_type", values_to = "num_reads") %>%
  filter(classification_type != "total_new_reads") %>%
  ggplot() +
  aes(reorder(x = sample, desc(num_reads)), y = num_reads, fill = classification_type) +
  geom_bar(stat = "identity", position = "stack") +
  coord_flip() +
  scale_fill_manual(values = cbPalette) +
  theme_classic(base_size = 25) +
  xlab("Sample") +
  ylab("Number of Reads")

ggsave(snakemake@output[["added_reads_plot"]], height = 15, width = 20)

#### Stratified Heatmap of Relative Abundance Per Sample ####
conc_data_clean$species <- sub(" ", "_", conc_data_clean$species)

conc_data_clean_sep <- conc_data_clean %>%
  separate(species, into = c("Genus", "Species"), sep = "_") %>%
  filter(kraken_assigned_reads > filter)

conc_data_clean_genus_group <- conc_data_clean_sep %>%
  group_by(Genus) %>%
  summarise(n())

plot_stratified_heatmap <- function(){
  for(i in conc_data_clean_genus_group$Genus) {
    print(
      conc_data_clean_sep %>%
        filter(Genus == i) %>%
        ggplot() +
        aes(x = sample, y = Species) +
        geom_tile(aes(fill = new_est_reads), linejoin = "round") +
        scale_fill_gradientn(colours = c("white", "#F0E442", "#E69F00")) +
        facet_grid(rows = "Genus") +
        theme_classic(base_size = 25) +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
        xlab("Sample"))
  }
}

pdf(snakemake@output[["bracken_stratified_heatmap"]], height = 20, width = 20)
plot_stratified_heatmap()
dev.off()

##### Heatmap of Species #####
plot_heatmap_per_sample <- function(data, filter_thresh){
  data %>%
    filter(new_est_reads > filter_thresh) %>%
    select(sample, species, fraction_total) %>%
    ggplot() +
    aes(x = sample, y = species) +
    geom_tile(aes(fill = log(fraction_total))) +
    scale_fill_gradientn(colours = c("white", "orange", "blue")) +
    theme_classic() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
    ggtitle(paste("Log Relative Abundance of Microbial Species filter: Filtered > ", filter_thresh, sep = ""))
}

pdf(snakemake@output[["bracken_overall_heatmap"]], height = 20, width = 20)
plot_heatmap_per_sample(conc_data_clean, filter)
dev.off()
```
../scripts/plot_checkm_QA.R:

```r
library(tidyverse)
library(MetBrewer)

# Load Appropriate Columns
columns <- c("bin", "marker_lineage", "num_genomes", "num_markers", "num_marker_sets",
             "completeness", "contamination", "strain_heterogen", "genome_size_bp",
             "num_ambiguous_bases", "num_scaffolds", "num_contigs", "N50_scaffold",
             "N50_contig", "mean_scaffold_length", "mean_contig_length",
             "longest_scaffold", "longest_contig", "GC", "GC_std", "coding_density",
             "translation_table", "num_predicted_genes", "0", "1", "2", "3", "4", "5+",
             "sample")

# Read Data, skip first row and remove the redundant header columns from previous step
bin_stats <- read_tsv(snakemake@input[["checkm_conc_result"]], col_names = columns, skip = 1) %>%
  filter(bin != "Bin Id")

# Convert to Numeric Columns
bin_stats$completeness <- as.numeric(bin_stats$completeness)
bin_stats$contamination <- as.numeric(bin_stats$contamination)
bin_stats$genome_size_bp <- as.numeric(bin_stats$genome_size_bp)
bin_stats$num_contigs <- as.numeric(bin_stats$num_contigs)
bin_stats$mean_contig_length <- as.numeric(bin_stats$mean_contig_length)
bin_stats$N50_contig <- as.numeric(bin_stats$N50_contig)

# Get breaks for plotting
bin_stats$col <- cut(bin_stats$completeness,
                     breaks = c(-Inf, 80, Inf),
                     labels = c("<80", ">=80"))

bin_stats$contig_cut <- cut(bin_stats$num_contigs,
                            breaks = c(-Inf, 250, Inf),
                            labels = c("<=250", ">250"))

# Plot Completeness Vs Contamination
CompletenessVsContam <- bin_stats %>%
  ggplot() +
  aes(x = completeness, y = contamination, size = genome_size_bp, col = col) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(limits = c(0, 100)) +
  theme_minimal() +
  scale_color_manual(values = c("blue", "orange"))

pdf(snakemake@output[["CompletenessVsContam"]])
print(CompletenessVsContam)
dev.off()

#### Plot Completeness Vs Contamination Per Sample
bin_stats_sample_split <- bin_stats %>%
  separate(bin, into = c("sample", "bin_num"), sep = "\\.")

compvscontam_samp <- bin_stats_sample_split %>%
  ggplot() +
  aes(x = completeness, y = contamination, col = sample) +
  geom_point(alpha = 0.8, size = 4) +
  scale_x_continuous(limits = c(0, 100)) +
  geom_vline(xintercept = 80, linetype = "dotted", color = "green", size = 1) +
  geom_vline(xintercept = 50, linetype = "dotted", color = "orange", size = 1) +
  theme_bw(base_size = 18) +
  scale_color_manual(values = met.brewer("Redon", 10, type = "discrete")) +
  xlab("Completeness") +
  ylab("Contamination") +
  labs(colour = "Sample") +
  theme(legend.position = "bottom",
        legend.direction = "horizontal",
        legend.text = element_text(size = 10))

pdf(snakemake@output[["CompletenessVsContam_Per_Sample"]])
print(compvscontam_samp)
dev.off()

#### Plot Bar Chart with Completeness vs Contamination Per Sample
bin_stats_bar <- bin_stats_sample_split %>%
  unite("sample_bin", sample:bin_num) %>%
  select(completeness, contamination, sample_bin) %>%
  pivot_longer(c(completeness, contamination), names_to = "bin_rank", values_to = "value") %>%
  ggplot() +
  aes(reorder(x = sample_bin, desc(value)), y = value, fill = bin_rank) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_bw(base_size = 15) +
  coord_flip() +
  scale_fill_manual(values = met.brewer("Redon", 2, type = "discrete"),
                    labels = c("Completeness", "Contamination")) +
  geom_hline(yintercept = 80, linetype = "dotted", color = "green", size = 1) +
  geom_hline(yintercept = 50, linetype = "dotted", color = "orange", size = 1) +
  xlab("Sample") +
  ylab("Contamination and Completion Score (%)") +
  labs(fill = "CheckM Completion vs Contamination Score") +
  theme(legend.position = "bottom", legend.direction = "vertical")

pdf(snakemake@output[["BarChartCompletenessContamination"]])
print(bin_stats_bar)
dev.off()

#### Plot the number of contigs vs completeness
NumContigsVsCompleteness <- bin_stats %>%
  ggplot() +
  aes(x = num_contigs, y = completeness, size = mean_contig_length, col = contig_cut) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  scale_color_manual(values = c("blue", "orange"))

pdf(snakemake@output[["NumContigsVsCompleteness"]])
print(NumContigsVsCompleteness)
dev.off()
```
../scripts/plot_classifiedVsunclassified_reads.R:

```r
library(tidyverse)
library(MetBrewer)

#### Import Kraken Data
concatenated_kraken_summaries <- read_tsv(snakemake@input[["concatenated_kraken_summary"]],
                                          col_names = c("proportion_of_reads", "number_of_reads",
                                                        "number_of_reads2", "read_code", "read_code2",
                                                        "classification_status", "file_path"))

concatenated_kraken_summaries_clean <- concatenated_kraken_summaries %>%
  separate(file_path, into = c("junk", "filename"), sep = "classified_summary/") %>%
  separate(filename, into = c("sample", "junk2"), sep = "_classified_summary.txt") %>%
  select(sample, proportion_of_reads, number_of_reads, classification_status)

#### Plot Proportion of Classified vs Unclassified Reads from Kraken
plot_kraken_proportion <- function(kraken_table){
  concatenated_kraken_summaries_clean_plot <- kraken_table %>%
    ggplot() +
    aes(x = sample, y = number_of_reads, fill = classification_status) +
    geom_bar(stat = "identity", position = "fill") +
    theme_bw(base_size = 15) +
    scale_fill_manual(values = met.brewer("Redon", n = 2, type = "continuous"),
                      labels = c("Classified Reads", "Unclassified Reads")) +
    xlab("Sample") +
    ylab("Proportion of Reads") +
    theme(legend.title = element_blank(),
          axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
          legend.position = "top")

  return(concatenated_kraken_summaries_clean_plot)
}

kraken_classification_proportions <- plot_kraken_proportion(concatenated_kraken_summaries_clean)

#### Save file
ggsave(snakemake@output[["classified_proportions"]], kraken_classification_proportions)
```
../scripts/plot_kraken_heatmaps.R:

```r
library(tidyverse)
library(MetBrewer)

##### Import Data #####
spp_tbl <- read_tsv(snakemake@input[["spp_table"]])
genus_tbl <- read_tsv(snakemake@input[["genus_table"]])

##### Define User Read Threshold #####
filter_threshold <- as.character(snakemake@params[["strat_thresh"]])
filter_threshold <- as.numeric(filter_threshold)

genus_thres <- as.character(snakemake@params[["genus_thresh"]])
genus_thres <- as.numeric(genus_thres)

spp_thres <- as.character(snakemake@params[["species_thresh"]])
spp_thres <- as.numeric(spp_thres)

##### Clean and Prep Data #####
genus_tbl_clean <- genus_tbl %>%
  separate(study, into = c("sample", "junk"), sep = "_") %>%
  select(!junk) %>%
  rename("phylo_level" = genus)

spp_tbl_clean <- spp_tbl %>%
  separate(study, into = c("sample", "junk"), sep = "_") %>%
  select(!junk) %>%
  rename("phylo_level" = species)

##### Plot Function #####
kraken_heatmap <- function(clean_kraken_table, read_thresh, type){
  heatmap <- clean_kraken_table %>%
    filter(read_number > read_thresh) %>%
    filter(phylo_level != "Homo_sapiens") %>%
    ggplot() +
    aes(x = sample, y = phylo_level) +
    geom_tile(aes(fill = log(read_number)), linejoin = "round") +
    scale_fill_gradientn(colours = met.brewer("Hiroshige", 4, type = "discrete")) +
    theme_bw(base_size = 20) +
    theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
          legend.position = "top",
          legend.text = element_text(size = 15),
          legend.title = element_text(size = 15)) +
    labs(fill = "Read Number (log)") +
    xlab("Sample") +
    ylab(type)

  return(heatmap)
}

species_heatmap <- kraken_heatmap(spp_tbl_clean, spp_thres, "Species")
genus_heatmap <- kraken_heatmap(genus_tbl_clean, genus_thres, "Genus")

ggsave(snakemake@output[["genus_heatmap"]], genus_heatmap, height = 20, width = 15)
ggsave(snakemake@output[["species_heatmap"]], species_heatmap, height = 20, width = 15)

##### Faceted Plot By Genus #####
spp_tbl_split <- spp_tbl_clean %>%
  separate(phylo_level, into = c("Genus", "Species1", "Species2"), sep = "_") %>%
  unite("Species", Species1:Species2, na.rm = TRUE)

genus_list <- spp_tbl_split %>%
  group_by(Genus) %>%
  summarise(n())

plot_species_facets <- function() {
  for(i in genus_list$Genus) {
    print(
      spp_tbl_split %>%
        filter(Genus == i) %>%
        ggplot() +
        aes(x = sample, y = Species) +
        geom_tile(aes(fill = read_number), linejoin = "round") +
        scale_fill_gradientn(colours = c("white", "#F0E442", "#E69F00")) +
        facet_grid(rows = "Genus") +
        theme_classic(base_size = 25) +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
    )
  }
}

pdf(snakemake@output[["stratified_heatmap"]], height = 20, width = 20)
plot_species_facets()
dev.off()
```
../scripts/plot_kraken_spatial.R:

```r
library(tidyverse)
library(MetBrewer)
library(treemapify)

##### Import Data #####
spp_tbl <- read_tsv(snakemake@input[["spp_table"]])
genus_tbl <- read_tsv(snakemake@input[["genus_table"]])

##### Define User Read Threshold #####
filter_threshold <- as.character(snakemake@params[["strat_thresh"]])
filter_threshold <- as.numeric(filter_threshold)

genus_thres <- as.character(snakemake@params[["genus_thresh"]])
genus_thres <- as.numeric(genus_thres)

spp_thres <- as.character(snakemake@params[["species_thresh"]])
spp_thres <- as.numeric(spp_thres)

##### Clean and Prep Data #####
genus_tbl_clean <- genus_tbl %>%
  separate(study, into = c("sample", "junk"), sep = "_") %>%
  select(!junk) %>%
  rename("phylo_level" = genus)

spp_tbl_clean <- spp_tbl %>%
  separate(study, into = c("sample", "junk"), sep = "_") %>%
  select(!junk) %>%
  rename("phylo_level" = species)

##### Plot Spatial Plots #####
plot_spatial_classification <- function(classification_tbl, read_thresh){
  tbl_clean_sum <- classification_tbl %>%
    group_by(phylo_level) %>%
    summarise(summed_read_num = sum(read_number)) %>%
    filter(summed_read_num > read_thresh)

  spatial_plot <- tbl_clean_sum %>%
    filter(phylo_level != "Homo_sapiens") %>%
    filter(summed_read_num > read_thresh) %>%
    ggplot(aes(area = summed_read_num, fill = summed_read_num,
               label = phylo_level, subgroup = phylo_level)) +
    geom_treemap() +
    geom_treemap_subgroup_border(colour = "white", size = 1) +
    geom_treemap_text(colour = "white", place = "center", reflow = T) +
    scale_fill_gradientn(colours = met.brewer("Hiroshige", 10, type = "discrete"),
                         name = "Total Read Number") +
    theme_minimal(base_size = 15) +
    theme(legend.position = "right")

  return(spatial_plot)
}

species_spatial_plot <- plot_spatial_classification(spp_tbl_clean, spp_thres)
genus_spatial_plot <- plot_spatial_classification(genus_tbl_clean, genus_thres)

ggsave(snakemake@output[["genus_spatial_plot"]], genus_spatial_plot, height = 20, width = 15)
ggsave(snakemake@output[["species_spatial_plot"]], species_spatial_plot, height = 20, width = 15)
```
samtools stats wrapper (0.79.0/bio/samtools/stats):

```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell("samtools stats {extra} {snakemake.input} {region} > {snakemake.output} {log}")
```
bwa index wrapper (0.80.2/bio/bwa/index):

```python
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be input!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")

if len(prefix) > 0:
    prefix = "-p " + prefix

# Construction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index"
    " {prefix}"
    " {construction_algorithm}"
    " {snakemake.input[0]}"
    " {log}"
)
```
fastqc wrapper (v0.80.1/bio/fastqc):

```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""
    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)
    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
```
bwa mem wrapper (v1.17.3/bio/bwa/mem):

```python
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "koester@jimmy.harvard.edu, julianderuiter@gmail.com"
__license__ = "MIT"


import tempfile
from os import path

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts


# Extract arguments.
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
sort = snakemake.params.get("sorting", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
samtools_opts = get_samtools_opts(snakemake)
java_opts = get_java_opts(snakemake)

index = snakemake.input.idx
if isinstance(index, str):
    index = path.splitext(snakemake.input.idx)[0]
else:
    index = path.splitext(snakemake.input.idx[0])[0]

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view {samtools_opts}"
elif sort == "samtools":
    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"
    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}"
elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}"
else:
    raise ValueError(f"Unexpected value for params.sort ({sort})")

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(bwa mem"
        " -t {snakemake.threads}"
        " {extra}"
        " {index}"
        " {snakemake.input.reads}"
        " | " + pipe_cmd + ") {log}"
    )
```
cutadapt paired-end wrapper (v1.21.6/bio/cutadapt/pe):

```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell

n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " --cores {snakemake.threads}"
    " {adapters}"
    " {extra}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)
```
BUSCO wrapper (v1.28.0/bio/busco):

```python
__author__ = "Tessa Pierce"
__copyright__ = "Copyright 2018, Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"

import tempfile
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
mode = snakemake.params.get("mode")
assert mode in [
    "genome",
    "transcriptome",
    "proteins",
], "invalid run mode: only 'genome', 'transcriptome' or 'proteins' allowed"
lineage = lineage_opt = snakemake.params.get("lineage", "")
if lineage_opt:
    lineage_opt = f"--lineage {lineage_opt}"

with tempfile.TemporaryDirectory() as tmpdir:
    dataset_dir = snakemake.input.get("dataset_dir", "")
    if not dataset_dir:
        dataset_dir = f"{tmpdir}/dataset"

    shell(
        "busco"
        " --cpu {snakemake.threads}"
        " --in {snakemake.input}"
        " --mode {mode}"
        " {lineage_opt}"
        " {extra}"
        " --download_path {dataset_dir}"
        " --out_path {tmpdir}"
        " --out output"
        " {log}"
    )

    if snakemake.output.get("short_txt"):
        assert lineage, "parameter 'lineage' is required to output 'short_tsv'"
        shell(
            "cat {tmpdir}/output/short_summary.specific.{lineage}.output.txt > {snakemake.output.short_txt:q}"
        )
    if snakemake.output.get("short_json"):
        assert lineage, "parameter 'lineage' is required to output 'short_json'"
        shell(
            "cat {tmpdir}/output/short_summary.specific.{lineage}.output.json > {snakemake.output.short_json:q}"
        )
    if snakemake.output.get("full_table"):
        assert lineage, "parameter 'lineage' is required to output 'full_table'"
        shell(
            "cat {tmpdir}/output/run_{lineage}/full_table.tsv > {snakemake.output.full_table:q}"
        )
    if snakemake.output.get("miss_list"):
        assert lineage, "parameter 'lineage' is required to output 'miss_list'"
        shell(
            "cat {tmpdir}/output/run_{lineage}/missing_busco_list.tsv > {snakemake.output.miss_list:q}"
        )
    if snakemake.output.get("out_dir"):
        shell("mv {tmpdir}/output {snakemake.output.out_dir:q}")
    if snakemake.output.get("dataset_dir"):
        shell("mv {dataset_dir} {snakemake.output.dataset_dir:q}")
```