Pipeline to pull microbial reads from WGS data and perform metagenomic analysis
DOI for manuscript: https://doi.org/10.12688/wellcomeopenres.19155.1
Please follow the tutorial in my Jupyter Book, available here: https://aidanfoo96.github.io/MINUUR/, to reproduce my analysis or to apply the pipeline to your host of interest :)
MINUUR is a Snakemake pipeline I developed to extract non-host sequencing reads from mosquito whole genome sequencing (WGS) data and apply a range of metagenomic analyses to characterise potential host-associated microbes. It can also be applied to WGS data from other hosts. MINUUR aims to leverage pre-existing WGS data to recover microbial information pertaining to host-associated microbiomes; a sketch of the core read-extraction and classification steps follows the tool list below.
MINUUR utilises:
- KRAKEN2: classify taxa from unmapped read sequences
- KrakenTools: extract classified reads for downstream analysis
- BRACKEN: re-estimate taxonomic abundance from KRAKEN2 classifications
- MetaPhlAn3: classify taxa using marker genes
- MEGAHIT: assemble metagenomes from unmapped reads
- QUAST: generate assembly statistics from MEGAHIT assemblies
- MetaBat2: bin contiguous sequences from MEGAHIT assemblies
- CheckM: assess the quality of MetaBat2 bins
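For orientation, here is a minimal sketch of the host-read removal and classification steps, mirroring the commands in the pipeline's own rules (see Code Snippets below). File names and the database path are placeholders; the real pipeline wires these steps together through Snakemake rules and its configuration.

```bash
# Hypothetical file names; MINUUR derives these from its config.
# Keep read pairs where both mates failed to map to the host genome
# (-f 12) and drop secondary alignments (-F 256).
samtools view -b -f 12 -F 256 sample_host_aligned.bam > sample_unmapped.bam

# Convert the unmapped pairs back to FASTQ for classification.
bamToFastq -i sample_unmapped.bam -fq sample_1.fastq -fq2 sample_2.fastq

# Classify the non-host reads against a KRAKEN2 database.
kraken2 --db /path/to/kraken2_db --paired \
    --report sample_kraken_report.txt --output sample_kraken_out.txt \
    sample_1.fastq sample_2.fastq
```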
Installation of Snakemake
MINUUR is run using the workflow manager Snakemake. Snakemake is best installed using the package manager Mamba. Once Mamba is installed, run:

```bash
mamba create -c bioconda -c conda-forge --name snakemake snakemake
```
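Activate the environment before invoking the pipeline; a quick version check confirms the install:

```bash
mamba activate snakemake
snakemake --version
```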
Installation of MINUUR
Clone the repository and change into its workflow directory:

```bash
git clone https://github.com/aidanfoo96/MINUUR/
cd MINUUR/workflow
```

This is the reference point from which the pipeline will be run. See the Jupyter Book for a full tutorial on establishing the configuration to run this pipeline.
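Once the configuration is in place, a typical invocation from `MINUUR/workflow` looks like the sketch below. The core count is arbitrary, and `--use-conda` assumes you want Snakemake to manage each rule's software environment; check the Jupyter Book for the exact options used in the tutorial.

```bash
snakemake --cores 4 --use-conda
```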
Update 09/05/2023:
- Added GitHub Actions
- A dummy dataset is now included in workflow/data, and a tutorial for running it is included in the Jupyter Book. Use this to ensure the pipeline works on your machine (see the dry-run sketch below).
- Added the option to run BUSCO to help assess eukaryotic contamination in MAGs
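Before a full run on the dummy data, a Snakemake dry run is a cheap way to confirm the configuration resolves; this is standard Snakemake rather than anything MINUUR-specific:

```bash
# -n (dry run) prints the jobs that would execute without running them;
# -p also prints each job's shell command.
snakemake -np
```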
For any feedback or bug reports, please open an issue or contact: aidan.foo@lstmed.ac.uk
Code Snippets
The rule directives below are reproduced from the workflow, with catalogue line numbers stripped; brief comments flag the less obvious commands.

```
# Input checks, read QC and trimming
script: "../scripts/check_inputs.py"

wrapper: "v0.80.1/bio/fastqc"

wrapper: "v1.21.6/bio/cutadapt/pe"

wrapper: "v0.80.1/bio/fastqc"

# Host alignment statistics
wrapper: "0.79.0/bio/samtools/stats"

# Pull selected summary lines from the samtools stats output
shell: "sed -n '8p;14p;16p' {input.sample} > {output.txt}"

# Keep read pairs where both mates are unmapped (-f 12), dropping secondary alignments (-F 256)
shell: "samtools view -b -f 12 -F 256 {input.sample} > {output.bam} 2> {log}"

shell: "bamToFastq -i {input.sample} -fq {output.fastq1} -fq2 {output.fastq2}"

shell: "sed -n '8p;14p;16p' {input.sample} > {output.txt}"

# Keep unmapped reads (-f 4)
shell: "samtools view -b -f 4 {input} > {output.bam} 2> {log}"

shell: "bamToFastq -i {input.sample} -fq {output.fastq1} -fq2 {output.fastq2}"

# Extract reads classified under a given taxon (and its children) with KrakenTools
shell:
    """
    extract_kraken_reads.py -k {input.krak_file} \
    -s1 {input.r1} -s2 {input.r2} \
    --output {output.read1} \
    --output2 {output.read2} \
    --fastq-output \
    --taxid {params.taxa} \
    --include-children \
    --report {input.krak_reprt} 2> {log}
    """

shell: "kreport2mpa.py -r {input.krak_file} -o {output.mpa_txt} --display-header"

shell: "combine_mpa.py -i {input.mpa_files} -o {output.combined_report}"

script: "../scripts/generate_kraken_summaries.R"

script: "../scripts/combined_kraken_alignment_stat_ffq.R"

shell: "sed -n '1p;2p' {input.krak_mpa_report} > {output.classified_summaries}"

script: "../scripts/plot_classifiedVsunclassified_reads.R"

script: "../scripts/plot_kraken_heatmaps.R"

script: "../scripts/plot_kraken_spatial.R"

# MetaPhlAn post-processing
shell: "merge_metaphlan_tables.py {input.cln_out} > {output.concat_tbl}"

shell: "grep 's__\|UNKNOWN' {input.txt_out} | cut -f1,3 > {output.tbl_out}"

script: "../scripts/generate_metaphlan_summaries.R"

script: "../scripts/plot_bracken_results.R"

# Host genome indexing and alignment
wrapper: "0.80.2/bio/bwa/index"

wrapper: "v1.17.3/bio/bwa/mem"

# Bin quality plots and BUSCO assessment
script: "../scripts/plot_checkm_QA.R"

wrapper: "v1.28.0/bio/busco"
```
../scripts/check_inputs.py:

```python
import os

import numpy as np
import pandas as pd

sample_list = pd.read_csv(snakemake.params['sample_list'], sep="\t")
automatic_input = snakemake.params['automatic_input']

# Check sample names given in the fastq_samples.tsv file match the given fastq files
if automatic_input is False:
    sample_table = pd.read_csv(snakemake.params['sample_table'], sep="\t")
    assert np.isin(sample_table['sampleID'], sample_list['sampleID']).all(), \
        "samples in your sample table are not present in your sample list. Please ensure all samples match"
else:
    for sample in sample_list['sampleID']:
        for num in [1, 2]:
            fastq_sample = f"../resources/{sample}_{num}.fastq.gz"
            assert os.path.isfile(fastq_sample), \
                "your sample names given in 'samples_fastq.tsv' do not match your fastq files. Please rename your files accordingly"

print("Inputs Passed :D")
```
../scripts/combined_kraken_alignment_stat_ffq.R:

```r
library(tidyverse)
library(MetBrewer)

#### Load Data ####
concatenated_alignments <- read_tsv(snakemake@input[["alignment_stats"]],
                                    col_names = c("SN", "sequence", "num_sequences", "file_samples"))

concatenated_alignments$file_samples[concatenated_alignments$file_samples == "# excluding supplementary and secondary reads"] <- ""

#### Get List of Sample Names ####
sample_names <- concatenated_alignments %>%
  separate(file_samples, c("file", "sample"), sep = "stats_ffq/") %>%
  separate(sample, c("sample", "junk"), sep = "_align") %>%
  select(sequence, num_sequences, sample) %>%
  group_by(sample) %>%
  summarise(n()) %>%
  select(sample) %>%
  filter(sample != "NA")

sample_names <- as.vector(sample_names$sample)

#### Clean Data ####
concatenated_alignments_clean <- concatenated_alignments %>%
  select(!SN) %>%
  separate(file_samples, c("file", "sample"), sep = "stats_ffq/") %>%
  separate(sample, c("sample", "junk"), sep = "_align") %>%
  select(sequence, num_sequences, sample) %>%
  mutate(sample = replace(sample, is.na(sample), sample_names))

##### Generate Summary Tables #####
# Wider, 'human readable' format with proportions
concatenated_alignments_clean_wide <- concatenated_alignments_clean %>%
  group_by(sample) %>%
  pivot_wider(names_from = sequence, values_from = num_sequences) %>%
  mutate(proportion_mapped = `reads mapped:` / `raw total sequences:` * 100)

write.table(concatenated_alignments_clean_wide,
            file = snakemake@output[["human_read_tbl"]],
            sep = "\t", row.names = FALSE)

write.table(concatenated_alignments_clean,
            file = snakemake@output[["long_read_tbl"]],
            sep = "\t", row.names = FALSE)

#### Plot Summaries ####
alignment_statistics_plot <- concatenated_alignments_clean %>%
  ggplot() +
  aes(x = sample, y = num_sequences, fill = sequence) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_bw(base_size = 15) +
  scale_fill_manual(values = met.brewer("Redon", n = 3, type = "discrete"),
                    labels = c("Raw Read Number", "Reads Mapped", "Reads Unmapped")) +
  xlab("Sample") +
  ylab("Read Number") +
  theme(legend.title = element_blank(),
        axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
        legend.position = "top")

ggsave(snakemake@output[["alignment_stat_plot"]], alignment_statistics_plot)
```
../scripts/generate_kraken_summaries.R:

```r
library(tidyverse)

genus_delim <- function(){
  if(snakemake@params[["kraken_db_type"]]){
    return("g__")
  }else{
    return("g__g__")
  }
}

spp_delim <- function(){
  if(snakemake@params[["kraken_db_type"]]){
    return("s__")
  }else{
    return("s__s__")
  }
}

genus <- genus_delim()
spp <- spp_delim()

data <- read_tsv(snakemake@input[["combined_report"]])

##### Get Domain #####
kingdom_data <- data %>%
  filter(!grepl("p__|g__|s__|c__|o__|f__", `#Classification`)) %>%
  separate(`#Classification`, c("junk", "kingdom"), sep = "^k__") %>%
  select(!junk)

kingdom_data_long <- kingdom_data %>%
  pivot_longer(cols = !kingdom, names_to = "study", values_to = "read_number")

##### Get genus #####
genus_data <- data %>%
  filter(grepl("g__", `#Classification`)) %>%
  separate(`#Classification`, c("junk", "genus"), sep = genus) %>%
  select(!junk) %>%
  filter(!grepl("s__", genus))

genus_data_long <- genus_data %>%
  pivot_longer(cols = !genus, names_to = "study", values_to = "read_number")

##### Get species #####
species_data <- data %>%
  filter(grepl("s__", `#Classification`)) %>%
  separate(`#Classification`, c("junk", "species"), sep = spp) %>%
  select(!junk)

species_data_long <- species_data %>%
  pivot_longer(cols = !species, names_to = "study", values_to = "read_number")

##### How many reads per study? #####
total_reads_per_study <- species_data_long %>%
  group_by(study) %>%
  summarise(total_reads = sum(read_number))

classified_reads_plot <- total_reads_per_study %>%
  ggplot() +
  aes(x = study, y = total_reads) +
  geom_col() +
  theme_classic() +
  coord_flip()

##### Species classification passed a specific threshold (unfinished) #####
species_data_filtered <- species_data_long %>%
  filter(read_number > 1000)

genus_data_filtered <- genus_data_long %>%
  filter(read_number > 1000)

#### Export Dataframes to TSV files ####
write.table(kingdom_data_long, file = snakemake@output[["kingdom_table"]], sep = "\t", row.names = FALSE)
write.table(genus_data_long, file = snakemake@output[["genus_table"]], sep = "\t", row.names = FALSE)
write.table(species_data_long, file = snakemake@output[["spp_table"]], sep = "\t", row.names = FALSE)
write.table(total_reads_per_study, file = snakemake@output[["classified_reads"]], sep = "\t", row.names = FALSE)

#### Export Classified Reads Plot ####
pdf(snakemake@output[["classified_reads_plot"]])
print(classified_reads_plot)
dev.off()
```
../scripts/generate_metaphlan_summaries.R:

```r
library(tidyverse)

metaphlan_tbl <- read_tsv(snakemake@input[["concat_tbl"]], skip = 1) %>%
  select(!NCBI_tax_id)

#### Get Kingdom ####
metaphlan_tbl_king <- metaphlan_tbl %>%
  filter(!grepl("p__|g__|s__|c__|o__|f__", clade_name)) %>%
  mutate_at("clade_name", str_replace, "k__", "")

metaphlan_tbl_king_long <- metaphlan_tbl_king %>%
  pivot_longer(cols = !clade_name, names_to = "samples", values_to = "rel_abund")

#### Get Genus ####
metaphlan_tbl_genus <- metaphlan_tbl %>%
  filter(grepl("g__", clade_name)) %>%
  separate(clade_name, c("junk", "genus"), sep = "g__") %>%
  select(!junk) %>%
  filter(!grepl("s__", genus))

metaphlan_tbl_genus_long <- metaphlan_tbl_genus %>%
  pivot_longer(cols = !genus, names_to = "samples", values_to = "rel_abund")

#### Get Species ####
metaphlan_tbl_spp <- metaphlan_tbl %>%
  filter(grepl("s__", clade_name)) %>%
  separate(clade_name, c("junk", "species"), sep = "s__") %>%
  select(!junk)

metaphlan_tbl_spp_long <- metaphlan_tbl_spp %>%
  pivot_longer(cols = !species, names_to = "samples", values_to = "rel_abund")

#### Export Dataframes to TSV files ####
write.table(metaphlan_tbl_king_long, file = snakemake@output[["kingdom_table"]], sep = "\t", row.names = FALSE)
write.table(metaphlan_tbl_genus_long, file = snakemake@output[["genus_table"]], sep = "\t", row.names = FALSE)
write.table(metaphlan_tbl_spp_long, file = snakemake@output[["spp_table"]], sep = "\t", row.names = FALSE)
```
../scripts/plot_bracken_results.R:

```r
library(tidyverse)

# Get colour palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# Set Read Filter
filter <- as.character(snakemake@params[["bracken_threshold"]])
filter <- as.numeric(filter)

#### Load Data ####
conc_data <- read_tsv(snakemake@input[["conc_input"]],
                      col_names = c("species", "taxa_id", "taxa_lvl",
                                    "kraken_assigned_reads", "added_reads",
                                    "new_est_reads", "fraction_total", "file_path")) %>%
  filter(species != "name")

conc_data_clean <- conc_data %>%
  separate(file_path, c("file", "sample"), sep = "out/") %>%
  separate(sample, c("sample", "junk"), sep = "_bracken") %>%
  select(sample, species, taxa_id, taxa_lvl, kraken_assigned_reads,
         added_reads, new_est_reads, fraction_total) %>%
  mutate_at(c("taxa_id", "kraken_assigned_reads", "added_reads",
              "new_est_reads", "fraction_total"), as.numeric)

#### Added Reads Across Samples ####
sample_summary <- conc_data_clean %>%
  group_by(sample) %>%
  summarise(kraken_reads = sum(kraken_assigned_reads),
            added_reads = sum(added_reads),
            total_new_reads = sum(new_est_reads))

sample_summary_long <- sample_summary %>%
  pivot_longer(cols = !sample, names_to = "classification_type", values_to = "num_reads")

sample_summary_plot <- sample_summary %>%
  pivot_longer(cols = !sample, names_to = "classification_type", values_to = "num_reads") %>%
  filter(classification_type != "total_new_reads") %>%
  ggplot() +
  aes(reorder(x = sample, desc(num_reads)), y = num_reads, fill = classification_type) +
  geom_bar(stat = "identity", position = "stack") +
  coord_flip() +
  scale_fill_manual(values = cbPalette) +
  theme_classic(base_size = 25) +
  xlab("Sample") +
  ylab("Number of Reads")

ggsave(snakemake@output[["added_reads_plot"]], height = 15, width = 20)

#### Stratified Heatmap of Relative Abundance Per Sample ####
conc_data_clean$species <- sub(" ", "_", conc_data_clean$species)

conc_data_clean_sep <- conc_data_clean %>%
  separate(species, into = c("Genus", "Species"), sep = "_") %>%
  filter(kraken_assigned_reads > filter)

conc_data_clean_genus_group <- conc_data_clean_sep %>%
  group_by(Genus) %>%
  summarise(n())

plot_stratified_heatmap <- function(){
  for(i in conc_data_clean_genus_group$Genus) {
    print(
      conc_data_clean_sep %>%
        filter(Genus == i) %>%
        ggplot() +
        aes(x = sample, y = Species) +
        geom_tile(aes(fill = new_est_reads), linejoin = "round") +
        scale_fill_gradientn(colours = c("white", "#F0E442", "#E69F00")) +
        facet_grid(rows = "Genus") +
        theme_classic(base_size = 25) +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
        xlab("Sample"))
  }
}

pdf(snakemake@output[["bracken_stratified_heatmap"]], height = 20, width = 20)
plot_stratified_heatmap()
dev.off()

##### Heatmap of Species #####
plot_heatmap_per_sample <- function(data, filter_thresh){
  data %>%
    filter(new_est_reads > filter_thresh) %>%
    select(sample, species, fraction_total) %>%
    ggplot() +
    aes(x = sample, y = species) +
    geom_tile(aes(fill = log(fraction_total))) +
    scale_fill_gradientn(colours = c("white", "orange", "blue")) +
    theme_classic() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
    ggtitle(paste("Log Relative Abundance of Microbial Species filter: Filtered > ", filter_thresh, sep = ""))
}

pdf(snakemake@output[["bracken_overall_heatmap"]], height = 20, width = 20)
plot_heatmap_per_sample(conc_data_clean, filter)
dev.off()
```
../scripts/plot_checkm_QA.R:

```r
library(tidyverse)
library(MetBrewer)

# Load Appropriate Columns
columns <- c("bin", "marker_lineage", "num_genomes", "num_markers", "num_marker_sets",
             "completeness", "contamination", "strain_heterogen", "genome_size_bp",
             "num_ambiguous_bases", "num_scaffolds", "num_contigs", "N50_scaffold",
             "N50_contig", "mean_scaffold_length", "mean_contig_length",
             "longest_scaffold", "longest_contig", "GC", "GC_std", "coding_density",
             "translation_table", "num_predicted_genes", "0", "1", "2", "3", "4", "5+",
             "sample")

# Read Data, skip first row and remove the redundant header columns from previous step
bin_stats <- read_tsv(snakemake@input[["checkm_conc_result"]], col_names = columns, skip = 1) %>%
  filter(bin != "Bin Id")

# Convert to Numeric Columns
bin_stats$completeness <- as.numeric(bin_stats$completeness)
bin_stats$contamination <- as.numeric(bin_stats$contamination)
bin_stats$genome_size_bp <- as.numeric(bin_stats$genome_size_bp)
bin_stats$num_contigs <- as.numeric(bin_stats$num_contigs)
bin_stats$mean_contig_length <- as.numeric(bin_stats$mean_contig_length)
bin_stats$N50_contig <- as.numeric(bin_stats$N50_contig)

# Get breaks for plotting
bin_stats$col <- cut(bin_stats$completeness,
                     breaks = c(-Inf, 80, Inf),
                     labels = c("<80", ">=80"))

bin_stats$contig_cut <- cut(bin_stats$num_contigs,
                            breaks = c(-Inf, 250, Inf),
                            labels = c("<=250", ">250"))

# Plot Completeness Vs Contamination
CompletenessVsContam <- bin_stats %>%
  ggplot() +
  aes(x = completeness, y = contamination, size = genome_size_bp, col = col) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(limits = c(0, 100)) +
  theme_minimal() +
  scale_color_manual(values = c("blue", "orange"))

pdf(snakemake@output[["CompletenessVsContam"]])
print(CompletenessVsContam)
dev.off()

#### Plot Completeness Vs Contamination Per Sample
bin_stats_sample_split <- bin_stats %>%
  separate(bin, into = c("sample", "bin_num"), sep = "\\.")

compvscontam_samp <- bin_stats_sample_split %>%
  ggplot() +
  aes(x = completeness, y = contamination, col = sample) +
  geom_point(alpha = 0.8, size = 4) +
  scale_x_continuous(limits = c(0, 100)) +
  geom_vline(xintercept = 80, linetype = "dotted", color = "green", size = 1) +
  geom_vline(xintercept = 50, linetype = "dotted", color = "orange", size = 1) +
  theme_bw(base_size = 18) +
  scale_color_manual(values = met.brewer("Redon", 10, type = "discrete")) +
  xlab("Completeness") +
  ylab("Contamination") +
  labs(colour = "Sample") +
  theme(legend.position = "bottom",
        legend.direction = "horizontal",
        legend.text = element_text(size = 10))

pdf(snakemake@output[["CompletenessVsContam_Per_Sample"]])
print(compvscontam_samp)
dev.off()

#### Plot Bar Chart with Completeness vs Contamination Per Sample
bin_stats_bar <- bin_stats_sample_split %>%
  unite("sample_bin", sample:bin_num) %>%
  select(completeness, contamination, sample_bin) %>%
  pivot_longer(c(completeness, contamination), names_to = "bin_rank", values_to = "value") %>%
  ggplot() +
  aes(reorder(x = sample_bin, desc(value)), y = value, fill = bin_rank) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_bw(base_size = 15) +
  coord_flip() +
  scale_fill_manual(values = met.brewer("Redon", 2, type = "discrete"),
                    labels = c("Completeness", "Contamination")) +
  geom_hline(yintercept = 80, linetype = "dotted", color = "green", size = 1) +
  geom_hline(yintercept = 50, linetype = "dotted", color = "orange", size = 1) +
  xlab("Sample") +
  ylab("Contamination and Completion Score (%)") +
  labs(fill = "CheckM Completion vs Contamination Score") +
  theme(legend.position = "bottom", legend.direction = "vertical")

pdf(snakemake@output[["BarChartCompletenessContamination"]])
print(bin_stats_bar)
dev.off()

#### Plot the number of contigs vs completeness
NumContigsVsCompleteness <- bin_stats %>%
  ggplot() +
  aes(x = num_contigs, y = completeness, size = mean_contig_length, col = contig_cut) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  scale_color_manual(values = c("blue", "orange"))

pdf(snakemake@output[["NumContigsVsCompleteness"]])
print(NumContigsVsCompleteness)
dev.off()
```
../scripts/plot_classifiedVsunclassified_reads.R:

```r
library(tidyverse)
library(MetBrewer)

#### Import Kraken Data
concatenated_kraken_summaries <- read_tsv(snakemake@input[["concatenated_kraken_summary"]],
                                          col_names = c("proportion_of_reads", "number_of_reads",
                                                        "number_of_reads2", "read_code", "read_code2",
                                                        "classification_status", "file_path"))

concatenated_kraken_summaries_clean <- concatenated_kraken_summaries %>%
  separate(file_path, into = c("junk", "filename"), sep = "classified_summary/") %>%
  separate(filename, into = c("sample", "junk2"), sep = "_classified_summary.txt") %>%
  select(sample, proportion_of_reads, number_of_reads, classification_status)

#### Plot Proportion of Classified vs Unclassified Reads from Kraken
plot_kraken_proportion <- function(kraken_table){
  concatenated_kraken_summaries_clean_plot <- kraken_table %>%
    ggplot() +
    aes(x = sample, y = number_of_reads, fill = classification_status) +
    geom_bar(stat = "identity", position = "fill") +
    theme_bw(base_size = 15) +
    scale_fill_manual(values = met.brewer("Redon", n = 2, type = "continuous"),
                      labels = c("Classified Reads", "Unclassified Reads")) +
    xlab("Sample") +
    ylab("Proportion of Reads") +
    theme(legend.title = element_blank(),
          axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
          legend.position = "top")

  return(concatenated_kraken_summaries_clean_plot)
}

kraken_classification_proportions <- plot_kraken_proportion(concatenated_kraken_summaries_clean)

#### Save file
ggsave(snakemake@output[["classified_proportions"]], kraken_classification_proportions)
```
../scripts/plot_kraken_heatmaps.R:

```r
library(tidyverse)
library(MetBrewer)

##### Import Data #####
spp_tbl <- read_tsv(snakemake@input[["spp_table"]])
genus_tbl <- read_tsv(snakemake@input[["genus_table"]])

##### Define User Read Threshold #####
filter_threshold <- as.character(snakemake@params[["strat_thresh"]])
filter_threshold <- as.numeric(filter_threshold)

genus_thres <- as.character(snakemake@params[["genus_thresh"]])
genus_thres <- as.numeric(genus_thres)

spp_thres <- as.character(snakemake@params[["species_thresh"]])
spp_thres <- as.numeric(spp_thres)

##### Clean and Prep Data #####
genus_tbl_clean <- genus_tbl %>%
  separate(study, into = c("sample", "junk"), sep = "_") %>%
  select(!junk) %>%
  rename("phylo_level" = genus)

spp_tbl_clean <- spp_tbl %>%
  separate(study, into = c("sample", "junk"), sep = "_") %>%
  select(!junk) %>%
  rename("phylo_level" = species)

##### Plot Function #####
kraken_heatmap <- function(clean_kraken_table, read_thresh, type){
  heatmap <- clean_kraken_table %>%
    filter(read_number > read_thresh) %>%
    filter(phylo_level != "Homo_sapiens") %>%
    ggplot() +
    aes(x = sample, y = phylo_level) +
    geom_tile(aes(fill = log(read_number)), linejoin = "round") +
    scale_fill_gradientn(colours = met.brewer("Hiroshige", 4, type = "discrete")) +
    theme_bw(base_size = 20) +
    theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
          legend.position = "top",
          legend.text = element_text(size = 15),
          legend.title = element_text(size = 15)) +
    labs(fill = "Read Number (log)") +
    xlab("Sample") +
    ylab(type)

  return(heatmap)
}

species_heatmap <- kraken_heatmap(spp_tbl_clean, spp_thres, "Species")
genus_heatmap <- kraken_heatmap(genus_tbl_clean, genus_thres, "Genus")

ggsave(snakemake@output[["genus_heatmap"]], genus_heatmap, height = 20, width = 15)
ggsave(snakemake@output[["species_heatmap"]], species_heatmap, height = 20, width = 15)

##### Faceted Plot By Genus #####
spp_tbl_split <- spp_tbl_clean %>%
  separate(phylo_level, into = c("Genus", "Species1", "Species2"), sep = "_") %>%
  unite("Species", Species1:Species2, na.rm = TRUE)

genus_list <- spp_tbl_split %>%
  group_by(Genus) %>%
  summarise(n())

plot_species_facets <- function() {
  for(i in genus_list$Genus) {
    print(
      spp_tbl_split %>%
        filter(Genus == i) %>%
        ggplot() +
        aes(x = sample, y = Species) +
        geom_tile(aes(fill = read_number), linejoin = "round") +
        scale_fill_gradientn(colours = c("white", "#F0E442", "#E69F00")) +
        facet_grid(rows = "Genus") +
        theme_classic(base_size = 25) +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
    )
  }
}

pdf(snakemake@output[["stratified_heatmap"]], height = 20, width = 20)
plot_species_facets()
dev.off()
```
../scripts/plot_kraken_spatial.R:

```r
library(tidyverse)
library(MetBrewer)
library(treemapify)

##### Import Data #####
spp_tbl <- read_tsv(snakemake@input[["spp_table"]])
genus_tbl <- read_tsv(snakemake@input[["genus_table"]])

##### Define User Read Threshold #####
filter_threshold <- as.character(snakemake@params[["strat_thresh"]])
filter_threshold <- as.numeric(filter_threshold)

genus_thres <- as.character(snakemake@params[["genus_thresh"]])
genus_thres <- as.numeric(genus_thres)

spp_thres <- as.character(snakemake@params[["species_thresh"]])
spp_thres <- as.numeric(spp_thres)

##### Clean and Prep Data #####
genus_tbl_clean <- genus_tbl %>%
  separate(study, into = c("sample", "junk"), sep = "_") %>%
  select(!junk) %>%
  rename("phylo_level" = genus)

spp_tbl_clean <- spp_tbl %>%
  separate(study, into = c("sample", "junk"), sep = "_") %>%
  select(!junk) %>%
  rename("phylo_level" = species)

##### Plot Spatial Plots #####
plot_spatial_classification <- function(classification_tbl, read_thresh){
  tbl_clean_sum <- classification_tbl %>%
    group_by(phylo_level) %>%
    summarise(summed_read_num = sum(read_number)) %>%
    filter(summed_read_num > read_thresh)

  spatial_plot <- tbl_clean_sum %>%
    filter(phylo_level != "Homo_sapiens") %>%
    filter(summed_read_num > read_thresh) %>%
    ggplot(aes(area = summed_read_num, fill = summed_read_num,
               label = phylo_level, subgroup = phylo_level)) +
    geom_treemap() +
    geom_treemap_subgroup_border(colour = "white", size = 1) +
    geom_treemap_text(colour = "white", place = "center", reflow = T) +
    scale_fill_gradientn(colours = met.brewer("Hiroshige", 10, type = "discrete"),
                         name = "Total Read Number") +
    theme_minimal(base_size = 15) +
    theme(legend.position = "right")

  return(spatial_plot)
}

species_spatial_plot <- plot_spatial_classification(spp_tbl_clean, spp_thres)
genus_spatial_plot <- plot_spatial_classification(genus_tbl_clean, genus_thres)

ggsave(snakemake@output[["genus_spatial_plot"]], genus_spatial_plot, height = 20, width = 15)
ggsave(snakemake@output[["species_spatial_plot"]], species_spatial_plot, height = 20, width = 15)
```
samtools stats wrapper (0.79.0/bio/samtools/stats):

```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell("samtools stats {extra} {snakemake.input} {region} > {snakemake.output} {log}")
```
bwa index wrapper (0.80.2/bio/bwa/index):

```python
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be input!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")

if len(prefix) > 0:
    prefix = "-p " + prefix

# Construction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index"
    " {prefix}"
    " {construction_algorithm}"
    " {snakemake.input[0]}"
    " {log}"
)
```
fastqc wrapper (v0.80.1/bio/fastqc):

```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""
    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)
    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
```
bwa mem wrapper (v1.17.3/bio/bwa/mem):

```python
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "koester@jimmy.harvard.edu, julianderuiter@gmail.com"
__license__ = "MIT"


import tempfile
from os import path

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts


# Extract arguments.
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
sort = snakemake.params.get("sorting", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
samtools_opts = get_samtools_opts(snakemake)
java_opts = get_java_opts(snakemake)

index = snakemake.input.idx
if isinstance(index, str):
    index = path.splitext(snakemake.input.idx)[0]
else:
    index = path.splitext(snakemake.input.idx[0])[0]

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view {samtools_opts}"
elif sort == "samtools":
    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"
    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}"
elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}"
else:
    raise ValueError(f"Unexpected value for params.sort ({sort})")

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(bwa mem"
        " -t {snakemake.threads}"
        " {extra}"
        " {index}"
        " {snakemake.input.reads}"
        " | " + pipe_cmd + ") {log}"
    )
```
cutadapt paired-end wrapper (v1.21.6/bio/cutadapt/pe):

```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell

n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " --cores {snakemake.threads}"
    " {adapters}"
    " {extra}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)
```
BUSCO wrapper (v1.28.0/bio/busco):

```python
__author__ = "Tessa Pierce"
__copyright__ = "Copyright 2018, Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"

import tempfile
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
mode = snakemake.params.get("mode")
assert mode in [
    "genome",
    "transcriptome",
    "proteins",
], "invalid run mode: only 'genome', 'transcriptome' or 'proteins' allowed"
lineage = lineage_opt = snakemake.params.get("lineage", "")
if lineage_opt:
    lineage_opt = f"--lineage {lineage_opt}"

with tempfile.TemporaryDirectory() as tmpdir:
    dataset_dir = snakemake.input.get("dataset_dir", "")
    if not dataset_dir:
        dataset_dir = f"{tmpdir}/dataset"

    shell(
        "busco"
        " --cpu {snakemake.threads}"
        " --in {snakemake.input}"
        " --mode {mode}"
        " {lineage_opt}"
        " {extra}"
        " --download_path {dataset_dir}"
        " --out_path {tmpdir}"
        " --out output"
        " {log}"
    )

    if snakemake.output.get("short_txt"):
        assert lineage, "parameter 'lineage' is required to output 'short_tsv'"
        shell(
            "cat {tmpdir}/output/short_summary.specific.{lineage}.output.txt > {snakemake.output.short_txt:q}"
        )
    if snakemake.output.get("short_json"):
        assert lineage, "parameter 'lineage' is required to output 'short_json'"
        shell(
            "cat {tmpdir}/output/short_summary.specific.{lineage}.output.json > {snakemake.output.short_json:q}"
        )
    if snakemake.output.get("full_table"):
        assert lineage, "parameter 'lineage' is required to output 'full_table'"
        shell(
            "cat {tmpdir}/output/run_{lineage}/full_table.tsv > {snakemake.output.full_table:q}"
        )
    if snakemake.output.get("miss_list"):
        assert lineage, "parameter 'lineage' is required to output 'miss_list'"
        shell(
            "cat {tmpdir}/output/run_{lineage}/missing_busco_list.tsv > {snakemake.output.miss_list:q}"
        )
    if snakemake.output.get("out_dir"):
        shell("mv {tmpdir}/output {snakemake.output.out_dir:q}")
    if snakemake.output.get("dataset_dir"):
        shell("mv {dataset_dir} {snakemake.output.dataset_dir:q}")
```