Score and compare multiple bacterial genome assemblies (Illumina, Nanopore) to reference genome(s)
A snakemake-wrapper for evaluating de novo bacterial genome assemblies, e.g. from Oxford Nanopore (ONT) or Illumina sequencing.
See the preprint here: Snakemake Workflows for Long-read Bacterial Genome Assembly and Evaluation, Preprints.org 2022
The workflow includes the following programs:
Installation
Clone repository, for example:
git clone https://github.com/pmenzel/score-assemblies.git /opt/software/score-assemblies
Create a new conda environment containing all necessary programs:
conda env create -n score-assemblies --file /opt/software/score-assemblies/env/environment.yaml
and activate the environment:
conda activate score-assemblies
Usage
First, prepare a data folder, which must contain subfolders
assemblies/
containing the
assemblies.
Additionally, the sub-folders
references/
and
references-protein/
can contain reference genomes and reference proteins with which the assemblies and predicted proteins will be compared.
For example:
.
├── assemblies
│ ├── example-mtb_flyehq4.fa
│ ├── example-mtb_flyehq4+medaka.fa
│ ├── example-mtb_flyehq.fa
│ ├── example-mtb_flyehq+racon4.fa
│ ├── example-mtb_flyehq+racon4+medaka.fa
│ ├── example-mtb_raven4.fa
│ ├── example-mtb_raven4+medaka.fa
│ ├── example-mtb_raven4+medaka+pilon.fa
│ └── example-mtb_unicycler.fa
├── references
│ └── AL123456.3.fa
└── references-protein └── AL123456.3.faa
NB: The assembly and reference fasta files need to have the
.fa
extension and protein reference fasta files need to have the extension
.faa
.
This is the same folder structure used by ont-assembly-snake , i.e. score-assemblies can be run directly in the same folder.
To run the workflow, e.g. with 20 threads, use this command:
snakemake -s /opt/software/score-assemblies/Snakefile --cores 20 --use-conda
Output files of each program will be written to various folders in
score-assemblies-data/
.
Modules
If no references are supplied, then only ideel and BUSCO are done, otherwise score-assemblies will run these programs on each assembly:
assess_assembly and assess_homopolymers
Each assembly will be compared against each reference genome using the
assess_assembly
and
assess_homopolymers
scripts from
pomoxis
. Additionally to the tables
and plots generated by these programs, summary plots for each reference genome will be plotted
in
score-assemblies-data/pomoxis/<reference>_assess_assembly_all_meanQ.pdf
.
BUSCO
Set the lineage via the snakemake call:
snakemake -s /opt/software/score-assemblies/Snakefile --cores 20 --config busco_lineage=bacillales
If not set, the default lineage
bacteria
will be used.
Available datasets can be listed with
busco --list-datasets
The number of complete, fragmented and missing BUSCOs per assembly is plotted in
score-assemblies-data/busco/busco_stats.pdf
.
dnadiff
Each assembly is compared with each reference and the output files will be
located in
score-assemblies-data/dnadiff/<reference>/<assembly>-dnadiff.report
. The values for
AvgIdentity
(from 1-to-1 alignments) and
TotalIndels
are extracted from these files and are plotted
for each reference in
score-assemblies-data/dnadiff/<reference>_dnadiff_stats.pdf
.
NucDiff
Each assembly is compared with each reference and the output files will be
located in the folder
score-assemblies-data/nucdiff/<reference>/<assembly>-nucdiff/
. The values for
Insertions
,
Deletions
, and
Substitutions
are extracted from the file
results/nucdiff_stat.out
and are plotted
for each reference in
score-assemblies-data/nucdiff/<reference>_nucdiff_stats.pdf
.
QUAST
One QUAST report is generated for each reference genome, containing the results for all assemblies.
The report files are located in
score-assemblies-data/quast/<reference>/report.html
.
ideel
Open reading frames are predicted from each assembly via Prodigal and are
search in the Uniprot sprot database with diamond, retaining the best alignment
for each ORF. For each assembly, the distribution of the ratios between length
of the ORF and the matching database sequence are plotted to
ideel/ideel_uniprot_histograms.pdf
and
ideel/ideel_uniprot_boxplots.pdf
.
Additionally, diamond alignments are done between the predicted ORFs and the supplied reference proteins and ratios are plotted to
score-assemblies-data/ideel/<reference>_ideel_histograms.pdf
and
score-assemblies-data/ideel/<reference>_ideel_boxplots.pdf
.
bakta
bakta is only run when specified as extra config argument in the snakemake call:
snakemake -s /opt/software/score-assemblies/Snakefile --cores 20 --use-conda --config bakta=1
The bakta outfiles files are written to in the folder
score-assemblies-data/bakta/<assembly>/
.
NB: It takes a long time to download the bakta database and run bakta on all assemblies.
Summary report
All measurements are summarized in a HTML page in
score-assemblies-report.html
.
Example report
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | suppressPackageStartupMessages(library(tidyverse)) dir <- ifelse(!is.na(snakemake@params[['out_dir']]), snakemake@params[['out_dir']], '.') filename_assess_assembly_all_scores_tsv <- paste0(dir, "/pomoxis/assess_assembly_all_scores.tsv") filename_assess_assembly_all_meanQ_pdf <- "assess_assembly_all_meanQ.pdf" # plot assess_assembly mean Q scores for each assembly and reference df <- read_tsv(filename_assess_assembly_all_scores_tsv, col_names = c("assembly", "reference", "Qscore", "percErr"), col_types = "ffdc" ) for (i_ref in unique(df$reference)) { df.plot <- df %>% filter(reference == i_ref) p <- df.plot %>% ggplot(aes(y = reorder(assembly, Qscore, FUN = max), x = Qscore)) + geom_point(shape = 21, size = 2, fill = "deepskyblue3") + theme_bw() + theme(legend.position = "bottom") + labs(title = "pomoxis / assess-assembly", subtitle = paste("Reference:", i_ref)) + ylab("") + xlab("Q-score ") + theme(axis.text = element_text(size = rel(0.75))) filename_out <- paste0(dir, "/pomoxis/", i_ref, "_", filename_assess_assembly_all_meanQ_pdf) writeLines(paste("Saving plot to", filename_out)) ggsave(p, filename = filename_out) } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | suppressPackageStartupMessages(library(tidyverse)) dir <- ifelse(!is.na(snakemake@params[['out_dir']]), snakemake@params[['out_dir']], '.') filename_busco_stats_tsv <- paste0(dir, "/busco/all_stats.tsv") filename_busco_pdf <- paste0(dir, "/busco/busco_stats.pdf") # plot busco df <- read_tsv(filename_busco_stats_tsv, col_names = c("assembly", "Complete", "Fragmented", "Missing", "n"), col_types = "fdddd" ) %>% gather(key = "measure", value = "percent", -assembly, -n) p <- df %>% ggplot(aes(y = reorder(assembly, percent, max), x = percent)) + geom_text(data = filter(df, measure == "Complete"), aes(color = measure, label = percent), size = 2.5, hjust=1.4) + geom_point(size = 2, aes(color = measure)) + theme_bw() + theme(legend.position = "bottom") + labs(title = "BUSCO") + xlab("Percent") + ylab("") + xlim(0, 100) + theme(axis.text.y = element_text(size = rel(0.75))) + scale_color_discrete(guide = guide_legend(title = "")) filename_out <- filename_busco_pdf writeLines(paste("Saving plot to", filename_out)) ggsave(p, filename = filename_out) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | suppressPackageStartupMessages(library(tidyverse)) suppressPackageStartupMessages(library(tidytext)) dir <- ifelse(!is.na(snakemake@params[['out_dir']]), snakemake@params[['out_dir']], '.') filename_dnadiff_stats_tsv <- paste0(dir, "/dnadiff/all_stats.tsv") filename_suffix_dnadiff_stats_tsv <- "_dnadiff_stats.pdf" # plot dnadiff stats for each reference df <- read_tsv(filename_dnadiff_stats_tsv, col_names = c("assembly", "reference", "measure", "value", "value2"), col_types = "fffdd" ) for (i_ref in unique(df$reference)) { df.plot <- df %>% filter(reference == i_ref) p <- df.plot %>% group_by(measure) %>% mutate(assembly = fct_reorder(assembly, value)) %>% ungroup() %>% ggplot(aes(y = assembly, x = value)) + geom_point(shape = 21, size = 2, fill = "deepskyblue3") + facet_wrap(~ measure, scales = "free_x") + theme_bw() + theme(legend.position = "bottom") + labs(title = "dnadiff", subtitle = paste("Reference:", i_ref)) + ylab("") + xlab("") + theme(axis.text.y = element_text(size = rel(0.75))) filename_out <- paste0(dir, "/dnadiff/", i_ref, filename_suffix_dnadiff_stats_tsv) writeLines(paste("Saving plot to", filename_out)) ggsave(p, filename = filename_out) } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 | suppressPackageStartupMessages(library(tidyverse)) dir <- ifelse(!is.na(snakemake@params[['out_dir']]), snakemake@params[['out_dir']], '.') # diamond for uniprot ------------------------------------------------------------------------------------- dir_ideel_diamond <- paste0(dir, "/ideel/diamond/") filename_ideel_hist_pdf <- paste0(dir, "/ideel/ideel_uniprot_histograms.pdf") filename_ideel_box_pdf <- paste0(dir, "/ideel/ideel_uniprot_boxplots.pdf") filelist <- list.files(path = dir_ideel_diamond, pattern = ".*\\.tsv", recursive = FALSE, full.names = FALSE) df_list <- vector("list", length(filelist)) for (i in seq_along(filelist)) { filename <- paste0(dir_ideel_diamond, filelist[[i]]) df <- read_tsv(filename, col_names = c("qseqid", "sseqid", "qlen", "slen"), col_types = "ccii" ) %>% mutate(assembly = filelist[[i]]) %>% mutate(assembly = str_remove(assembly, "\\.tsv")) %>% mutate(SCov = qlen / slen) df_list[[i]] <- df } df.all <- dplyr::bind_rows(df_list) %>% mutate(assembly = factor(assembly)) p <- ggplot(df.all, aes(x = SCov)) + geom_histogram(aes(color=assembly, fill=assembly), alpha = 0.8, position = "identity", binwidth = 0.01, size=0.2) + #geom_density(aes(fill = assembly), alpha = 0.4, color = "grey40") + facet_wrap(~assembly) + theme_bw() + ggtitle(paste("ideel with Uniprot Sprot")) + ylab("") + xlab("qlen / slen") + xlim(0.5, 1.5) + scale_color_discrete(guide = FALSE) + scale_fill_discrete(guide = FALSE) + theme(strip.text.x = element_text(size = 6)) + theme(axis.text = element_text(size = rel(0.75))) filename_out <- filename_ideel_hist_pdf writeLines(paste("Saving plot to", filename_out)) ggsave(p, filename = filename_out) p.box <- df.all %>% mutate(assembly = fct_reorder(assembly, SCov, mean)) %>% ggplot(aes(y = assembly, x = SCov)) + geom_boxplot(outlier.shape = NA) + #geom_histogram(aes(color=assembly, fill=assembly), alpha = 0.8, position = "identity", binwidth = 0.01, size=0.2) + #geom_density(aes(fill = assembly), alpha = 0.4, color = "grey40") + #facet_wrap(~assembly) + theme_bw() + ggtitle(paste("ideel with Uniprot Sprot")) + ylab("") + xlab("qlen / slen") + xlim(0.5, 1.5) + theme(axis.text = element_text(size = rel(0.75))) filename_out <- filename_ideel_box_pdf writeLines(paste("Saving plot to", filename_out)) ggsave(p.box, filename = filename_out) # diamond for references -------------------------------------------------------------------------------- dir_ideel_diamond_ref <- paste0(dir, "/ideel/diamond-ref/") ref_list <- list.dirs(path = dir_ideel_diamond_ref, recursive = FALSE, full.names = FALSE) for (r in seq_along(ref_list)) { filelist <- list.files(path = paste0(dir_ideel_diamond_ref,"/", ref_list[[r]]), pattern = ".*\\.tsv", recursive = FALSE, full.names = FALSE) df_list <- vector("list", length(filelist)) for (i in seq_along(filelist)) { filename <- paste0(dir_ideel_diamond_ref, "/", ref_list[[r]], "/", filelist[[i]]) df <- read_tsv(filename, col_names = c("qseqid", "sseqid", "qlen", "slen"), col_types = "ccii") %>% mutate(assembly = filelist[[i]]) %>% mutate(assembly = str_remove(assembly, "\\.tsv")) %>% mutate(assembly = str_remove(assembly, paste0("_",ref_list[[r]]))) %>% mutate(SCov = qlen / slen) df_list[[i]] <- df } df.all <- dplyr::bind_rows(df_list) %>% mutate(assembly = factor(assembly)) p <- ggplot(df.all, aes(x = SCov)) + geom_histogram(aes(color=assembly, fill=assembly), alpha = 0.8, position = "identity", binwidth = 0.01, size=0.2) + #geom_density(aes(fill = assembly), alpha = 0.4, color = "grey40") + facet_wrap(~assembly, scales = "free_x") + theme_bw() + labs(title = "ideel", subtitle = paste("Reference:", ref_list[[r]])) + ylab("") + xlab("qlen / slen") + xlim(0.5, 1.5) + scale_color_discrete(guide = FALSE) + scale_fill_discrete(guide = FALSE) + theme(strip.text.x = element_text(size = 6)) + theme(axis.text = element_text(size = rel(0.75))) filename_pdf <- paste0(dir, "/ideel/", ref_list[[r]], "_ideel_histograms.pdf") writeLines(paste("Saving plot to", filename_pdf)) ggsave(p, filename = filename_pdf) p.box <- df.all %>% mutate(assembly = fct_reorder(assembly, SCov, mean)) %>% ggplot(aes(y = assembly, x = SCov)) + geom_boxplot(outlier.shape = NA) + theme_bw() + labs(title = "ideel", subtitle = paste("Reference:", ref_list[[r]])) + ylab("") + xlab("qlen / slen") + xlim(0.5, 1.5) + theme(axis.text.y = element_text(size = rel(0.75))) filename_pdf <- paste0(dir, "/ideel/", ref_list[[r]], "_ideel_boxplots.pdf") writeLines(paste("Saving plot to", filename_pdf)) ggsave(p.box, filename = filename_pdf) } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | suppressPackageStartupMessages(library(tidyverse)) suppressPackageStartupMessages(library(tidytext)) dir <- ifelse(!is.na(snakemake@params[['out_dir']]), snakemake@params[['out_dir']], '.') filename_nucdiff_stats_tsv <- paste0(dir, "/nucdiff/all_stats.tsv") filename_suffix_nucdiff_stats_tsv <- "_nucdiff_stats.pdf" # plot nucdiff stats for each reference df <- read_tsv(filename_nucdiff_stats_tsv, col_names = c("assembly", "reference", "measure", "value"), col_types = "fffd" ) %>% mutate(measure = fct_relevel(measure, "Substitutions")) for (i_ref in unique(df$reference)) { df.plot <- df %>% filter(reference == i_ref) p <- df.plot %>% filter(reference == i_ref) %>% group_by(measure) %>% mutate(assembly = fct_reorder(assembly, -value)) %>% ungroup() %>% ggplot(aes(y = assembly, x = value)) + #geom_line(aes(group = reference), color = "grey70", size = 1) + geom_point(shape = 21, size = 2, fill = "deepskyblue3") + facet_wrap(~measure, scales = "free_x") + theme_bw() + theme(legend.position = "bottom") + labs(title = "nucdiff", subtitle = paste("Reference:", i_ref)) + ylab("") + xlab("") + theme( panel.grid.major.x = element_blank(), axis.text = element_text(size = rel(0.75)) ) filename_out <- paste0(dir, "/nucdiff/", i_ref, filename_suffix_nucdiff_stats_tsv) writeLines(paste("Saving plot to", filename_out)) ggsave(p, filename = filename_out) } |
187 188 189 190 191 192 193 | shell: """ assess_assembly -r {input.ref} -i {input.assembly} -p {out_dir}/pomoxis/{wildcards.id}/assess_assembly/{wildcards.id}_{wildcards.ref} >{log} 2>&1 meanQ=$(grep -A2 '# Q Scores' {output.summ} | tail -n1 | awk '{{print $2}}') percErr=$(grep -A2 '# Percentage Errors' {output.summ} | tail -n1 | awk '{{print $2}}') echo "{wildcards.id}\t{wildcards.ref}\t$meanQ\t$percErr" > {output.tsv} """ |
207 208 209 210 | shell: """ minimap2 -x asm5 -t {threads} --MD -a {input.ref} {input.assembly} 2>{log} | samtools sort -o {output} - >>{log} 2>&1 """ |
237 238 239 240 241 242 243 244 245 | shell: """ rm -rf {params.count_dir} {params.analyse_dir} assess_homopolymers count -o {params.count_dir} {input} >{log} 2>&1 assess_homopolymers analyse -o {params.analyse_dir} {output.hp_count} >>{log} 2>&1 grep -v count {output.rel_len} | sed 's/^/{wildcards.id}\t{wildcards.ref}\t/' > {output.rel_len2} grep -v rlen {output.correct_len} | sed 's/^/{wildcards.id}\t{wildcards.ref}\t/' > {output.correct_len2} """ |
257 258 259 260 261 262 263 264 | shell: """ #filter inf values, which happen when comparing identical assemblies cat {input.aa_meanQ} | grep -v -w 'inf' > {output.all_aa_meanQ} cat {input.hp_rel_len} > {output.all_hp_rel_len} cat {input.hp_correct_len} > {output.all_hp_correct_len} """ |
284 285 286 287 | shell: """ cd {out_dir}/busco && busco -q -c {threads} -f -m genome -l {busco_lineage} -o {wildcards.id} -i ../../{input} >../../{log} 2>&1 """ |
296 297 298 299 | shell: """ perl -lne 'print "{wildcards.id}\t$1\t$2\t$3\t$4" if /C:([\-\d\.]+).*F:([\-\d\.]+).*M:([\-\d\.]+).*n:(\d+)/' {input} > {output} """ |
307 308 309 310 | shell: """ cat {input} > {output} """ |
329 330 331 332 | shell: """ quast -t {threads} --glimmer -o {out_dir}/quast/{wildcards.ref} -r {input.reference} {input.fa} >{log} 2>&1 """ |
352 353 354 355 356 357 | shell: """ dnadiff -p {out_dir}/dnadiff/{wildcards.ref}/{wildcards.id}-dnadiff {input.reference} {input.assembly} >{log} 2>&1 cat {output.dnadiff_report} | grep -A3 '1-to-1' | grep 'AvgIdentity' | sed -e 's/^/{wildcards.id}\t{wildcards.ref}\t/' | perl -lpne 's/\s+/\t/g' > {output.stats_tsv} grep TotalIndels {output.dnadiff_report} | sed -e 's/^/{wildcards.id}\t{wildcards.ref}\t/' | perl -lpne 's/\s+/\t/g' >> {output.stats_tsv} """ |
365 366 367 368 | shell: """ cat {input} > {output} """ |
390 391 392 393 394 | shell: """ nucdiff {input.reference} {input.assembly} {params.nucdiff_dir} nucdiff >{log} 2>&1 cat {output.nucdiff_stat} | grep 'Insertions\|Deletions\|Substitutions' | sed -e 's/^/{wildcards.id}\t{wildcards.ref}\t/' > {output.nucdiff_tsv} """ |
402 403 404 405 | shell: """ cat {input} > {output} """ |
419 420 421 422 | shell: """ wget -P {out_dir}/ideel/uniprot http://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz >{log} 2>&1 """ |
434 435 436 437 | shell: """ diamond makedb --db {out_dir}/ideel/uniprot/uniprot_sprot --in {input} >{log} 2>&1 """ |
449 450 451 452 453 454 | shell: """ prodigal -a {output} -i {input} >{log} 2>&1 # remove stop codon (*) from end of sequences sed -i 's/*$//' {output} """ |
467 468 469 470 | shell: """ seqkit stats -T -t protein {input} >{output} 2>{log} """ |
484 485 486 487 | shell: """ diamond blastp --threads {threads} --max-target-seqs 1 --db {input.db} --query {input.proteins} --outfmt 6 qseqid sseqid qlen slen --out {output} >{log} 2>&1 """ |
501 502 503 504 505 506 | shell: """ wget -N -P {out_dir}/bakta https://zenodo.org/record/7669534/files/db-light.tar.gz >{log} 2>&1 tar --directory {out_dir}/bakta -xf {out_dir}/bakta/db-light.tar.gz >{log} 2>&1 amrfinder_update --force_update --database {out_dir}/bakta/db-light/amrfinderplus-db/ >{log} 2>&1 """ |
520 521 522 523 | shell: """ bakta --db {out_dir}/bakta/db-light --verbose --output {out_dir}/bakta/{wildcards.id} --prefix {wildcards.id} --threads {threads} {input.fa} >{log} 2>&1 """ |
540 541 542 543 | shell: """ diamond makedb --db {output} --in {input} >{log} 2>&1 """ |
557 558 559 560 | shell: """ diamond blastp --threads {threads} --max-target-seqs 1 --db {input.db} --query {input.proteins} --outfmt 6 qseqid sseqid qlen slen --out {output} >{log} 2>&1 """ |
577 578 | script: "scripts/plot-assess_assembly.R" |
599 600 | script: "scripts/plot-busco.R" |
612 613 | script: "scripts/plot-dnadiff.R" |
625 626 | script: "scripts/plot-nucdiff.R" |
641 642 | script: "scripts/plot-ideel.R" |
664 665 666 667 | shell: """ Rscript -e 'args<-commandArgs(trailingOnly = TRUE); rmarkdown::render(args[1], output_file=args[3], knit_root_dir=args[4])' {report_rmd} {out_dir} {params.wd}/{output} {params.wd} >{log} 2>&1 """ |
Support
- Future updates
Related Workflows





