NanoClass: A Taxonomic Meta-Classifier for Oxford Nanopore Amplicon Sequencing Data
NanoClass is a taxonomic meta-classifier for 16S/18S amplicon sequencing data generated with the Oxford Nanopore MinION. With a single command, you can run up to 11 popular classification tools on multiple samples in parallel, including BLASTN, Centrifuge, Kraken2, IDTAXA, MegaBLAST, dcMegaBLAST, Minimap2, Mothur, QIIME2, RDP and SPINGO. Optional read preparation steps, such as quality trimming, length filtering and sub-sampling, are an integral part of the pipeline.
NanoClass automatically installs all software packages and dependencies, downloads and builds required taxonomic databases and runs the analysis on your samples.
Getting started
Requirements
The entire NanoClass workflow can be run on a powerful desktop computer, but for many applications a laptop will do. Most classification tools implemented in NanoClass run in a matter of minutes to hours. Prerequisites are Conda and Snakemake.
NanoClass automatically installs all other software packages and dependencies.
Installation
You can either clone NanoClass, like so:
git clone https://github.com/ejongepier/NanoClass.git
Or download and extract the zip archive from https://github.com/ejongepier/NanoClass.
NanoClass is immediately ready for use. See also the Documentation.
Usage
Quick start
Enter your samples and the paths to your fastq.gz files in sample.csv. Sample labels should be unique, and both sample and run labels should contain letters and numbers only. The barcode column should be left empty, meaning your input files should already be demultiplexed. For an example, see the sample.csv file shipped with NanoClass.
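For illustration, a minimal sample.csv might look like the sketch below; the column names and paths are illustrative assumptions, so treat the sample.csv template in the repository as authoritative:

    run,sample,barcode,path
    run01,sampleA,,data/run01/sampleA.fastq.gz
    run01,sampleB,,data/run01/sampleB.fastq.gz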
After editing sample.csv, the entire pipeline can be run with a single command:
snakemake --use-conda --cores <ncores>
where <ncores> is the number of CPU cores/jobs that can be run in parallel on your system.
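If you first want to verify that everything is set up correctly, Snakemake's standard dry-run flag prints the planned jobs without executing anything, for example with 4 cores:

    snakemake --use-conda --cores 4 -n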
Customizing
You can customize the pipeline through the config.yaml, e.g. by running only a subset of the reads or classification tools, or by changing the default Silva 16S taxonomic database.
For details on how to customize NanoClass, see the Documentation.
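As an illustration of such a customization, a config.yaml edit might look like the sketch below; the key names here are assumptions for illustration only, so consult the config.yaml that ships with NanoClass for the authoritative schema:

    # illustrative sketch, not the authoritative NanoClass config schema
    methods: ["blastn", "kraken2", "minimap2"]   # run only a subset of classifiers
    subsample:
      skip: false
      samplesize: 100                            # classify only 100 reads per sample
    database:
      ssu: "16S"                                 # or "18S" for the other Silva subunit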
Report
After successful execution, you can create an interactive HTML report with:
snakemake --report report/NanoClass.zip
Authors
- Evelien Jongepier (e.jongepier@uva.nl)
Code Snippets
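The snippets below reproduce the shell directives of the individual Snakemake rules, grouped per classification tool, followed by the helper scripts they call. First, the BLASTN rules: build the BLAST database, split the query reads into chunks, run blastn, filter hits by alignment length and summarize runtimes, compute a lowest-common-ancestor (LCA) classification with tolca.py, and convert the result to matrices with tomat.py: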
shell:
    """
    makeblastdb -in {input} -parse_seqids \
        -dbtype nucl 2>&1 | tee -a {log}
    """
shell:
    """
    fasta-splitter --n-parts {params.n_chunks} {input} \
        --nopad --out-dir {params.out_dir}
    """
shell:
    """
    blastn -task 'blastn' -db {input.db} -evalue {params.eval} \
        -perc_identity {params.pident} -max_target_seqs {params.nseqs} \
        -query {input.fasta} -out {output.all} -outfmt 6 \
        2> {log}
    """
shell:
    """
    cat {input.out} | awk -F '\\t' -v l={params.alnlen} \
        '{{if ($4 > l) print $0}}' \
        > {output.out}
    cat {input.benchm} | grep -P "^[0-9]" | \
        awk '{{sum+=$1}} END {{print "s"; print sum/{params.threads}}}' \
        > {output.benchm} 2> {log}
    """
shell:
    """
    scripts/tolca.py -b {input.blast} -t {input.db} \
        -l {output} -c {params.lcacons} > {log}
    """
shell:
    "scripts/tomat.py -l {input} 2> {log}"
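Centrifuge: download the taxonomy and mapping files, reformat the reference with todb.py, build the index, classify the reads, and convert the output: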
shell:
    """
    centrifuge-download -o db/centrifuge/taxonomy taxonomy > {log} 2>&1
    wget {params.map_url} -q -O - | gzip -d -c - | \
        awk '{{print $1\".\"$2\".\"$3\"\t\"$(NF)}}' \
        > {output.tax_map} 2>> {log}
    scripts/todb.py -s {input.seq} -t {input.tax} -m centrifuge \
        -S {output.ref_seqs} -T {output.ref_tax} 2>> {log}
    """
shell:
    """
    centrifuge-build \
        --threads {threads} \
        --conversion-table {input.conversion_table} \
        --taxonomy-tree {input.tax_tree} \
        --name-table {input.name_table} \
        {input.ref_seqs} {params.prefix} > {log} 2>&1
    """
shell:
    """
    centrifuge -x {params.index_prefix} \
        -U {input.fastq} \
        --threads {threads} \
        --report-file {output.report} \
        -S {output.classification} \
        --met-stderr > {log} 2>&1
    """
shell:
    "scripts/tomat.py -c {input.out} -f {input.ref_seqs} 2> {log}"
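Shared rules: download and unpack the Silva reference database, and generate the abundance barplots, consensus classifications, precision plots, and runtime plots: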
shell:
    """
    wget -q -O db/common/db.zip {params.url}
    unzip -q -p -j db/common/db.zip \
        */taxonomy/{params.ssu}_only/99/majority_taxonomy_7_levels.txt \
        > {output.ref_tax} 2> {log}
    unzip -q -p -j db/common/db.zip \
        */rep_set/rep_set_{params.ssu}_only/99/silva_*_99_{params.ssu}.fna \
        > {output.ref_seqs} 2>> {log}
    unzip -q -p -j db/common/db.zip \
        */rep_set_aligned/99/99_alignment.fna.zip > tmp.aln.zip
    unzip -q -p tmp.aln.zip > {output.ref_aln} 2>> {log}
    rm db/common/db.zip tmp.aln.zip 2>> {log}
    """
shell:
    """
    mkdir -p ./plots
    Rscript scripts/barplot.R {input} 2> {log}
    """
shell:
    "scripts/toconsensus.py -l {input} 2> {log}"
shell:
    """
    mkdir -p ./plots
    Rscript scripts/lineplot.R {input} 2> {log}
    """
shell:
    """
    mkdir -p ./plots
    mkdir -p ./tables
    Rscript scripts/timeplot.R {input} 2> {log}
    """
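Discontiguous MegaBLAST (dc-megablast), following the same split, search, filter, and LCA pattern as the BLASTN rules: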
shell:
    """
    fasta-splitter --n-parts {params.n_chunks} {input} \
        --nopad --out-dir {params.out_dir}
    """
shell:
    """
    blastn -task 'dc-megablast' -db {input.db} -evalue {params.eval} \
        -perc_identity {params.pident} -max_target_seqs {params.nseqs} \
        -query {input.fasta} -out {output.all} -outfmt 6 \
        2> {log}
    """
shell:
    """
    cat {input.out} | awk -F '\\t' -v l={params.alnlen} \
        '{{if ($4 > l) print $0}}' \
        > {output.out}
    cat {input.benchm} | grep -P "^[0-9]" | \
        awk '{{sum+=$1}} END {{print "s"; print sum / {params.threads}}}' \
        > {output.benchm} 2> {log}
    """
shell:
    """
    scripts/tolca.py -b {input.blast} -t {input.db} \
        -l {output} -c {params.lcacons} > {log}
    """
shell:
    "scripts/tomat.py -l {input} 2> {log}"
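IDTAXA: train the DECIPHER classifier on the reference database (learntaxa.R), classify the reads (idtaxa.R), and convert the taxlist: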
shell:
    "Rscript scripts/learntaxa.R {input.ref_seqs} {input.ref_tax} {output} 2> {log}"
shell:
    """
    Rscript scripts/idtaxa.R {input.db} {input.query} {output.tmp} {threads} {params} 2> {log}
    awk 'BEGIN{{FS=OFS="\\t"}} {{for (i=1; i<=7; i++) if ($i ~ /^ *$/) $i="NA"}} 1' {output.tmp} > {output.out}
    """
shell:
    "scripts/tomat.py -l {input.list} 2> {log}"
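Kraken2: build the special database, classify the reads, and convert the output with tomat.py, mapping taxon IDs back via the Kraken2 map: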
shell:
    """
    kraken2-build --db {params.db} --special {params.db_type} \
        --threads {threads} > {log} 2> {log}
    """
shell:
    """
    kraken2 --db {params.db_dir} \
        --output {output.out} \
        --report {output.report} \
        --gzip-compressed \
        --threads {threads} {input.fastq} 2> {log}
    """
shell:
    """
    scripts/tomat.py -k {input.kraken_out} -f {input.silva_seqs} \
        -m {input.kraken_map} 2> {log}
    """
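MAPseq: classify the reads directly against the reference sequences and taxonomy: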
shell:
    """
    mapseq -nthreads {threads} {input.query} {input.ref_seqs} \
        {input.ref_tax} > {output} 2> {log}
    """
shell:
    "scripts/tomat.py -b {input.out} -t {input.db} 2> {log}"
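MegaBLAST, again with the split, search, filter, and LCA pattern: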
shell:
    """
    fasta-splitter --n-parts {params.n_chunks} {input} \
        --nopad --out-dir {params.out_dir}
    """
shell:
    """
    blastn -task 'megablast' -db {input.db} -evalue {params.eval} \
        -perc_identity {params.pident} -max_target_seqs {params.nseqs} \
        -query {input.fasta} -out {output.all} -outfmt 6 \
        2> {log}
    """
shell:
    """
    cat {input.out} | awk -F '\\t' -v l={params.alnlen} \
        '{{if ($4 > l) print $0}}' \
        > {output.out}
    cat {input.benchm} | grep -P "^[0-9]" | \
        awk '{{sum+=$1}} END {{print "s"; print sum / {params.threads}}}' \
        > {output.benchm} 2> {log}
    """
shell:
    """
    scripts/tolca.py -b {input.blast} -t {input.db} \
        -l {output} -c {params.lcacons} > {log}
    """
shell:
    "scripts/tomat.py -l {input} 2> {log}"
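Minimap2: map the reads, extract read-to-reference pairs with samtools, and compute the LCA: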
shell:
    """
    minimap2 {params.extra} -t {threads} -N {params.nseqs} \
        {input.target} {input.query} -o {output} 2> {log}
    """
shell:
    """
    samtools sort -@ {threads} {input} | \
        samtools view -@ {threads} | \
        cut -f 1,3 > {output} 2> {log}
    """
shell:
    """
    scripts/tolca.py -b {input.mm} -t {input.db} \
        -l {output} -c {params.lcacons} > {log}
    """
shell:
    "scripts/tomat.py -l {input} 2> {log}"
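Mothur: reformat the reference with todb.py and classify with mothur's align.seqs: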
shell:
    """
    scripts/todb.py -s {input.aln} -t {input.tax} -m mothur \
        -S {output.aln} -T {output.tax} 2> {log}
    """
shell:
    """
    mkdir -p {output.dir}
    cp {input} {output.dir}
    queryf=$(basename {input.query})
    alnf=$(basename {input.aln})
    echo \"align.seqs(candidate={output.dir}/$queryf, template={output.dir}/$alnf, \
        processors={threads}, ksize=6, align=needleman); -q\" > {output.dir}/mothur.cmd
    mothur {output.dir}/mothur.cmd 2> {log}
    awk -F '\\t' -v OFS='\\t' '{{if (NR==1) printf "%s","#"; print $1, $3}}' \
        {output.dir}/{params.file} > {output.out} 2>> {log}
    """
shell:
    "scripts/tomat.py -b {input.out} -t {input.tax} 2> {log}"
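Read preparation: adapter trimming with Porechop, quality and length filtering with NanoFilt, sub-sampling with seqtk, fastq-to-fasta conversion, and quality-control plots and statistics with pistis and NanoStat: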
shell:
    """
    porechop --input {input} \
        --output {output} \
        --threads {threads} \
        --check_reads {params.check_reads} \
        --discard_middle > {log}
    """
shell:
    """
    gzip -d -c {input} | \
        NanoFilt --quality {params.quality} --length {params.min_len} --maxlength {params.max_len} | \
        gzip > {output} 2> {log}
    """
shell:
    """
    seqtk sample -s {params.seed} {input} {params.n} | \
        gzip -c > {output}
    """
shell:
    "zcat < {input} | sed -n '1~4s/^@/>/p;2~4p' > {output}"
shell:
    """
    pistis --fastq {input} --output {output} \
        --downsample {params.downsample} 2> {log}
    """
shell:
    """
    NanoStat --fastq {input} --name {output} \
        --threads {threads} 2> {log}
    """
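QIIME2: import the query reads, reference sequences, and reference taxonomy as artifacts, classify with classify-consensus-vsearch, and reshape the exported taxonomy table: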
shell:
    """
    qiime tools import \
        --type 'FeatureData[Sequence]' \
        --input-path {input.fasta} \
        --output-path {output} \
        2>&1 | tee -a {log}
    """
shell:
    """
    qiime tools import \
        --type 'FeatureData[Sequence]' \
        --input-path {input} \
        --output-path {output} \
        2>&1 | tee -a {log}
    """
shell:
    """
    qiime tools import \
        --type FeatureData[Taxonomy] \
        --input-format HeaderlessTSVTaxonomyFormat \
        --input-path {input} \
        --output-path {output} \
        2>&1 | tee -a {log}
    """
shell:
    """
    qiime feature-classifier classify-consensus-vsearch \
        --i-query {input.query} \
        --i-reference-reads {input.ref_seqs} \
        --i-reference-taxonomy {input.taxo} \
        --o-classification {output.qza} \
        --p-threads {threads} \
        --p-perc-identity {params.pident} \
        --p-maxaccepts {params.nseqs} \
        --p-min-consensus {params.lcacons} \
        2>&1 | tee -a {log}
    qiime tools export \
        --input-path {output.qza} \
        --output-path {output.path} 2>&1 | tee -a {log}
    mv {output.path}/taxonomy.tsv {output.out} 2>> {log}
    """
shell:
    """
    cut -f 1,2 {input} | grep -P -v '\\tUnassigned$' | \
        awk -F '\\t|;' -v OFS='\\t' '{{NF=7}}1' | \
        awk 'BEGIN {{ FS = OFS = "\\t" }} {{ for(i=1; i<=NF; i++) if($i == "") $i = "NA" }}; 1' | \
        sed "s/D_[[:digit:]]__//g" | \
        sed "s/^Feature ID.*/\#readid\\tDomain\\tPhylum\\tClass\\tOrder\\tFamily\\tGenus/" \
        > {output} 2> {log}
    """
shell:
    "scripts/tomat.py -l {input.list} 2> {log}"
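RDP: reformat the reference with todb.py and classify with the RDP naive Bayesian classifier as implemented in DADA2's assignTaxonomy (assigntaxonomy.R, shown further below):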
shell:
    """
    scripts/todb.py -s {input.seq} -t {input.tax} -m rdp -S tmp.seq -T {output.tax} 2> {log}
    gzip -c tmp.seq > {output.seq} && rm tmp.seq 2>> {log}
    """
shell:
    "Rscript scripts/assigntaxonomy.R {input.db} {input.query} {output} {threads} {params} 2> {log}"
shell:
    "scripts/tomat.py -l {input.list} 2> {log}"
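SPINGO: reformat the reference with todb.py, index it with spindex, classify, and reshape the output: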
shell:
    """
    scripts/todb.py -s {input.seq} -t {input.tax} -m spingo \
        -S {output.seq} -T {output.tax} 2> {log}
    spindex -k 8 -p {threads} -d {output.seq} 2>> {log}
    """
shell:
    """
    spingo -d {input.db} -k 8 -a \
        -p {threads} -i {input.fasta} \
        > {output.tmp} 2> {log}
    sed 's/ /\\t/' {output.tmp} | \
        awk -F '\\t' -v OFS='\\t' \
        '{{ if (NR == 1) print "#readid","Domain","Phylum","Class","Order","Family","Genus";
            else print $1, $4, $6, $8, $10, $12, $14 }}' > {output.out} 2> {log}
    """
shell:
    "scripts/tomat.py -l {input.list} 2> {log}"
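The helper scripts called above follow. assigntaxonomy.R wraps DADA2's assignTaxonomy for the RDP method: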
rm(list=ls())
suppressPackageStartupMessages(library(dada2))
suppressPackageStartupMessages(library(seqinr))

args = commandArgs(trailingOnly=TRUE)
threads = as.numeric(args[4])
pident = as.numeric(args[5])

query <- read.fasta(args[2], as.string = T, forceDNAtolower = F, set.attributes = F)

set.seed(100)
taxann <- assignTaxonomy(seqs = unlist(query), refFasta = args[1],
                         tryRC = T, outputBootstraps = F, minBoot = pident,
                         multithread = threads, verbose = F,
                         taxLevels = c("Kingdom", "Phylum", "Class",
                                       "Order", "Family", "Genus"))

# remove lengthy seqs from row names
queryid <- names(strsplit(rownames(taxann), " "))

taxodf <- as.data.frame(taxann)
for (i in 1:length(taxodf)) taxodf[,i] <- gsub('_', ' ', taxodf[,i])
names(taxodf)[1] <- "Domain"
taxodf$"#readid" <- queryid

write.table(taxodf[,c(7,1:6)], file = args[3],
            row.names=F, col.names=T, quote=F, sep='\t')
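barplot.R merges the per-method taxonomy and OTU matrices into phyloseq objects and writes absolute and relative abundance barplots per taxonomic level: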
suppressPackageStartupMessages(library(phyloseq))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(vroom))
suppressPackageStartupMessages(library(tidyr))

args = commandArgs(trailingOnly=TRUE)

taxmat <- as.data.frame(unique(vroom(delim = '\t', args)))
taxmat <- taxmat[!is.na(taxmat$Domain),]
taxmat <- unique(taxmat)
names(taxmat)[1] <- "taxid"
write.table(taxmat, file="tables/taxonomy-table.tsv", row.names=F, col.names=T, sep='\t', quote=F)
rownames(taxmat) <- taxmat$taxid
taxmat$taxid <- NULL
taxmat <- as.matrix(taxmat)

file <- gsub(".taxmat$", ".otumat", args)
sam <- data.frame(run = rep(NA, length(file)),
                  sample = rep(NA, length(file)),
                  method = rep(NA, length(file)))

for (i in 1:length(file)){
    otumatje <- read.table(file[i], header = T, sep = '\t', comment = "")
    names(otumatje)[1] <- "taxid"
    names(otumatje)[2] <- paste0(strsplit(file[i], "/")[[1]][2],"_",names(otumatje)[2])
    ifelse (i == 1, otumat <- otumatje, otumat <- merge(otumat, otumatje, all=TRUE))
    lab = strsplit(file[i], "/")[[1]][4]
    sam$run[i] = strsplit(file[i], "/")[[1]][2]
    sam$sample[i] = strsplit(lab, "[.]")[[1]][1]
    sam$method[i] = strsplit(lab, "[.]")[[1]][2]
}
otumat[is.na(otumat)] <- 0

## for some custom DB like BOLD local copy, the same taxonomic lineage may occur 2x.
## This causes an error when they are used as rownames, which should be unique --> fix: aggregate
otumat <- aggregate(otumat[2:length(otumat)], by=list(otumat$taxid), sum)
names(otumat)[1] <- "taxid"
write.table(otumat, file="tables/otu-table.tsv", row.names=F, col.names=T, sep='\t', quote=F)
rownames(otumat) <- otumat$taxid
otumat$taxid <- NULL
otumat <- as.matrix(otumat)
rownames(sam) <- colnames(otumat)

OTU = otu_table(otumat, taxa_are_rows = TRUE)
TAX = tax_table(taxmat)
SAM = sample_data(sam)
physeq = phyloseq(OTU, TAX, SAM)
pphyseq = transform_sample_counts(physeq, function(x) x / sum(x))

theme_set(theme_bw())

for (level in colnames(taxmat)){
    top.taxa <- tax_glom(physeq, level)
    TopNOTUs <- names(sort(taxa_sums(top.taxa), TRUE)[1:20])
    BottumNOTUs <- names(taxa_sums(top.taxa))[which(!names(taxa_sums(top.taxa)) %in% TopNOTUs)]
    merged_physeq = merge_taxa(top.taxa, BottumNOTUs, 2)
    mdf = psmelt(merged_physeq)
    names(mdf)[names(mdf) == level] <- "level"
    mdf$OTU[which(is.na(mdf$level))] <- "aaaOther"
    mdf$level[which(is.na(mdf$level))] <- "aaaOther"
    aggr_mdf <- aggregate(Abundance ~ sample + run + method + level, data = mdf, sum)
    labs = aggr_mdf$level; labs[labs=="aaaOther"] <- "Other"
    cols = scales::hue_pal()(length(unique(labs))); cols[unique(labs) == "Other"] <- "#CCCCCC"

    p = ggplot(aggr_mdf, aes_string(x = "method", y = "Abundance", fill = "level"))
    p = p + scale_fill_manual(name = level, labels = unique(labs), values = cols)
    p = p + facet_grid(paste0("", aggr_mdf$run) ~ paste0("", aggr_mdf$sample))
    p = p + geom_bar(stat = "identity", position = "stack", color = "black", size = 0.1)
    p = p + guides(fill=guide_legend(ncol=1))
    p = p + labs(x = "Method", y = "Absolute abundance")
    p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0, size = 5))
    ggsave(paste0("plots/aabund-", level, "-by-sample.pdf"), plot=p, device="pdf")

    p = ggplot(aggr_mdf, aes_string(x = "sample", y = "Abundance", fill = "level"))
    p = p + scale_fill_manual(name = level, labels = unique(labs), values = cols)
    p = p + facet_grid(paste0("", aggr_mdf$run) ~ paste0("", aggr_mdf$method))
    p = p + geom_bar(stat = "identity", position = "stack", color = "black", size = 0.1)
    p = p + guides(fill=guide_legend(ncol=1))
    p = p + labs(x = "Sample", y = "Absolute abundance")
    p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0, size = 5))
    ggsave(paste0("plots/aabund-", level, "-by-method.pdf"), plot=p, device="pdf")
}

for (level in colnames(taxmat)){
    top.taxa <- tax_glom(pphyseq, level)
    TopNOTUs <- names(sort(taxa_sums(top.taxa), TRUE)[1:20])
    BottumNOTUs <- names(taxa_sums(top.taxa))[which(!names(taxa_sums(top.taxa)) %in% TopNOTUs)]
    merged_physeq = merge_taxa(top.taxa, BottumNOTUs, 2)
    mdf = psmelt(merged_physeq)
    names(mdf)[names(mdf) == level] <- "level"
    mdf$OTU[which(is.na(mdf$level))] <- "aaaOther"
    mdf$level[which(is.na(mdf$level))] <- "aaaOther"
    aggr_mdf <- aggregate(Abundance ~ sample + run + method + level, data = mdf, sum)
    labs = aggr_mdf$level; labs[labs=="aaaOther"] <- "Other"
    cols = scales::hue_pal()(length(unique(labs))); cols[unique(labs) == "Other"] <- "#CCCCCC"

    p = ggplot(aggr_mdf, aes_string(x = "method", y = "Abundance", fill = "level"))
    p = p + scale_fill_manual(name = level, labels = unique(labs), values = cols)
    p = p + facet_grid(paste0("", aggr_mdf$run) ~ paste0("", aggr_mdf$sample))
    p = p + geom_bar(stat = "identity", position = "stack", color = "black", size = 0.1)
    p = p + guides(fill=guide_legend(ncol=1))
    p = p + labs(x = "Method", y = "Relative abundance")
    p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0, size = 5))
    ggsave(paste0("plots/rabund-", level, "-by-sample.pdf"), plot=p, device="pdf")

    p = ggplot(aggr_mdf, aes_string(x = "sample", y = "Abundance", fill = "level"))
    p = p + scale_fill_manual(name = level, labels = unique(labs), values = cols)
    p = p + facet_grid(paste0("", aggr_mdf$run) ~ paste0("", aggr_mdf$method))
    p = p + geom_bar(stat = "identity", position = "stack", color = "black", size = 0.1)
    p = p + guides(fill=guide_legend(ncol=1))
    p = p + labs(x = "Sample", y = "Relative abundance")
    p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0, size = 5))
    ggsave(paste0("plots/rabund-", level, "-by-method.pdf"), plot=p, device="pdf")
}
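idtaxa.R classifies the reads with DECIPHER's IdTaxa, using the classifier trained by learntaxa.R: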
rm(list=ls())
suppressPackageStartupMessages(library("DECIPHER"))

args = commandArgs(trailingOnly=TRUE)
threads = as.numeric(args[4])
pctth = as.numeric(args[5])

query <- readDNAStringSet(args[2])
db <- load(args[1])

taxann <- IdTaxa(query, trained_classifiyer,
                 strand = "both",
                 threshold = pctth,
                 processors = threads,
                 verbose = F)

assignment <- sapply(taxann, function(x) paste(x$taxon, collapse="\t"))
names(assignment) <- sapply(strsplit(names(assignment)," "), `[`, 1)

dl <- lapply(assignment, function (x) {gsub("D_[0-6]__", "", x)})
dl <- lapply(dl, function (x) {gsub("Root\t", "", x)})
df <- setNames(as.data.frame(unlist(dl)), "Domain\tPhylum\tClass\tOrder\tFamily\tGenus")
df$'#readid' <- rownames(df)

write.table(df[,c(2,1)], file=args[3], row.names=F, col.names=T, quote=F, sep='\t')
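learntaxa.R trains the DECIPHER classifier on the reference sequences and taxonomy: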
rm(list=ls())
suppressPackageStartupMessages(library("DECIPHER"))

args = commandArgs(trailingOnly=TRUE)

# train classifier
refseqs <- readDNAStringSet(args[1])
taxo = read.table(args[2], sep = "\t", comment.char = "", quote="")
taxonomy <- paste0(taxo$V1, " Root;", taxo$V2)
trained_classifiyer <- LearnTaxa(train = refseqs, taxonomy = taxonomy, verbose = T)

save(trained_classifiyer, file = args[3])
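lineplot.R plots the per-method precision across taxonomic ranks: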
suppressPackageStartupMessages(library(ggplot2))

args = commandArgs(trailingOnly=TRUE)

dl = list()
for (i in 1:length(args)){
    # Get metadata
    lab = strsplit(args[i], "/")[[1]][4]
    run = strsplit(args[i], "/")[[1]][2]
    sample = strsplit(lab, "[.]")[[1]][1]
    method = strsplit(lab, "[.]")[[1]][2]
    # Read data and compute total prop precision
    tbl <- read.table(args[i], header = T, sep = '\t', comment = "", row.names = 1)
    dt <- as.data.frame(colSums(tbl, na.rm=T) / colSums(!is.na(tbl)))
    names(dt) <- "precision"
    # Add meta data columns
    dt$level <- row.names(dt)
    dt$rank <- factor(dt$level, ordered=T,
                      levels=c("Domain","Phylum","Class","Order","Family","Genus"))
    dt$run <- rep(run, nrow(dt))
    dt$sample <- rep(sample, nrow(dt))
    dt$method <- rep(method, nrow(dt))
    # Add to list
    dl[[i]] <- dt
}
df <- do.call(rbind, dl)

theme_set(theme_bw())

p = ggplot(df, aes(x = rank, y = precision, group = method, col = method)) +
    geom_line() + geom_point() +
    labs(x = "", y = "Precision") + ylim(0,1) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p = p + facet_grid(paste0("Run: ",run) ~ paste0("",sample))
ggsave("plots/precision.pdf", plot=p, device="pdf")
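timeplot.R summarizes the per-method and per-sample runtimes from the benchmark files: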
suppressPackageStartupMessages(library(ggplot2))

args = commandArgs(trailingOnly=TRUE)

dl = list()
for (i in 1:length(args)){
    # Get metadata
    lab = strsplit(args[i], "/")[[1]][3]
    run = strsplit(args[i], "/")[[1]][2]
    sample = strsplit(strsplit(lab, "[_]")[[1]][3], "[.]")[[1]][1]
    method = strsplit(lab, "[_]")[[1]][1]
    # Read data
    dt <- as.data.frame(read.table(args[i], header = T, sep = '\t', comment = "")[,1])
    # Add meta data columns
    dt$run <- rep(run, nrow(dt))
    dt$sample <- rep(sample, nrow(dt))
    dt$method <- rep(method, nrow(dt))
    # Add to list
    dl[[i]] <- dt
}
df <- do.call(rbind, dl)
names(df)[1] <- "s"

theme_set(theme_bw())

p1 = ggplot(df, aes(x = method, y = s, fill=method)) +
    geom_bar(stat="identity", color="black") +
    labs(x = "", y = "Runtime (s)") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p1 = p1 + facet_grid(paste0("Run: ",run) ~ paste0("Sample: ",sample))
p1 = p1 + theme(legend.position="none")
ggsave("plots/runtime-by-sample.pdf", plot=p1, device="pdf")

q1 = ggplot(df, aes(x = method, y = s, fill=method)) +
    geom_bar(stat="identity", color="black") +
    scale_y_continuous(trans = "log10") +
    labs(x = "", y = "Runtime (s)") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
q1 = q1 + facet_grid(paste0("Run: ",run) ~ paste0("Sample: ",sample))
q1 = q1 + theme(legend.position="none")
ggsave("plots/runtime_log-by-sample.pdf", plot=q1, device="pdf")

p2 = ggplot(df, aes(x = sample, y = s, fill=sample)) +
    geom_bar(stat="identity", color="black") +
    labs(x = "", y = "Runtime (s)") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p2 = p2 + facet_grid(paste0("Run: ",run) ~ paste0("Method: ",method))
p2 = p2 + theme(legend.position="none")
ggsave("plots/runtime-by-method.pdf", plot=p2, device="pdf")

q2 = ggplot(df, aes(x = sample, y = s, fill=sample)) +
    geom_bar(stat="identity", color="black") +
    scale_y_continuous(trans = "log10") +
    labs(x = "", y = "Runtime (s)") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
q2 = q2 + facet_grid(paste0("Run: ",run) ~ paste0("Method: ",method))
q2 = q2 + theme(legend.position="none")
ggsave("plots/runtime_log-by-method.pdf", plot=q2, device="pdf")
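toconsensus.py derives consensus classifications across tools and writes per-tool precision tables: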
__author__ = "Evelien Jongepier"
__copyright__ = "Copyright 2020, Evelien Jongepier"
__email__ = "e.jongepier@uva.nl"
__license__ = "MIT"

import sys
import re
import argparse
import collections


class ToConsensus(object):

    def __init__(self, taxlists, consensustax, pconsensus):
        self.taxlists = taxlists
        self.consensustax = consensustax
        self.pconsensus = pconsensus
        return

    def taxlist_per_sample_per_run_dict(self):
        '''
        Example output: {run : {sample1 : [path1, path2, ...],
                                sample2 : [path3, path4, ...], ...}
        Stores the path to each of the input taxlists for each of the
        samples for each of the runs.
        '''
        sampledict = collections.defaultdict(
            dict = collections.defaultdict(list))
        for idx, taxlist in enumerate(self.taxlists):
            run = taxlist.split('/')[1]
            sample = taxlist.split('/')[3].split('.')[0]
            if run not in sampledict:
                sampledict[run] = {sample: [taxlist]}
            elif sample not in sampledict[run]:
                sampledict[run][sample] = [taxlist]
            else:
                sampledict[run][sample].append(taxlist)
        return sampledict

    def run_all_runs_n_samples(self):
        '''
        '''
        sampledict = self.taxlist_per_sample_per_run_dict()
        for run in sampledict:
            for sample in sampledict[run]:
                ftaxlists = sampledict[run][sample]
                samplereaddict = self.get_read_dict_per_sample(ftaxlists)
                nlists = len(ftaxlists)
                consensusreaddict = self.comp_consensus_tax(samplereaddict, nlists)
                for ftaxlist in ftaxlists:
                    precisiondict = self.comp_precision_per_list(consensusreaddict, ftaxlist)
                    path = self.get_output_path(ftaxlist)
                    self.write_precision_tax(precisiondict, path)
                #self.write_consensus_tax(
        return

    def get_output_path(self, ftaxlist):
        dir = '/'.join(ftaxlist.split('/')[:3])
        method = ftaxlist.split('/')[2]
        sample = ftaxlist.split('/')[3].split('.')[0]
        outpath = dir + '/' + sample + '.' + method + '.precision'
        return outpath

    def get_read_dict_per_list(self, taxlist):
        '''
        Example output: readdict = {read_id: {Domain:tool1}, {Phylum:tool1}, {Class:tool1}, ...}
        Stores the tax classification of a single taxlist (i.e. the
        taxonomic classification based on a single tool) in a dict of
        dicts. The outer dict has readid as key and a dict of tax as
        value. The inner / nested dicts have tax level as key, with the
        tax annotation at that level for that read in that taxlist as str.
        '''
        readdict = collections.defaultdict(dict)
        with open(taxlist) as f:
            for l in f.readlines():
                if l.startswith("#"):
                    headers = l.strip().split('\t')[1:]
                else:
                    readid = l.strip().split('\t')[0]
                    taxlist = l.strip().split('\t')[1:7]
                    readdict[readid] = dict(
                        (headers[idx], tax) for idx, tax in enumerate(taxlist))
        return readdict

    def get_read_dict_per_sample(self, ftaxlists):
        '''
        Example output: readdict = {read_id: {Domain:[tool1, tool2, tool3, ...]},
                                    {Phylum:[tool1, tool2, tool3, ...]}, ...}
        Stores the tax classification of all taxlists (i.e. the taxonomic
        classification based on all tools) in a dict of dicts of lists.
        The outer dict has readid as key and a dict of tax as value. The
        inner / nested dicts have tax level as key, with the tax
        annotations at that level for that read in each of the taxlists
        (thus tools) as a list.
        '''
        samplereaddict = collections.defaultdict(
            dict = collections.defaultdict(list))
        for taxlist in ftaxlists:
            readdict = self.get_read_dict_per_list(taxlist)
            for readid in readdict:
                if readid not in samplereaddict:
                    samplereaddict[readid] = readdict[readid]
                    # change tax into list so all taxlists can be appended
                    for level in readdict[readid]:
                        samplereaddict[readid][level] = [samplereaddict[readid][level]]
                else:
                    for level in readdict[readid]:
                        tax = readdict[readid][level]
                        samplereaddict[readid][level].append(tax)
        return samplereaddict

    def comp_consensus_tax(self, samplereaddict, nlists):
        '''
        Example output: {readid = {Domain:cons}, {Phylum:cons}, ...}
        Compute the consensus for each read at each taxonomic level,
        based on the user-defined threshold (0.5 = majority) and the
        taxonomic classifications of each tool.
        '''
        consensusreaddict = collections.defaultdict(dict)
        for readid in samplereaddict:
            for level in samplereaddict[readid]:
                freqdict = collections.Counter(samplereaddict[readid][level])
                maxkey = max(freqdict, key=freqdict.get)
                if (freqdict[maxkey] / float(nlists)) > self.pconsensus:
                    consensusreaddict[readid][level] = maxkey
                else:
                    consensusreaddict[readid][level] = "NA"
        return consensusreaddict

    def comp_precision_per_list(self, consensusreaddict, ftaxlist):
        '''
        Example output: {readid = {Domain:bool}, {Phylum:bool}, ...}
        For each read and each taxonomic level, a boolean indicating
        whether the classification matches the consensus classification
        or not.
        '''
        readdict = self.get_read_dict_per_list(ftaxlist)
        precisiondict = collections.defaultdict(dict)
        for readid in readdict:
            for level in readdict[readid]:
                if readdict[readid][level] == 'NA' or consensusreaddict[readid][level] == 'NA':
                    precisiondict[readid][level] = 'NA'
                elif readdict[readid][level] == consensusreaddict[readid][level]:
                    precisiondict[readid][level] = '1'
                else:
                    precisiondict[readid][level] = '0'
        return precisiondict

    def write_consensus_tax(self):
        '''
        Example output: "readid \t Domain \t Phylum ..."
        Write the consensus to file, incl. header line.
        '''
        consensusreaddict = self.comp_consensus_tax()
        levels = ['Domain','Phylum','Class','Order','Family','Genus']
        with open(self.consensustax, 'w') as f:
            f.write('#readid' + '\t' + '\t'.join(levels) + '\n')
            for readid in consensusreaddict:
                tax = []
                for level in levels:
                    tax.append(consensusreaddict[readid][level])
                f.write(readid + '\t' + '\t'.join(tax) + '\n')
        return

    def write_precision_tax(self, precisiondict, path):
        '''
        Example output: "readid \t Domain \t Phylum ..."
        Write the precision (0/1) to file, incl. header line.
        '''
        levels = ['Domain','Phylum','Class','Order','Family','Genus']
        with open(path, 'w') as f:
            f.write('#readid' + '\t' + '\t'.join(levels) + '\n')
            for readid in precisiondict:
                tax = []
                for level in levels:
                    tax.append(precisiondict[readid][level])
                f.write(readid + '\t' + '\t'.join(tax) + '\n')
        return


def main(taxlists, consensustax, pconsensus):
    m = ToConsensus(
        taxlists = taxlists,
        consensustax = consensustax,
        pconsensus = pconsensus
    )
    m.run_all_runs_n_samples()
    return


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Obtain consensus and '
        'majority taxonomic classifications based on a list of taxlists '
        'as generated by various classification tools. Taxlists can be '
        'obtained with tomat.py')
    parser.add_argument('-l', '--taxlists', nargs='+', default=None,
        help='List of paths to taxlists, which contain a taxonomic '
             'identifier in the first column, followed by tab-delimited '
             '6-level taxonomic classifications sensu Silva.')
    parser.add_argument('-o', '--consensustax', default=None,
        help='Path to where the output consensus taxonomy should be '
             'written.')
    parser.add_argument('-c', '--pconsensus', type=float, default=0.5,
        help='Consensus threshold ranging between 0.5 and 1. Minimum '
             'fraction of tools that need to agree on each taxonomic '
             'classification at each taxonomic level. Majority obtained '
             'with 0.5. Taxonomic annotations for which there is no '
             'consensus are returned as NA.')
    args = parser.parse_args()

    assert (len(args.taxlists) > 2), '''
        \n Please provide at least three taxlists to determine consensus
        taxonomy. Read the help (--help) for further information.
        '''

    main(
        taxlists = args.taxlists,
        consensustax = args.consensustax,
        pconsensus = args.pconsensus
    )
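A hypothetical invocation; the result paths below are illustrative, derived from the run/sample/method directory layout the script expects:

    scripts/toconsensus.py -c 0.5 \
        -l results/run01/blastn/sampleA.blastn.taxlist \
           results/run01/kraken2/sampleA.kraken2.taxlist \
           results/run01/minimap2/sampleA.minimap2.taxlist

Next, todb.py, which converts the qiime2-formatted reference database into the formats the other classifiers expect: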
__author__ = "Evelien Jongepier"
__copyright__ = "Copyright 2020, Evelien Jongepier"
__email__ = "e.jongepier@uva.nl"
__license__ = "MIT"

import sys
import re
import argparse
import collections
from Bio import SeqIO


class ToDb(object):

    def __init__(self, refseq, reftax, method, outseq, outtax):
        self.refseq = refseq
        self.reftax = reftax
        self.method = method
        self.outseq = outseq
        self.outtax = outtax
        return

    def get_ref_dict(self):
        '''
        Example output: {id1:"domain1;phylum1;...", id2:"domain2;phylum2;..."}
        Create a dictionary with sequence id as key and taxonomic string
        as value. The exact formatting of the taxonomic string differs
        between methods.
        '''
        refdict = {}
        with open(self.reftax) as t:
            for line in t.readlines():
                id, tax = line.strip().split('\t')[:2]
                tax = re.sub(r'D_[0-6]__', '', tax)
                tax = re.sub(r' ', '_', tax)
                if self.method == 'centrifuge':
                    refdict[id] = tax
                else:
                    tax = tax.split(';')[:6]
                    if self.method == 'mothur':
                        refdict[id] = ';'.join(tax)
                    if self.method == 'rdp':
                        refdict[id] = ';'.join(tax)
                    if self.method == 'spingo':
                        tax.reverse()
                        refdict[id] = '\t'.join(tax)
        return refdict

    def get_out_files(self):
        '''
        Reformat fasta headers to make them compatible with the selected
        method.
        '''
        with open(self.outseq, "w") as outs, open(self.outtax, "w") as outt:
            refdict = self.get_ref_dict()
            for record in SeqIO.parse(open(self.refseq), 'fasta'):
                if record.id in refdict:
                    record.description = ''
                    outt.write(record.id + '\t' + refdict[record.id] + '\n')
                    if self.method == 'mothur':
                        record.id = record.id + '\tNA\t' + refdict[record.id]
                    if self.method == 'rdp':
                        record.id = refdict[record.id] + ';'
                    if self.method == 'spingo':
                        record.id = record.id + "]\t" + refdict[record.id]
                    if self.method == 'centrifuge':
                        record.id = record.id + ' ' + refdict[record.id]
                    SeqIO.write(record, outs, "fasta")


def main(refseq, reftax, method, outseq, outtax):
    m = ToDb(
        refseq = refseq,
        reftax = reftax,
        method = method,
        outseq = outseq,
        outtax = outtax
    )
    m.get_out_files()
    return


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert a qiime2-formatted '
        'database, consisting of a sequence file and a 7-level taxonomy '
        'file, into database file(s) compatible with other taxonomic '
        'classification methods.')
    parser.add_argument('-s', '--refseq', default=None,
        help='Path to fasta file of reference nucleotide sequences. '
             'Headers should contain only the sequence identifier.')
    parser.add_argument('-t', '--reftax', default=None,
        help='Path to tab-separated table with reference taxonomy. '
             'The first column should contain the sequence identifier '
             'as given in "refseq", the second column a semicolon-'
             'separated 7-level taxonomy.')
    parser.add_argument('-m', '--method', default=None,
        help='A string specifying for which taxonomic classification '
             'tool the output db files should be re-formatted. Valid '
             'values are "mothur", "centrifuge", "rdp" or "spingo".')
    parser.add_argument('-S', '--outseq', default=None,
        help='Path to output fasta file with re-formatted sequences.')
    parser.add_argument('-T', '--outtax', default=None,
        help='Path to tab-separated output table with re-formatted '
             'taxonomy.')
    args = parser.parse_args()

    main(
        refseq = args.refseq,
        reftax = args.reftax,
        method = args.method,
        outseq = args.outseq,
        outtax = args.outtax
    )
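tolca.py computes a lowest-common-ancestor classification from tabular BLAST-like hits: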
__author__ = "Evelien Jongepier"
__copyright__ = "Copyright 2021, Evelien Jongepier"
__email__ = "e.jongepier@uva.nl"
__license__ = "MIT"

import sys
import re
import argparse
import collections


class ToLCA(object):

    def __init__(self, blast, taxonomy, lca, pconsensus):
        self.blast = blast
        self.taxonomy = taxonomy
        self.lca = lca
        self.pconsensus = pconsensus
        self.levels = ["Domain","Phylum","Class","Order","Family","Genus"]
        return

    def get_blastdict(self):
        '''
        Example output: readdict = {read_id: [hitid1, hitid2, hitid3], ...}
        Stores the list of blast hit ids for each read in a dict.
        '''
        readdict = collections.defaultdict(list)
        with open(self.blast) as f:
            for l in f.readlines():
                readid = l.strip().split('\t')[0]
                hitid = l.strip().split('\t')[1]
                if readid not in readdict:
                    readdict[readid] = [hitid]
                else:
                    readdict[readid].append(hitid)
        return readdict

    def get_dbdict(self):
        '''
        Example output: dbdict = {dbid: [domain, order, class, ...], ...}
        Stores the list of taxonomic annotations (i.e. 7 levels) for
        each reference in the db.
        '''
        dbdict = collections.defaultdict(list)
        with open(self.taxonomy) as f:
            for l in f.readlines():
                dbid, dbtaxstr = l.strip().split('\t')[:2]
                dbtaxstr = re.sub(r'D_[0-6]__', '', dbtaxstr)
                dbtax = dbtaxstr.split(';')[:6]
                dbdict[dbid] = dbtax
        return dbdict

    def get_classdict(self):
        '''
        Example output: classdict = {readid: {domain: [hit1, hit2, hit3, ...], ...}, ...}
        Stores the list of classifications at each of the 7 levels of
        the taxonomy for each db hit of each read. Redundant hits, not
        yet consensus.
        '''
        classdict = collections.defaultdict(
            dict = collections.defaultdict(list))
        readdict = self.get_blastdict()
        dbdict = self.get_dbdict()
        for readid in readdict:
            for hitid in readdict[readid]:
                dbtaxs = dbdict[hitid]
                if readid not in classdict:
                    classdict[readid] = dict(
                        (self.levels[idx], [tax]) for idx, tax in enumerate(dbtaxs))
                else:
                    for idx, tax in enumerate(dbtaxs):
                        classdict[readid][self.levels[idx]].append(tax)
        return classdict

    def comp_lca(self):
        '''
        Example output: {readid = {Domain:cons}, {Phylum:cons}, ...}
        Compute the consensus for each read at each taxonomic level,
        based on the user-defined threshold (0.5 = majority) and the
        taxonomic classifications of each hit.
        '''
        lcadict = collections.defaultdict(dict)
        classdict = self.get_classdict()
        for readid in classdict:
            for level in classdict[readid]:
                freqdict = collections.Counter(classdict[readid][level])
                totfreq = sum(freqdict.values())
                maxkey = max(freqdict, key=freqdict.get)
                if (freqdict[maxkey] / float(totfreq)) > self.pconsensus:
                    lcadict[readid][level] = maxkey
                else:
                    lcadict[readid][level] = "NA"
        return lcadict

    def write_lca(self):
        '''
        Example output: "readid \t Domain \t Phylum ..."
        Write the consensus to file, incl. header line.
        '''
        lcadict = self.comp_lca()
        with open(self.lca, 'w') as f:
            f.write('#readid' + '\t' + '\t'.join(self.levels) + '\n')
            for readid in lcadict:
                tax = []
                for level in self.levels:
                    tax.append(lcadict[readid][level])
                f.write(readid + '\t' + '\t'.join(tax) + '\n')
        return


def main(blast, taxonomy, lca, pconsensus):
    m = ToLCA(
        blast = blast,
        taxonomy = taxonomy,
        lca = lca,
        pconsensus = pconsensus
    )
    m.write_lca()
    return


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Obtain the LCA based on a '
        'blast tabular file and a user-defined consensus threshold.')
    parser.add_argument('-b', '--blast', default=None,
        help='Path to blast tabular file with read id and reference '
             'hit id.')
    parser.add_argument('-t', '--taxonomy', default=None,
        help='Path to taxonomy with reference id and 7-level taxonomy '
             'sensu Silva.')
    parser.add_argument('-l', '--lca', default=None,
        help='Path to where the output lca consensus taxonomy should '
             'be written.')
    parser.add_argument('-c', '--pconsensus', type=float, default=0.5,
        help='Consensus threshold ranging between 0.5 and 0.99. Minimum '
             'fraction of hits that need to agree on each taxonomic '
             'classification at each taxonomic level. Majority obtained '
             'with 0.5. Taxonomic annotations for which there is no '
             'consensus are returned as NA.')
    args = parser.parse_args()

    main(
        blast = args.blast,
        taxonomy = args.taxonomy,
        lca = args.lca,
        pconsensus = args.pconsensus
    )
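Again a hypothetical invocation with illustrative paths:

    scripts/tolca.py -b results/run01/blastn/sampleA.blastn.out \
        -t db/common/ref-taxonomy.txt \
        -l results/run01/blastn/sampleA.blastn.taxlist -c 0.5

Finally, tomat.py, which converts the various classifier outputs into the taxonomy and OTU matrices used for plotting: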
__author__ = "Evelien Jongepier"
__copyright__ = "Copyright 2020, Evelien Jongepier"
__email__ = "e.jongepier@uva.nl"
__license__ = "MIT"

import sys
import re
import argparse
from collections import defaultdict


class ToMat(object):

    def __init__(self, taxlist, bblastlike, centrifugelike, krakenlike,
                 dbtaxonomy, dbfasta, dbmap):
        self.taxlist = taxlist
        self.bblastlike = bblastlike
        self.centrifugelike = centrifugelike
        self.krakenlike = krakenlike
        self.dbtaxonomy = dbtaxonomy
        self.dbfasta = dbfasta
        self.dbmap = dbmap

        if taxlist is not None:
            self.pathfile = taxlist
        elif bblastlike is not None:
            self.pathfile = bblastlike
        elif centrifugelike is not None:
            self.pathfile = centrifugelike
        elif krakenlike is not None:
            self.pathfile = krakenlike

        self.dir = '/'.join(self.pathfile.split('/')[:3])
        self.run = self.pathfile.split('/')[1]
        self.method = self.pathfile.split('/')[2]
        self.sample = self.pathfile.split('/')[3].split('.')[0]
        self.outpath = self.dir + '/' + self.sample + '.' + self.method

        ## if input is bblastlike then create otumat and taxmat from taxlist created here
        if any([bblastlike, centrifugelike, krakenlike]):
            self.taxlist = self.outpath + ".taxlist"
        return

    def get_reftax_dict(self):
        '''
        Example output: {refid : [Domain, Phylum, ...]}
        Create a dict of taxdb identifiers and taxdb taxonomy. The input
        file can be a taxonomy table or a fasta with the taxonomy in the
        header, depending on which one is provided.
        '''
        refdict = {}
        if self.dbtaxonomy is not None:
            with open(self.dbtaxonomy) as reftax:
                for l in reftax.readlines():
                    refid, taxstr = l.strip().split('\t')[:2]
                    taxstr = re.sub(r'D_[0-6]__', '', taxstr)
                    refdict[refid] = taxstr.split(';')[:6]
        elif self.dbfasta is not None:
            with open(self.dbfasta) as fasta:
                for l in fasta.readlines():
                    if l.startswith('>'):
                        line = l.strip().split('>')
                        refid = line[1].split(' ')[0]
                        taxstr = line[1].split(' ')[1]
                        refdict[refid] = taxstr.split(';')[:6]
        return refdict

    def get_krakentax_dict(self):
        '''
        Example output: {ncbiid : [silvaid1, silvaid2, ...]}
        For krakenlike input the ncbi refids should first be translated
        into silva refids. This creates a dictionary of silva to ncbi
        id. Each silvaid has only one ncbiid, but the same ncbiid can
        have multiple silvaids.
        '''
        krakendict = defaultdict(list)
        with open(self.dbmap) as map:
            for l in map.readlines():
                line = l.strip().split('\t')
                ncbiid = line[1]
                krakendict[ncbiid].append(line[0])
        return krakendict

    def get_krakenlike_read_dict(self):
        '''
        Example output: {readid : [refid1, refid2, ...]}
        For krakenlike input the ncbi refids in the input table should
        first be replaced by the silva refids. There are multiple silva
        ref ids associated with each readid, hence a dict of lists.
        '''
        with open(self.krakenlike) as f:
            readdict = defaultdict(list)
            krakendict = self.get_krakentax_dict()
            for l in f.readlines():
                if l.startswith("#") or not l.strip():
                    continue
                else:
                    readid, refrefid = l.strip().split('\t')[1:3]
                    if refrefid in krakendict:
                        readdict[readid].extend(krakendict[refrefid])
        return readdict

    def get_read_dict(self):
        '''
        Example output: {readid : [refid1, refid2, ...]}
        Read a bblastlike file and store reads as keys in a dictionary
        with taxids as value list. Some classifiers return multiple
        refids per read, hence the list (e.g. centrifuge, but not
        bestblast).
        '''
        readdict = defaultdict(list)
        with open(self.pathfile) as f:
            for l in f.readlines():
                if l.startswith("#") or not l.strip():
                    continue
                else:
                    readid, refid = l.strip().split('\t')[:2]
                    readdict[readid].append(refid)
        return readdict

    def parse_read_dict(self):
        '''
        Parse the correct read dict depending on the input data type.
        '''
        if self.krakenlike is not None:
            readdict = self.get_krakenlike_read_dict()
        else:
            readdict = self.get_read_dict()
        return readdict

    def get_taxlist_dict(self):
        '''
        Example output: {readid : [Domain, Phylum, Class, ...]}
        Read a bblastlike file and link taxdb identifiers to taxdb
        taxonomy in refdict to store in the taxlist dictionary.
        '''
        readdict = self.parse_read_dict()
        refdict = self.get_reftax_dict()
        taxdict = defaultdict(list)
        for readid in readdict:
            refidlist = readdict[readid]
            for refid in refidlist:
                if refid in refdict and len(refdict[refid]) > 5:
                    '''
                    If already in dict, compare for each tax level
                    whether there is consensus. If not, replace with NA.
                    '''
                    if readid in taxdict:
                        for level in range(5, -1, -1):
                            if taxdict[readid][level] == refdict[refid][level]:
                                break
                            else:
                                taxdict[readid][level] = "NA"
                    else:
                        taxdict[readid] = refdict[refid]
        return taxdict

    def write_taxlist(self):
        '''
        Example output: "readid \t Domain \t Phylum ..."
        Write taxlist to file, incl. header line.
        '''
        taxlist = self.get_taxlist_dict()
        with open(self.outpath + ".taxlist", 'w') as f:
            f.write('#readid' + '\t' + 'Domain' + '\t' + 'Phylum' + '\t' + 'Class'
                    + '\t' + 'Order' + '\t' + 'Family' + '\t' + 'Genus' + '\n')
            for readid in taxlist:
                tax = taxlist[readid]
                f.write(readid + '\t' + '\t'.join(tax) + '\n')

    def get_taxmat_dict(self):
        '''
        Example output: {taxstr: [Domain, Phylum, ...]}
        Read the taxlist, concatenate the taxonomy into a taxid, and
        create a dict with key taxid and value taxonomy.
        '''
        with open(self.taxlist) as f:
            taxdict = {}
            for l in f.readlines():
                if l.startswith("#") or not l.strip():
                    continue
                else:
                    line = l.strip().split('\t')
                    tax = line[1:7]
                    taxid = '_'.join(tax)
                    taxdict[taxid] = tax
        return taxdict

    def write_taxmat(self):
        '''
        Example output: "taxstr \t Domain \t Phylum ..."
        Write taxmat to file, incl. header line.
        '''
        with open(self.outpath + ".taxmat", 'w') as f:
            f.write('#taxid' + '\t' + 'Domain' + '\t' + 'Phylum' + '\t' + 'Class'
                    + '\t' + 'Order' + '\t' + 'Family' + '\t' + 'Genus' + '\n')
            for taxid in self.get_taxmat_dict():
                tax = self.get_taxmat_dict()[taxid]
                taxlab = re.sub(r'[_NA]*_NA$', '', taxid)
                taxstr = '\t'.join(tax)
                taxstr = re.sub(r'[\tNA]*\tNA$', '', taxstr)
                taxstr = re.sub(r'_', ' ', taxstr)
                f.write(taxlab + '\t' + taxstr + '\n')

    def get_otumat_dict(self):
        '''
        Example output: {taxstr: count}
        Read the taxlist, concatenate the taxonomy into a taxid, and
        create a dict with key taxid and value count, where count is
        the frequency with which each taxid occurred in the taxlist.
        '''
        with open(self.taxlist) as f:
            otudict = defaultdict(int)
            for l in f.readlines():
                if l.startswith("#") or not l.strip():
                    continue
                else:
                    line = l.strip().split('\t')
                    tax = line[1:7]
                    taxid = '_'.join(tax)
                    otudict[taxid] += 1
        return otudict

    def write_otumat(self):
        '''
        Example output: "taxstr \t count"
        Write otumat to file, incl. header line.
        '''
        with open(self.outpath + ".otumat", 'w') as f:
            f.write('#taxid' + '\t' + self.method + '_' + self.sample + '\n')
            for taxid in self.get_otumat_dict():
                count = self.get_otumat_dict()[taxid]
                taxlab = re.sub(r'[_NA]*_NA$', '', taxid)
                f.write(taxlab + '\t' + str(count) + '\n')


def main(taxlist, bblastlike, centrifugelike, krakenlike, dbtaxonomy, dbfasta, dbmap):
    m = ToMat(
        taxlist = taxlist,
        bblastlike = bblastlike,
        centrifugelike = centrifugelike,
        krakenlike = krakenlike,
        dbtaxonomy = dbtaxonomy,
        dbfasta = dbfasta,
        dbmap = dbmap
    )
    if any([bblastlike, centrifugelike, krakenlike]):
        m.write_taxlist()
    m.write_taxmat()
    m.write_otumat()
    return


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert output tables '
        'from classifiers in the NanoClass snakemake pipeline to a '
        'taxonomy and otu matrix. These serve to generate taxonomic '
        'barplots. If not given, the script also produces a taxlist, '
        'which serves as input to assign consensus and majority '
        'classifications.')
    parser.add_argument('-l', '--taxlist', default=None,
        help='Path to a file with read IDs in the first column, '
             'followed by at least 6 columns with tab-delimited 7-level '
             'taxonomic classifications sensu Silva.')
    parser.add_argument('-b', '--bblastlike', default=None,
        help='Path to a best-blast-like output file with the first '
             'column listing the read ID and the second column the '
             'taxonomic identifier.')
    parser.add_argument('-c', '--centrifugelike', default=None,
        help='Path to a centrifuge-like output file. The first column '
             'should contain the read ID and the second the taxonomic '
             'identifier.')
    parser.add_argument('-k', '--krakenlike', default=None,
        help='Path to a kraken-like output file. The second column '
             'should contain the read ID and the third the ncbi '
             'taxonomic identifier.')
    parser.add_argument('-t', '--dbtaxonomy', default=None,
        help='Path to the taxonomy of the database used to get '
             'taxonomic classifications. Only required in combination '
             'with --bblastlike.')
    parser.add_argument('-f', '--dbfasta', default=None,
        help='Path to the fasta of the database that was used to get '
             'taxonomic classifications. Fasta headers should contain '
             'the taxids and the taxonomy string. Only required in '
             'combination with --centrifugelike.')
    parser.add_argument('-m', '--dbmap', default=None,
        help='Path to the map of the database that was used to get '
             'taxonomic classifications. The map links the taxonomy in '
             'the third column of --krakenlike to the taxids in '
             '--dbfasta. Only required in combination with --krakenlike.')
    args = parser.parse_args()

    assert any([args.taxlist, args.bblastlike, args.centrifugelike,
                args.krakenlike]), '''
        \n Please provide the path to the taxlist or bblastlike.
        Read the help (--help) for further information.
        '''

    main(
        taxlist = args.taxlist,
        bblastlike = args.bblastlike,
        centrifugelike = args.centrifugelike,
        krakenlike = args.krakenlike,
        dbtaxonomy = args.dbtaxonomy,
        dbfasta = args.dbfasta,
        dbmap = args.dbmap
    )
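A hypothetical invocation on a taxlist (path illustrative):

    scripts/tomat.py -l results/run01/blastn/sampleA.blastn.taxlist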