NanoClass: A Taxonomic Meta-Classifier for Oxford Nanopore Amplicon Sequencing Data
NanoClass is a taxonomic meta-classifier for 16S/18S amplicon sequencing data generated with the Oxford Nanopore MinION. With a single command, you can run up to 11 popular classification tools on multiple samples in parallel, including BLASTN, Centrifuge, Kraken2, IDTAXA, MegaBLAST, dcMegaBLAST, Minimap2, Mothur, QIIME2, RDP and SPINGO. Optional read preparation steps, such as quality trimming, length filtering and sub-sampling, are an integral part of the pipeline.
NanoClass automatically installs all software packages and dependencies, downloads and builds required taxonomic databases and runs the analysis on your samples.
Getting started
Requirements
The entire NanoClass workflow can be run on a powerful desktop computer, but for many applications a laptop will do. Most classification tools implemented in NanoClass run in a matter of minutes to hours. Prerequisites are Conda and Snakemake.
NanoClass automatically installs all other software packages and dependencies.
Installation
You can either clone NanoClass, like so:
git clone https://github.com/ejongepier/NanoClass.git
Or download and extract the zip archive from https://github.com/ejongepier/NanoClass.
NanoClass is immediately ready for use. See also the Documentation.
Usage
Quick start
Enter your samples and the paths to your fastq.gz files in sample.csv. Sample labels should be unique, and both sample and run labels should contain letters and numbers only. The barcode column should be left empty, meaning your input files should already be demultiplexed. For an example, see the sample.csv file shipped with NanoClass.
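For illustration, a minimal sample.csv might look like the sketch below; the column names and paths are illustrative assumptions, so treat the sample.csv template in the repository as authoritative:

    run,sample,barcode,path
    run01,sampleA,,data/run01/sampleA.fastq.gz
    run01,sampleB,,data/run01/sampleB.fastq.gz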
After editing sample.csv, the entire pipeline can be run with a single command:
snakemake --use-conda --cores <ncores>
where <ncores> is the number of CPU cores/jobs that can be run in parallel on your system.
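If you first want to verify that everything is set up correctly, Snakemake's standard dry-run flag prints the planned jobs without executing anything, for example with 4 cores:

    snakemake --use-conda --cores 4 -n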
Customizing
You can customize the pipeline through the config.yaml, e.g. by running only a subset of the reads or classification tools, or by changing the default Silva 16S taxonomic database.
For details on how to customize NanoClass, see the Documentation.
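As an illustration of such a customization, a config.yaml edit might look like the sketch below; the key names here are assumptions for illustration only, so consult the config.yaml that ships with NanoClass for the authoritative schema:

    # illustrative sketch, not the authoritative NanoClass config schema
    methods: ["blastn", "kraken2", "minimap2"]   # run only a subset of classifiers
    subsample:
      skip: false
      samplesize: 100                            # classify only 100 reads per sample
    database:
      ssu: "16S"                                 # or "18S" for the other Silva subunit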
Report
After successful execution, you can create an interactive HTML report with:
snakemake --report report/NanoClass.zip
Authors
- Evelien Jongepier (e.jongepier@uva.nl)
Code Snippets
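The snippets below reproduce the shell directives of the individual Snakemake rules, grouped per classification tool, followed by the helper scripts they call. First, the BLASTN rules: build the BLAST database, split the query reads into chunks, run blastn, filter hits by alignment length and summarize runtimes, compute a lowest-common-ancestor (LCA) classification with tolca.py, and convert the result to matrices with tomat.py: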
shell:
    """
    makeblastdb -in {input} -parse_seqids \
        -dbtype nucl 2>&1 | tee -a {log}
    """
shell:
    """
    fasta-splitter --n-parts {params.n_chunks} {input} \
        --nopad --out-dir {params.out_dir}
    """
shell:
    """
    blastn -task 'blastn' -db {input.db} -evalue {params.eval} \
        -perc_identity {params.pident} -max_target_seqs {params.nseqs} \
        -query {input.fasta} -out {output.all} -outfmt 6 \
        2> {log}
    """
shell:
    """
    cat {input.out} | awk -F '\\t' -v l={params.alnlen} \
        '{{if ($4 > l) print $0}}' \
        > {output.out}
    cat {input.benchm} | grep -P "^[0-9]" | \
        awk '{{sum+=$1}} END {{print "s"; print sum/{params.threads}}}' \
        > {output.benchm} 2> {log}
    """
shell:
    """
    scripts/tolca.py -b {input.blast} -t {input.db} \
        -l {output} -c {params.lcacons} > {log}
    """
shell:
    "scripts/tomat.py -l {input} 2> {log}"
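Centrifuge: download the taxonomy and mapping files, reformat the reference with todb.py, build the index, classify the reads, and convert the output: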
shell:
    """
    centrifuge-download -o db/centrifuge/taxonomy taxonomy > {log} 2>&1
    wget {params.map_url} -q -O - | gzip -d -c - | \
        awk '{{print $1\".\"$2\".\"$3\"\t\"$(NF)}}' \
        > {output.tax_map} 2>> {log}
    scripts/todb.py -s {input.seq} -t {input.tax} -m centrifuge \
        -S {output.ref_seqs} -T {output.ref_tax} 2>> {log}
    """
shell:
    """
    centrifuge-build \
        --threads {threads} \
        --conversion-table {input.conversion_table} \
        --taxonomy-tree {input.tax_tree} \
        --name-table {input.name_table} \
        {input.ref_seqs} {params.prefix} > {log} 2>&1
    """
shell:
    """
    centrifuge -x {params.index_prefix} \
        -U {input.fastq} \
        --threads {threads} \
        --report-file {output.report} \
        -S {output.classification} \
        --met-stderr > {log} 2>&1
    """
shell:
    "scripts/tomat.py -c {input.out} -f {input.ref_seqs} 2> {log}"
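Shared rules: download and unpack the Silva reference database, and generate the abundance barplots, consensus classifications, precision plots, and runtime plots: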
shell:
    """
    wget -q -O db/common/db.zip {params.url}
    unzip -q -p -j db/common/db.zip \
        */taxonomy/{params.ssu}_only/99/majority_taxonomy_7_levels.txt \
        > {output.ref_tax} 2> {log}
    unzip -q -p -j db/common/db.zip \
        */rep_set/rep_set_{params.ssu}_only/99/silva_*_99_{params.ssu}.fna \
        > {output.ref_seqs} 2>> {log}
    unzip -q -p -j db/common/db.zip \
        */rep_set_aligned/99/99_alignment.fna.zip > tmp.aln.zip
    unzip -q -p tmp.aln.zip > {output.ref_aln} 2>> {log}
    rm db/common/db.zip tmp.aln.zip 2>> {log}
    """
shell:
    """
    mkdir -p ./plots
    Rscript scripts/barplot.R {input} 2> {log}
    """
shell:
    "scripts/toconsensus.py -l {input} 2> {log}"
shell:
    """
    mkdir -p ./plots
    Rscript scripts/lineplot.R {input} 2> {log}
    """
shell:
    """
    mkdir -p ./plots
    mkdir -p ./tables
    Rscript scripts/timeplot.R {input} 2> {log}
    """
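Discontiguous MegaBLAST (dc-megablast), following the same split, search, filter, and LCA pattern as the BLASTN rules: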
shell:
    """
    fasta-splitter --n-parts {params.n_chunks} {input} \
        --nopad --out-dir {params.out_dir}
    """
shell:
    """
    blastn -task 'dc-megablast' -db {input.db} -evalue {params.eval} \
        -perc_identity {params.pident} -max_target_seqs {params.nseqs} \
        -query {input.fasta} -out {output.all} -outfmt 6 \
        2> {log}
    """
shell:
    """
    cat {input.out} | awk -F '\\t' -v l={params.alnlen} \
        '{{if ($4 > l) print $0}}' \
        > {output.out}
    cat {input.benchm} | grep -P "^[0-9]" | \
        awk '{{sum+=$1}} END {{print "s"; print sum / {params.threads}}}' \
        > {output.benchm} 2> {log}
    """
shell:
    """
    scripts/tolca.py -b {input.blast} -t {input.db} \
        -l {output} -c {params.lcacons} > {log}
    """
shell:
    "scripts/tomat.py -l {input} 2> {log}"
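IDTAXA: train the DECIPHER classifier on the reference database (learntaxa.R), classify the reads (idtaxa.R), and convert the taxlist: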
shell:
    "Rscript scripts/learntaxa.R {input.ref_seqs} {input.ref_tax} {output} 2> {log}"
shell:
    """
    Rscript scripts/idtaxa.R {input.db} {input.query} {output.tmp} {threads} {params} 2> {log}
    awk 'BEGIN{{FS=OFS="\\t"}} {{for (i=1; i<=7; i++) if ($i ~ /^ *$/) $i="NA"}} 1' {output.tmp} > {output.out}
    """
shell:
    "scripts/tomat.py -l {input.list} 2> {log}"
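Kraken2: build the special database, classify the reads, and convert the output with tomat.py, mapping taxon IDs back via the Kraken2 map: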
shell:
    """
    kraken2-build --db {params.db} --special {params.db_type} \
        --threads {threads} > {log} 2> {log}
    """
shell:
    """
    kraken2 --db {params.db_dir} \
        --output {output.out} \
        --report {output.report} \
        --gzip-compressed \
        --threads {threads} {input.fastq} 2> {log}
    """
shell:
    """
    scripts/tomat.py -k {input.kraken_out} -f {input.silva_seqs} \
        -m {input.kraken_map} 2> {log}
    """
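MAPseq: classify the reads directly against the reference sequences and taxonomy: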
shell:
    """
    mapseq -nthreads {threads} {input.query} {input.ref_seqs} \
        {input.ref_tax} > {output} 2> {log}
    """
shell:
    "scripts/tomat.py -b {input.out} -t {input.db} 2> {log}"
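MegaBLAST, again with the split, search, filter, and LCA pattern: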
shell:
    """
    fasta-splitter --n-parts {params.n_chunks} {input} \
        --nopad --out-dir {params.out_dir}
    """
shell:
    """
    blastn -task 'megablast' -db {input.db} -evalue {params.eval} \
        -perc_identity {params.pident} -max_target_seqs {params.nseqs} \
        -query {input.fasta} -out {output.all} -outfmt 6 \
        2> {log}
    """
shell:
    """
    cat {input.out} | awk -F '\\t' -v l={params.alnlen} \
        '{{if ($4 > l) print $0}}' \
        > {output.out}
    cat {input.benchm} | grep -P "^[0-9]" | \
        awk '{{sum+=$1}} END {{print "s"; print sum / {params.threads}}}' \
        > {output.benchm} 2> {log}
    """
shell:
    """
    scripts/tolca.py -b {input.blast} -t {input.db} \
        -l {output} -c {params.lcacons} > {log}
    """
shell:
    "scripts/tomat.py -l {input} 2> {log}"
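Minimap2: map the reads, extract read-to-reference pairs with samtools, and compute the LCA: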
shell:
    """
    minimap2 {params.extra} -t {threads} -N {params.nseqs} \
        {input.target} {input.query} -o {output} 2> {log}
    """
shell:
    """
    samtools sort -@ {threads} {input} | \
        samtools view -@ {threads} | \
        cut -f 1,3 > {output} 2> {log}
    """
shell:
    """
    scripts/tolca.py -b {input.mm} -t {input.db} \
        -l {output} -c {params.lcacons} > {log}
    """
shell:
    "scripts/tomat.py -l {input} 2> {log}"
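Mothur: reformat the reference with todb.py and classify with mothur's align.seqs: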
shell:
    """
    scripts/todb.py -s {input.aln} -t {input.tax} -m mothur \
        -S {output.aln} -T {output.tax} 2> {log}
    """
shell:
    """
    mkdir -p {output.dir}
    cp {input} {output.dir}
    queryf=$(basename {input.query})
    alnf=$(basename {input.aln})
    echo \"align.seqs(candidate={output.dir}/$queryf, template={output.dir}/$alnf, \
        processors={threads}, ksize=6, align=needleman); -q\" > {output.dir}/mothur.cmd
    mothur {output.dir}/mothur.cmd 2> {log}
    awk -F '\\t' -v OFS='\\t' '{{if (NR==1) printf "%s","#"; print $1, $3}}' \
        {output.dir}/{params.file} > {output.out} 2>> {log}
    """
shell:
    "scripts/tomat.py -b {input.out} -t {input.tax} 2> {log}"
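Read preparation: adapter trimming with Porechop, quality and length filtering with NanoFilt, sub-sampling with seqtk, fastq-to-fasta conversion, and quality-control plots and statistics with pistis and NanoStat: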
shell:
    """
    porechop --input {input} \
        --output {output} \
        --threads {threads} \
        --check_reads {params.check_reads} \
        --discard_middle > {log}
    """
shell:
    """
    gzip -d -c {input} | \
        NanoFilt --quality {params.quality} --length {params.min_len} --maxlength {params.max_len} | \
        gzip > {output} 2> {log}
    """
shell:
    """
    seqtk sample -s {params.seed} {input} {params.n} | \
        gzip -c > {output}
    """
shell:
    "zcat < {input} | sed -n '1~4s/^@/>/p;2~4p' > {output}"
shell:
    """
    pistis --fastq {input} --output {output} \
        --downsample {params.downsample} 2> {log}
    """
shell:
    """
    NanoStat --fastq {input} --name {output} \
        --threads {threads} 2> {log}
    """
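QIIME2: import the query reads, reference sequences, and reference taxonomy as artifacts, classify with classify-consensus-vsearch, and reshape the exported taxonomy table: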
shell:
    """
    qiime tools import \
        --type 'FeatureData[Sequence]' \
        --input-path {input.fasta} \
        --output-path {output} \
        2>&1 | tee -a {log}
    """
shell:
    """
    qiime tools import \
        --type 'FeatureData[Sequence]' \
        --input-path {input} \
        --output-path {output} \
        2>&1 | tee -a {log}
    """
shell:
    """
    qiime tools import \
        --type FeatureData[Taxonomy] \
        --input-format HeaderlessTSVTaxonomyFormat \
        --input-path {input} \
        --output-path {output} \
        2>&1 | tee -a {log}
    """
shell:
    """
    qiime feature-classifier classify-consensus-vsearch \
        --i-query {input.query} \
        --i-reference-reads {input.ref_seqs} \
        --i-reference-taxonomy {input.taxo} \
        --o-classification {output.qza} \
        --p-threads {threads} \
        --p-perc-identity {params.pident} \
        --p-maxaccepts {params.nseqs} \
        --p-min-consensus {params.lcacons} \
        2>&1 | tee -a {log}
    qiime tools export \
        --input-path {output.qza} \
        --output-path {output.path} 2>&1 | tee -a {log}
    mv {output.path}/taxonomy.tsv {output.out} 2>> {log}
    """
shell:
    """
    cut -f 1,2 {input} | grep -P -v '\\tUnassigned$' | \
        awk -F '\\t|;' -v OFS='\\t' '{{NF=7}}1' | \
        awk 'BEGIN {{ FS = OFS = "\\t" }} {{ for(i=1; i<=NF; i++) if($i == "") $i = "NA" }}; 1' | \
        sed "s/D_[[:digit:]]__//g" | \
        sed "s/^Feature ID.*/\#readid\\tDomain\\tPhylum\\tClass\\tOrder\\tFamily\\tGenus/" \
        > {output} 2> {log}
    """
shell:
    "scripts/tomat.py -l {input.list} 2> {log}"
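RDP: reformat the reference with todb.py and classify with the RDP naive Bayesian classifier as implemented in DADA2's assignTaxonomy (assigntaxonomy.R, shown further below):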
shell:
    """
    scripts/todb.py -s {input.seq} -t {input.tax} -m rdp -S tmp.seq -T {output.tax} 2> {log}
    gzip -c tmp.seq > {output.seq} && rm tmp.seq 2>> {log}
    """
shell:
    "Rscript scripts/assigntaxonomy.R {input.db} {input.query} {output} {threads} {params} 2> {log}"
shell:
    "scripts/tomat.py -l {input.list} 2> {log}"
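SPINGO: reformat the reference with todb.py, index it with spindex, classify, and reshape the output: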
shell:
    """
    scripts/todb.py -s {input.seq} -t {input.tax} -m spingo \
        -S {output.seq} -T {output.tax} 2> {log}
    spindex -k 8 -p {threads} -d {output.seq} 2>> {log}
    """
shell:
    """
    spingo -d {input.db} -k 8 -a \
        -p {threads} -i {input.fasta} \
        > {output.tmp} 2> {log}
    sed 's/ /\\t/' {output.tmp} | \
        awk -F '\\t' -v OFS='\\t' \
        '{{ if (NR == 1) print "#readid","Domain","Phylum","Class","Order","Family","Genus";
            else print $1, $4, $6, $8, $10, $12, $14 }}' > {output.out} 2> {log}
    """
shell:
    "scripts/tomat.py -l {input.list} 2> {log}"
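The helper scripts called above follow. assigntaxonomy.R wraps DADA2's assignTaxonomy for the RDP method: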
rm(list=ls())
suppressPackageStartupMessages(library(dada2))
suppressPackageStartupMessages(library(seqinr))

args = commandArgs(trailingOnly=TRUE)
threads = as.numeric(args[4])
pident = as.numeric(args[5])

query <- read.fasta(args[2], as.string = T, forceDNAtolower = F, set.attributes = F)

set.seed(100)
taxann <- assignTaxonomy(seqs = unlist(query), refFasta = args[1],
                         tryRC = T, outputBootstraps = F, minBoot = pident,
                         multithread = threads, verbose = F,
                         taxLevels = c("Kingdom", "Phylum", "Class",
                                       "Order", "Family", "Genus"))

# remove lengthy seqs from row names
queryid <- names(strsplit(rownames(taxann), " "))

taxodf <- as.data.frame(taxann)
for (i in 1:length(taxodf)) taxodf[,i] <- gsub('_', ' ', taxodf[,i])
names(taxodf)[1] <- "Domain"
taxodf$"#readid" <- queryid

write.table(taxodf[,c(7,1:6)], file = args[3],
            row.names=F, col.names=T, quote=F, sep='\t')
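barplot.R merges the per-method taxonomy and OTU matrices into phyloseq objects and writes absolute and relative abundance barplots per taxonomic level: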
suppressPackageStartupMessages(library(phyloseq))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(vroom))
suppressPackageStartupMessages(library(tidyr))

args = commandArgs(trailingOnly=TRUE)

taxmat <- as.data.frame(unique(vroom(delim = '\t', args)))
taxmat <- taxmat[!is.na(taxmat$Domain),]
taxmat <- unique(taxmat)
names(taxmat)[1] <- "taxid"
write.table(taxmat, file="tables/taxonomy-table.tsv", row.names=F, col.names=T, sep='\t', quote=F)
rownames(taxmat) <- taxmat$taxid
taxmat$taxid <- NULL
taxmat <- as.matrix(taxmat)

file <- gsub(".taxmat$", ".otumat", args)
sam <- data.frame(run = rep(NA, length(file)),
                  sample = rep(NA, length(file)),
                  method = rep(NA, length(file)))

for (i in 1:length(file)){
    otumatje <- read.table(file[i], header = T, sep = '\t', comment = "")
    names(otumatje)[1] <- "taxid"
    names(otumatje)[2] <- paste0(strsplit(file[i], "/")[[1]][2],"_",names(otumatje)[2])
    ifelse (i == 1, otumat <- otumatje, otumat <- merge(otumat, otumatje, all=TRUE))
    lab = strsplit(file[i], "/")[[1]][4]
    sam$run[i] = strsplit(file[i], "/")[[1]][2]
    sam$sample[i] = strsplit(lab, "[.]")[[1]][1]
    sam$method[i] = strsplit(lab, "[.]")[[1]][2]
}
otumat[is.na(otumat)] <- 0

## for some custom DB like BOLD local copy, the same taxonomic lineage may occur 2x.
## This causes an error when they are used as rownames, which should be unique --> fix: aggregate
otumat <- aggregate(otumat[2:length(otumat)], by=list(otumat$taxid), sum)
names(otumat)[1] <- "taxid"
write.table(otumat, file="tables/otu-table.tsv", row.names=F, col.names=T, sep='\t', quote=F)
rownames(otumat) <- otumat$taxid
otumat$taxid <- NULL
otumat <- as.matrix(otumat)
rownames(sam) <- colnames(otumat)

OTU = otu_table(otumat, taxa_are_rows = TRUE)
TAX = tax_table(taxmat)
SAM = sample_data(sam)
physeq = phyloseq(OTU, TAX, SAM)
pphyseq = transform_sample_counts(physeq, function(x) x / sum(x))

theme_set(theme_bw())

for (level in colnames(taxmat)){
    top.taxa <- tax_glom(physeq, level)
    TopNOTUs <- names(sort(taxa_sums(top.taxa), TRUE)[1:20])
    BottumNOTUs <- names(taxa_sums(top.taxa))[which(!names(taxa_sums(top.taxa)) %in% TopNOTUs)]
    merged_physeq = merge_taxa(top.taxa, BottumNOTUs, 2)
    mdf = psmelt(merged_physeq)
    names(mdf)[names(mdf) == level] <- "level"
    mdf$OTU[which(is.na(mdf$level))] <- "aaaOther"
    mdf$level[which(is.na(mdf$level))] <- "aaaOther"
    aggr_mdf <- aggregate(Abundance ~ sample + run + method + level, data = mdf, sum)
    labs = aggr_mdf$level; labs[labs=="aaaOther"] <- "Other"
    cols = scales::hue_pal()(length(unique(labs))); cols[unique(labs) == "Other"] <- "#CCCCCC"

    p = ggplot(aggr_mdf, aes_string(x = "method", y = "Abundance", fill = "level"))
    p = p + scale_fill_manual(name = level, labels = unique(labs), values = cols)
    p = p + facet_grid(paste0("", aggr_mdf$run) ~ paste0("", aggr_mdf$sample))
    p = p + geom_bar(stat = "identity", position = "stack", color = "black", size = 0.1)
    p = p + guides(fill=guide_legend(ncol=1))
    p = p + labs(x = "Method", y = "Absolute abundance")
    p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0, size = 5))
    ggsave(paste0("plots/aabund-", level, "-by-sample.pdf"), plot=p, device="pdf")

    p = ggplot(aggr_mdf, aes_string(x = "sample", y = "Abundance", fill = "level"))
    p = p + scale_fill_manual(name = level, labels = unique(labs), values = cols)
    p = p + facet_grid(paste0("", aggr_mdf$run) ~ paste0("", aggr_mdf$method))
    p = p + geom_bar(stat = "identity", position = "stack", color = "black", size = 0.1)
    p = p + guides(fill=guide_legend(ncol=1))
    p = p + labs(x = "Sample", y = "Absolute abundance")
    p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0, size = 5))
    ggsave(paste0("plots/aabund-", level, "-by-method.pdf"), plot=p, device="pdf")
}

for (level in colnames(taxmat)){
    top.taxa <- tax_glom(pphyseq, level)
    TopNOTUs <- names(sort(taxa_sums(top.taxa), TRUE)[1:20])
    BottumNOTUs <- names(taxa_sums(top.taxa))[which(!names(taxa_sums(top.taxa)) %in% TopNOTUs)]
    merged_physeq = merge_taxa(top.taxa, BottumNOTUs, 2)
    mdf = psmelt(merged_physeq)
    names(mdf)[names(mdf) == level] <- "level"
    mdf$OTU[which(is.na(mdf$level))] <- "aaaOther"
    mdf$level[which(is.na(mdf$level))] <- "aaaOther"
    aggr_mdf <- aggregate(Abundance ~ sample + run + method + level, data = mdf, sum)
    labs = aggr_mdf$level; labs[labs=="aaaOther"] <- "Other"
    cols = scales::hue_pal()(length(unique(labs))); cols[unique(labs) == "Other"] <- "#CCCCCC"

    p = ggplot(aggr_mdf, aes_string(x = "method", y = "Abundance", fill = "level"))
    p = p + scale_fill_manual(name = level, labels = unique(labs), values = cols)
    p = p + facet_grid(paste0("", aggr_mdf$run) ~ paste0("", aggr_mdf$sample))
    p = p + geom_bar(stat = "identity", position = "stack", color = "black", size = 0.1)
    p = p + guides(fill=guide_legend(ncol=1))
    p = p + labs(x = "Method", y = "Relative abundance")
    p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0, size = 5))
    ggsave(paste0("plots/rabund-", level, "-by-sample.pdf"), plot=p, device="pdf")

    p = ggplot(aggr_mdf, aes_string(x = "sample", y = "Abundance", fill = "level"))
    p = p + scale_fill_manual(name = level, labels = unique(labs), values = cols)
    p = p + facet_grid(paste0("", aggr_mdf$run) ~ paste0("", aggr_mdf$method))
    p = p + geom_bar(stat = "identity", position = "stack", color = "black", size = 0.1)
    p = p + guides(fill=guide_legend(ncol=1))
    p = p + labs(x = "Sample", y = "Relative abundance")
    p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0, size = 5))
    ggsave(paste0("plots/rabund-", level, "-by-method.pdf"), plot=p, device="pdf")
}
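idtaxa.R classifies the reads with DECIPHER's IdTaxa, using the classifier trained by learntaxa.R: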
rm(list=ls())
suppressPackageStartupMessages(library("DECIPHER"))

args = commandArgs(trailingOnly=TRUE)
threads = as.numeric(args[4])
pctth = as.numeric(args[5])

query <- readDNAStringSet(args[2])
db <- load(args[1])

taxann <- IdTaxa(query, trained_classifiyer,
                 strand = "both",
                 threshold = pctth,
                 processors = threads,
                 verbose = F)

assignment <- sapply(taxann, function(x) paste(x$taxon, collapse="\t"))
names(assignment) <- sapply(strsplit(names(assignment)," "), `[`, 1)

dl <- lapply(assignment, function (x) {gsub("D_[0-6]__", "", x)})
dl <- lapply(dl, function (x) {gsub("Root\t", "", x)})
df <- setNames(as.data.frame(unlist(dl)), "Domain\tPhylum\tClass\tOrder\tFamily\tGenus")
df$'#readid' <- rownames(df)

write.table(df[,c(2,1)], file=args[3], row.names=F, col.names=T, quote=F, sep='\t')
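learntaxa.R trains the DECIPHER classifier on the reference sequences and taxonomy: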
rm(list=ls())
suppressPackageStartupMessages(library("DECIPHER"))

args = commandArgs(trailingOnly=TRUE)

# train classifier
refseqs <- readDNAStringSet(args[1])
taxo = read.table(args[2], sep = "\t", comment.char = "", quote="")
taxonomy <- paste0(taxo$V1, " Root;", taxo$V2)
trained_classifiyer <- LearnTaxa(train = refseqs, taxonomy = taxonomy, verbose = T)

save(trained_classifiyer, file = args[3])
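lineplot.R plots the per-method precision across taxonomic ranks: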
suppressPackageStartupMessages(library(ggplot2))

args = commandArgs(trailingOnly=TRUE)

dl = list()
for (i in 1:length(args)){
    # Get metadata
    lab = strsplit(args[i], "/")[[1]][4]
    run = strsplit(args[i], "/")[[1]][2]
    sample = strsplit(lab, "[.]")[[1]][1]
    method = strsplit(lab, "[.]")[[1]][2]
    # Read data and compute total prop precision
    tbl <- read.table(args[i], header = T, sep = '\t', comment = "", row.names = 1)
    dt <- as.data.frame(colSums(tbl, na.rm=T) / colSums(!is.na(tbl)))
    names(dt) <- "precision"
    # Add meta data columns
    dt$level <- row.names(dt)
    dt$rank <- factor(dt$level, ordered=T,
                      levels=c("Domain","Phylum","Class","Order","Family","Genus"))
    dt$run <- rep(run, nrow(dt))
    dt$sample <- rep(sample, nrow(dt))
    dt$method <- rep(method, nrow(dt))
    # Add to list
    dl[[i]] <- dt
}
df <- do.call(rbind, dl)

theme_set(theme_bw())

p = ggplot(df, aes(x = rank, y = precision, group = method, col = method)) +
    geom_line() + geom_point() +
    labs(x = "", y = "Precision") + ylim(0,1) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p = p + facet_grid(paste0("Run: ",run) ~ paste0("",sample))
ggsave("plots/precision.pdf", plot=p, device="pdf")
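timeplot.R summarizes the per-method and per-sample runtimes from the benchmark files: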
suppressPackageStartupMessages(library(ggplot2))

args = commandArgs(trailingOnly=TRUE)

dl = list()
for (i in 1:length(args)){
    # Get metadata
    lab = strsplit(args[i], "/")[[1]][3]
    run = strsplit(args[i], "/")[[1]][2]
    sample = strsplit(strsplit(lab, "[_]")[[1]][3], "[.]")[[1]][1]
    method = strsplit(lab, "[_]")[[1]][1]
    # Read data
    dt <- as.data.frame(read.table(args[i], header = T, sep = '\t', comment = "")[,1])
    # Add meta data columns
    dt$run <- rep(run, nrow(dt))
    dt$sample <- rep(sample, nrow(dt))
    dt$method <- rep(method, nrow(dt))
    # Add to list
    dl[[i]] <- dt
}
df <- do.call(rbind, dl)
names(df)[1] <- "s"

theme_set(theme_bw())

p1 = ggplot(df, aes(x = method, y = s, fill=method)) +
    geom_bar(stat="identity", color="black") +
    labs(x = "", y = "Runtime (s)") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p1 = p1 + facet_grid(paste0("Run: ",run) ~ paste0("Sample: ",sample))
p1 = p1 + theme(legend.position="none")
ggsave("plots/runtime-by-sample.pdf", plot=p1, device="pdf")

q1 = ggplot(df, aes(x = method, y = s, fill=method)) +
    geom_bar(stat="identity", color="black") +
    scale_y_continuous(trans = "log10") +
    labs(x = "", y = "Runtime (s)") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
q1 = q1 + facet_grid(paste0("Run: ",run) ~ paste0("Sample: ",sample))
q1 = q1 + theme(legend.position="none")
ggsave("plots/runtime_log-by-sample.pdf", plot=q1, device="pdf")

p2 = ggplot(df, aes(x = sample, y = s, fill=sample)) +
    geom_bar(stat="identity", color="black") +
    labs(x = "", y = "Runtime (s)") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p2 = p2 + facet_grid(paste0("Run: ",run) ~ paste0("Method: ",method))
p2 = p2 + theme(legend.position="none")
ggsave("plots/runtime-by-method.pdf", plot=p2, device="pdf")

q2 = ggplot(df, aes(x = sample, y = s, fill=sample)) +
    geom_bar(stat="identity", color="black") +
    scale_y_continuous(trans = "log10") +
    labs(x = "", y = "Runtime (s)") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
q2 = q2 + facet_grid(paste0("Run: ",run) ~ paste0("Method: ",method))
q2 = q2 + theme(legend.position="none")
ggsave("plots/runtime_log-by-method.pdf", plot=q2, device="pdf")
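toconsensus.py derives consensus classifications across tools and writes per-tool precision tables: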
__author__ = "Evelien Jongepier"
__copyright__ = "Copyright 2020, Evelien Jongepier"
__email__ = "e.jongepier@uva.nl"
__license__ = "MIT"

import sys
import re
import argparse
import collections


class ToConsensus(object):

    def __init__(self, taxlists, consensustax, pconsensus):
        self.taxlists = taxlists
        self.consensustax = consensustax
        self.pconsensus = pconsensus
        return

    def taxlist_per_sample_per_run_dict(self):
        '''
        Example output: {run : {sample1 : [path1, path2, ...],
                                sample2 : [path3, path4, ...], ...}
        Stores the path to each of the input taxlists for each of the
        samples for each of the runs.
        '''
        sampledict = collections.defaultdict(
            dict = collections.defaultdict(list))
        for idx, taxlist in enumerate(self.taxlists):
            run = taxlist.split('/')[1]
            sample = taxlist.split('/')[3].split('.')[0]
            if run not in sampledict:
                sampledict[run] = {sample: [taxlist]}
            elif sample not in sampledict[run]:
                sampledict[run][sample] = [taxlist]
            else:
                sampledict[run][sample].append(taxlist)
        return sampledict

    def run_all_runs_n_samples(self):
        '''
        '''
        sampledict = self.taxlist_per_sample_per_run_dict()
        for run in sampledict:
            for sample in sampledict[run]:
                ftaxlists = sampledict[run][sample]
                samplereaddict = self.get_read_dict_per_sample(ftaxlists)
                nlists = len(ftaxlists)
                consensusreaddict = self.comp_consensus_tax(samplereaddict, nlists)
                for ftaxlist in ftaxlists:
                    precisiondict = self.comp_precision_per_list(consensusreaddict, ftaxlist)
                    path = self.get_output_path(ftaxlist)
                    self.write_precision_tax(precisiondict, path)
                #self.write_consensus_tax(
        return

    def get_output_path(self, ftaxlist):
        dir = '/'.join(ftaxlist.split('/')[:3])
        method = ftaxlist.split('/')[2]
        sample = ftaxlist.split('/')[3].split('.')[0]
        outpath = dir + '/' + sample + '.' + method + '.precision'
        return outpath

    def get_read_dict_per_list(self, taxlist):
        '''
        Example output: readdict = {read_id: {Domain:tool1}, {Phylum:tool1}, {Class:tool1}, ...}
        Stores the tax classification of a single taxlist (i.e. the
        taxonomic classification based on a single tool) in a dict of
        dicts. The outer dict has readid as key and a dict of tax as
        value. The inner / nested dicts have tax level as key, with the
        tax annotation at that level for that read in that taxlist as str.
        '''
        readdict = collections.defaultdict(dict)
        with open(taxlist) as f:
            for l in f.readlines():
                if l.startswith("#"):
                    headers = l.strip().split('\t')[1:]
                else:
                    readid = l.strip().split('\t')[0]
                    taxlist = l.strip().split('\t')[1:7]
                    readdict[readid] = dict(
                        (headers[idx], tax) for idx, tax in enumerate(taxlist))
        return readdict

    def get_read_dict_per_sample(self, ftaxlists):
        '''
        Example output: readdict = {read_id: {Domain:[tool1, tool2, tool3, ...]},
                                    {Phylum:[tool1, tool2, tool3, ...]}, ...}
        Stores the tax classification of all taxlists (i.e. the taxonomic
        classification based on all tools) in a dict of dicts of lists.
        The outer dict has readid as key and a dict of tax as value. The
        inner / nested dicts have tax level as key, with the tax
        annotations at that level for that read in each of the taxlists
        (thus tools) as a list.
        '''
        samplereaddict = collections.defaultdict(
            dict = collections.defaultdict(list))
        for taxlist in ftaxlists:
            readdict = self.get_read_dict_per_list(taxlist)
            for readid in readdict:
                if readid not in samplereaddict:
                    samplereaddict[readid] = readdict[readid]
                    # change tax into list so all taxlists can be appended
                    for level in readdict[readid]:
                        samplereaddict[readid][level] = [samplereaddict[readid][level]]
                else:
                    for level in readdict[readid]:
                        tax = readdict[readid][level]
                        samplereaddict[readid][level].append(tax)
        return samplereaddict

    def comp_consensus_tax(self, samplereaddict, nlists):
        '''
        Example output: {readid = {Domain:cons}, {Phylum:cons}, ...}
        Compute the consensus for each read at each taxonomic level,
        based on the user-defined threshold (0.5 = majority) and the
        taxonomic classifications of each tool.
        '''
        consensusreaddict = collections.defaultdict(dict)
        for readid in samplereaddict:
            for level in samplereaddict[readid]:
                freqdict = collections.Counter(samplereaddict[readid][level])
                maxkey = max(freqdict, key=freqdict.get)
                if (freqdict[maxkey] / float(nlists)) > self.pconsensus:
                    consensusreaddict[readid][level] = maxkey
                else:
                    consensusreaddict[readid][level] = "NA"
        return consensusreaddict

    def comp_precision_per_list(self, consensusreaddict, ftaxlist):
        '''
        Example output: {readid = {Domain:bool}, {Phylum:bool}, ...}
        For each read and each taxonomic level, a boolean indicating
        whether the classification matches the consensus classification
        or not.
        '''
        readdict = self.get_read_dict_per_list(ftaxlist)
        precisiondict = collections.defaultdict(dict)
        for readid in readdict:
            for level in readdict[readid]:
                if readdict[readid][level] == 'NA' or consensusreaddict[readid][level] == 'NA':
                    precisiondict[readid][level] = 'NA'
                elif readdict[readid][level] == consensusreaddict[readid][level]:
                    precisiondict[readid][level] = '1'
                else:
                    precisiondict[readid][level] = '0'
        return precisiondict

    def write_consensus_tax(self):
        '''
        Example output: "readid \t Domain \t Phylum ..."
        Write the consensus to file, incl. header line.
        '''
        consensusreaddict = self.comp_consensus_tax()
        levels = ['Domain','Phylum','Class','Order','Family','Genus']
        with open(self.consensustax, 'w') as f:
            f.write('#readid' + '\t' + '\t'.join(levels) + '\n')
            for readid in consensusreaddict:
                tax = []
                for level in levels:
                    tax.append(consensusreaddict[readid][level])
                f.write(readid + '\t' + '\t'.join(tax) + '\n')
        return

    def write_precision_tax(self, precisiondict, path):
        '''
        Example output: "readid \t Domain \t Phylum ..."
        Write the precision (0/1) to file, incl. header line.
        '''
        levels = ['Domain','Phylum','Class','Order','Family','Genus']
        with open(path, 'w') as f:
            f.write('#readid' + '\t' + '\t'.join(levels) + '\n')
            for readid in precisiondict:
                tax = []
                for level in levels:
                    tax.append(precisiondict[readid][level])
                f.write(readid + '\t' + '\t'.join(tax) + '\n')
        return


def main(taxlists, consensustax, pconsensus):
    m = ToConsensus(
        taxlists = taxlists,
        consensustax = consensustax,
        pconsensus = pconsensus
    )
    m.run_all_runs_n_samples()
    return


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Obtain consensus and '
        'majority taxonomic classifications based on a list of taxlists '
        'as generated by various classification tools. Taxlists can be '
        'obtained with tomat.py')
    parser.add_argument('-l', '--taxlists', nargs='+', default=None,
        help='List of paths to taxlists, which contain a taxonomic '
             'identifier in the first column, followed by tab-delimited '
             '6-level taxonomic classifications sensu Silva.')
    parser.add_argument('-o', '--consensustax', default=None,
        help='Path to where the output consensus taxonomy should be '
             'written.')
    parser.add_argument('-c', '--pconsensus', type=float, default=0.5,
        help='Consensus threshold ranging between 0.5 and 1. Minimum '
             'fraction of tools that need to agree on each taxonomic '
             'classification at each taxonomic level. Majority obtained '
             'with 0.5. Taxonomic annotations for which there is no '
             'consensus are returned as NA.')
    args = parser.parse_args()

    assert (len(args.taxlists) > 2), '''
        \n Please provide at least three taxlists to determine consensus
        taxonomy. Read the help (--help) for further information.
        '''

    main(
        taxlists = args.taxlists,
        consensustax = args.consensustax,
        pconsensus = args.pconsensus
    )
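A hypothetical invocation; the result paths below are illustrative, derived from the run/sample/method directory layout the script expects:

    scripts/toconsensus.py -c 0.5 \
        -l results/run01/blastn/sampleA.blastn.taxlist \
           results/run01/kraken2/sampleA.kraken2.taxlist \
           results/run01/minimap2/sampleA.minimap2.taxlist

Next, todb.py, which converts the qiime2-formatted reference database into the formats the other classifiers expect: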
__author__ = "Evelien Jongepier"
__copyright__ = "Copyright 2020, Evelien Jongepier"
__email__ = "e.jongepier@uva.nl"
__license__ = "MIT"

import sys
import re
import argparse
import collections
from Bio import SeqIO


class ToDb(object):

    def __init__(self, refseq, reftax, method, outseq, outtax):
        self.refseq = refseq
        self.reftax = reftax
        self.method = method
        self.outseq = outseq
        self.outtax = outtax
        return

    def get_ref_dict(self):
        '''
        Example output: {id1:"domain1;phylum1;...", id2:"domain2;phylum2;..."}
        Create a dictionary with sequence id as key and taxonomic string
        as value. The exact formatting of the taxonomic string differs
        between methods.
        '''
        refdict = {}
        with open(self.reftax) as t:
            for line in t.readlines():
                id, tax = line.strip().split('\t')[:2]
                tax = re.sub(r'D_[0-6]__', '', tax)
                tax = re.sub(r' ', '_', tax)
                if self.method == 'centrifuge':
                    refdict[id] = tax
                else:
                    tax = tax.split(';')[:6]
                    if self.method == 'mothur':
                        refdict[id] = ';'.join(tax)
                    if self.method == 'rdp':
                        refdict[id] = ';'.join(tax)
                    if self.method == 'spingo':
                        tax.reverse()
                        refdict[id] = '\t'.join(tax)
        return refdict

    def get_out_files(self):
        '''
        Reformat fasta headers to make them compatible with the selected
        method.
        '''
        with open(self.outseq, "w") as outs, open(self.outtax, "w") as outt:
            refdict = self.get_ref_dict()
            for record in SeqIO.parse(open(self.refseq), 'fasta'):
                if record.id in refdict:
                    record.description = ''
                    outt.write(record.id + '\t' + refdict[record.id] + '\n')
                    if self.method == 'mothur':
                        record.id = record.id + '\tNA\t' + refdict[record.id]
                    if self.method == 'rdp':
                        record.id = refdict[record.id] + ';'
                    if self.method == 'spingo':
                        record.id = record.id + "]\t" + refdict[record.id]
                    if self.method == 'centrifuge':
                        record.id = record.id + ' ' + refdict[record.id]
                    SeqIO.write(record, outs, "fasta")


def main(refseq, reftax, method, outseq, outtax):
    m = ToDb(
        refseq = refseq,
        reftax = reftax,
        method = method,
        outseq = outseq,
        outtax = outtax
    )
    m.get_out_files()
    return


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert a qiime2-formatted '
        'database, consisting of a sequence file and a 7-level taxonomy '
        'file, into database file(s) compatible with other taxonomic '
        'classification methods.')
    parser.add_argument('-s', '--refseq', default=None,
        help='Path to fasta file of reference nucleotide sequences. '
             'Headers should contain only the sequence identifier.')
    parser.add_argument('-t', '--reftax', default=None,
        help='Path to tab-separated table with reference taxonomy. '
             'The first column should contain the sequence identifier '
             'as given in "refseq", the second column a semicolon-'
             'separated 7-level taxonomy.')
    parser.add_argument('-m', '--method', default=None,
        help='A string specifying for which taxonomic classification '
             'tool the output db files should be re-formatted. Valid '
             'values are "mothur", "centrifuge", "rdp" or "spingo".')
    parser.add_argument('-S', '--outseq', default=None,
        help='Path to output fasta file with re-formatted sequences.')
    parser.add_argument('-T', '--outtax', default=None,
        help='Path to tab-separated output table with re-formatted '
             'taxonomy.')
    args = parser.parse_args()

    main(
        refseq = args.refseq,
        reftax = args.reftax,
        method = args.method,
        outseq = args.outseq,
        outtax = args.outtax
    )
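tolca.py computes a lowest-common-ancestor classification from tabular BLAST-like hits: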
__author__ = "Evelien Jongepier"
__copyright__ = "Copyright 2021, Evelien Jongepier"
__email__ = "e.jongepier@uva.nl"
__license__ = "MIT"

import sys
import re
import argparse
import collections


class ToLCA(object):

    def __init__(self, blast, taxonomy, lca, pconsensus):
        self.blast = blast
        self.taxonomy = taxonomy
        self.lca = lca
        self.pconsensus = pconsensus
        self.levels = ["Domain","Phylum","Class","Order","Family","Genus"]
        return

    def get_blastdict(self):
        '''
        Example output: readdict = {read_id: [hitid1, hitid2, hitid3], ...}
        Stores the list of blast hit ids for each read in a dict.
        '''
        readdict = collections.defaultdict(list)
        with open(self.blast) as f:
            for l in f.readlines():
                readid = l.strip().split('\t')[0]
                hitid = l.strip().split('\t')[1]
                if readid not in readdict:
                    readdict[readid] = [hitid]
                else:
                    readdict[readid].append(hitid)
        return readdict

    def get_dbdict(self):
        '''
        Example output: dbdict = {dbid: [domain, order, class, ...], ...}
        Stores the list of taxonomic annotations (i.e. 7 levels) for
        each reference in the db.
        '''
        dbdict = collections.defaultdict(list)
        with open(self.taxonomy) as f:
            for l in f.readlines():
                dbid, dbtaxstr = l.strip().split('\t')[:2]
                dbtaxstr = re.sub(r'D_[0-6]__', '', dbtaxstr)
                dbtax = dbtaxstr.split(';')[:6]
                dbdict[dbid] = dbtax
        return dbdict

    def get_classdict(self):
        '''
        Example output: classdict = {readid: {domain: [hit1, hit2, hit3, ...], ...}, ...}
        Stores the list of classifications at each of the 7 levels of
        the taxonomy for each db hit of each read. Redundant hits, not
        yet consensus.
        '''
        classdict = collections.defaultdict(
            dict = collections.defaultdict(list))
        readdict = self.get_blastdict()
        dbdict = self.get_dbdict()
        for readid in readdict:
            for hitid in readdict[readid]:
                dbtaxs = dbdict[hitid]
                if readid not in classdict:
                    classdict[readid] = dict(
                        (self.levels[idx], [tax]) for idx, tax in enumerate(dbtaxs))
                else:
                    for idx, tax in enumerate(dbtaxs):
                        classdict[readid][self.levels[idx]].append(tax)
        return classdict

    def comp_lca(self):
        '''
        Example output: {readid = {Domain:cons}, {Phylum:cons}, ...}
        Compute the consensus for each read at each taxonomic level,
        based on the user-defined threshold (0.5 = majority) and the
        taxonomic classifications of each hit.
        '''
        lcadict = collections.defaultdict(dict)
        classdict = self.get_classdict()
        for readid in classdict:
            for level in classdict[readid]:
                freqdict = collections.Counter(classdict[readid][level])
                totfreq = sum(freqdict.values())
                maxkey = max(freqdict, key=freqdict.get)
                if (freqdict[maxkey] / float(totfreq)) > self.pconsensus:
                    lcadict[readid][level] = maxkey
                else:
                    lcadict[readid][level] = "NA"
        return lcadict

    def write_lca(self):
        '''
        Example output: "readid \t Domain \t Phylum ..."
        Write the consensus to file, incl. header line.
        '''
        lcadict = self.comp_lca()
        with open(self.lca, 'w') as f:
            f.write('#readid' + '\t' + '\t'.join(self.levels) + '\n')
            for readid in lcadict:
                tax = []
                for level in self.levels:
                    tax.append(lcadict[readid][level])
                f.write(readid + '\t' + '\t'.join(tax) + '\n')
        return


def main(blast, taxonomy, lca, pconsensus):
    m = ToLCA(
        blast = blast,
        taxonomy = taxonomy,
        lca = lca,
        pconsensus = pconsensus
    )
    m.write_lca()
    return


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Obtain the LCA based on a '
        'blast tabular file and a user-defined consensus threshold.')
    parser.add_argument('-b', '--blast', default=None,
        help='Path to blast tabular file with read id and reference '
             'hit id.')
    parser.add_argument('-t', '--taxonomy', default=None,
        help='Path to taxonomy with reference id and 7-level taxonomy '
             'sensu Silva.')
    parser.add_argument('-l', '--lca', default=None,
        help='Path to where the output lca consensus taxonomy should '
             'be written.')
    parser.add_argument('-c', '--pconsensus', type=float, default=0.5,
        help='Consensus threshold ranging between 0.5 and 0.99. Minimum '
             'fraction of hits that need to agree on each taxonomic '
             'classification at each taxonomic level. Majority obtained '
             'with 0.5. Taxonomic annotations for which there is no '
             'consensus are returned as NA.')
    args = parser.parse_args()

    main(
        blast = args.blast,
        taxonomy = args.taxonomy,
        lca = args.lca,
        pconsensus = args.pconsensus
    )
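Again a hypothetical invocation with illustrative paths:

    scripts/tolca.py -b results/run01/blastn/sampleA.blastn.out \
        -t db/common/ref-taxonomy.txt \
        -l results/run01/blastn/sampleA.blastn.taxlist -c 0.5

Finally, tomat.py, which converts the various classifier outputs into the taxonomy and OTU matrices used for plotting: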
__author__ = "Evelien Jongepier"
__copyright__ = "Copyright 2020, Evelien Jongepier"
__email__ = "e.jongepier@uva.nl"
__license__ = "MIT"

import sys
import re
import argparse
from collections import defaultdict


class ToMat(object):

    def __init__(self, taxlist, bblastlike, centrifugelike, krakenlike,
                 dbtaxonomy, dbfasta, dbmap):
        self.taxlist = taxlist
        self.bblastlike = bblastlike
        self.centrifugelike = centrifugelike
        self.krakenlike = krakenlike
        self.dbtaxonomy = dbtaxonomy
        self.dbfasta = dbfasta
        self.dbmap = dbmap

        if taxlist is not None:
            self.pathfile = taxlist
        elif bblastlike is not None:
            self.pathfile = bblastlike
        elif centrifugelike is not None:
            self.pathfile = centrifugelike
        elif krakenlike is not None:
            self.pathfile = krakenlike

        self.dir = '/'.join(self.pathfile.split('/')[:3])
        self.run = self.pathfile.split('/')[1]
        self.method = self.pathfile.split('/')[2]
        self.sample = self.pathfile.split('/')[3].split('.')[0]
        self.outpath = self.dir + '/' + self.sample + '.' + self.method

        ## if input is bblastlike then create otumat and taxmat from taxlist created here
        if any([bblastlike, centrifugelike, krakenlike]):
            self.taxlist = self.outpath + ".taxlist"
        return

    def get_reftax_dict(self):
        '''
        Example output: {refid : [Domain, Phylum, ...]}
        Create a dict of taxdb identifiers and taxdb taxonomy. The input
        file can be a taxonomy table or a fasta with the taxonomy in the
        header, depending on which one is provided.
        '''
        refdict = {}
        if self.dbtaxonomy is not None:
            with open(self.dbtaxonomy) as reftax:
                for l in reftax.readlines():
                    refid, taxstr = l.strip().split('\t')[:2]
                    taxstr = re.sub(r'D_[0-6]__', '', taxstr)
                    refdict[refid] = taxstr.split(';')[:6]
        elif self.dbfasta is not None:
            with open(self.dbfasta) as fasta:
                for l in fasta.readlines():
                    if l.startswith('>'):
                        line = l.strip().split('>')
                        refid = line[1].split(' ')[0]
                        taxstr = line[1].split(' ')[1]
                        refdict[refid] = taxstr.split(';')[:6]
        return refdict

    def get_krakentax_dict(self):
        '''
        Example output: {ncbiid : [silvaid1, silvaid2, ...]}
        For krakenlike input the ncbi refids should first be translated
        into silva refids. This creates a dictionary of silva to ncbi
        id. Each silvaid has only one ncbiid, but the same ncbiid can
        have multiple silvaids.
        '''
        krakendict = defaultdict(list)
        with open(self.dbmap) as map:
            for l in map.readlines():
                line = l.strip().split('\t')
                ncbiid = line[1]
                krakendict[ncbiid].append(line[0])
        return krakendict

    def get_krakenlike_read_dict(self):
        '''
        Example output: {readid : [refid1, refid2, ...]}
        For krakenlike input the ncbi refids in the input table should
        first be replaced by the silva refids. There are multiple silva
        ref ids associated with each readid, hence a dict of lists.
        '''
        with open(self.krakenlike) as f:
            readdict = defaultdict(list)
            krakendict = self.get_krakentax_dict()
            for l in f.readlines():
                if l.startswith("#") or not l.strip():
                    continue
                else:
                    readid, refrefid = l.strip().split('\t')[1:3]
                    if refrefid in krakendict:
                        readdict[readid].extend(krakendict[refrefid])
        return readdict

    def get_read_dict(self):
        '''
        Example output: {readid : [refid1, refid2, ...]}
        Read a bblastlike file and store reads as keys in a dictionary
        with taxids as value list. Some classifiers return multiple
        refids per read, hence the list (e.g. centrifuge, but not
        bestblast).
        '''
        readdict = defaultdict(list)
        with open(self.pathfile) as f:
            for l in f.readlines():
                if l.startswith("#") or not l.strip():
                    continue
                else:
                    readid, refid = l.strip().split('\t')[:2]
                    readdict[readid].append(refid)
        return readdict

    def parse_read_dict(self):
        '''
        Parse the correct read dict depending on the input data type.
        '''
        if self.krakenlike is not None:
            readdict = self.get_krakenlike_read_dict()
        else:
            readdict = self.get_read_dict()
        return readdict

    def get_taxlist_dict(self):
        '''
        Example output: {readid : [Domain, Phylum, Class, ...]}
        Read a bblastlike file and link taxdb identifiers to taxdb
        taxonomy in refdict to store in the taxlist dictionary.
        '''
        readdict = self.parse_read_dict()
        refdict = self.get_reftax_dict()
        taxdict = defaultdict(list)
        for readid in readdict:
            refidlist = readdict[readid]
            for refid in refidlist:
                if refid in refdict and len(refdict[refid]) > 5:
                    '''
                    If already in dict, compare for each tax level
                    whether there is consensus. If not, replace with NA.
                    '''
                    if readid in taxdict:
                        for level in range(5, -1, -1):
                            if taxdict[readid][level] == refdict[refid][level]:
                                break
                            else:
                                taxdict[readid][level] = "NA"
                    else:
                        taxdict[readid] = refdict[refid]
        return taxdict

    def write_taxlist(self):
        '''
        Example output: "readid \t Domain \t Phylum ..."
        Write taxlist to file, incl. header line.
        '''
        taxlist = self.get_taxlist_dict()
        with open(self.outpath + ".taxlist", 'w') as f:
            f.write('#readid' + '\t' + 'Domain' + '\t' + 'Phylum' + '\t' + 'Class'
                    + '\t' + 'Order' + '\t' + 'Family' + '\t' + 'Genus' + '\n')
            for readid in taxlist:
                tax = taxlist[readid]
                f.write(readid + '\t' + '\t'.join(tax) + '\n')

    def get_taxmat_dict(self):
        '''
        Example output: {taxstr: [Domain, Phylum, ...]}
        Read the taxlist, concatenate the taxonomy into a taxid, and
        create a dict with key taxid and value taxonomy.
        '''
        with open(self.taxlist) as f:
            taxdict = {}
            for l in f.readlines():
                if l.startswith("#") or not l.strip():
                    continue
                else:
                    line = l.strip().split('\t')
                    tax = line[1:7]
                    taxid = '_'.join(tax)
                    taxdict[taxid] = tax
        return taxdict

    def write_taxmat(self):
        '''
        Example output: "taxstr \t Domain \t Phylum ..."
        Write taxmat to file, incl. header line.
        '''
        with open(self.outpath + ".taxmat", 'w') as f:
            f.write('#taxid' + '\t' + 'Domain' + '\t' + 'Phylum' + '\t' + 'Class'
                    + '\t' + 'Order' + '\t' + 'Family' + '\t' + 'Genus' + '\n')
            for taxid in self.get_taxmat_dict():
                tax = self.get_taxmat_dict()[taxid]
                taxlab = re.sub(r'[_NA]*_NA$', '', taxid)
                taxstr = '\t'.join(tax)
                taxstr = re.sub(r'[\tNA]*\tNA$', '', taxstr)
                taxstr = re.sub(r'_', ' ', taxstr)
                f.write(taxlab + '\t' + taxstr + '\n')

    def get_otumat_dict(self):
        '''
        Example output: {taxstr: count}
        Read the taxlist, concatenate the taxonomy into a taxid, and
        create a dict with key taxid and value count, where count is
        the frequency with which each taxid occurred in the taxlist.
        '''
        with open(self.taxlist) as f:
            otudict = defaultdict(int)
            for l in f.readlines():
                if l.startswith("#") or not l.strip():
                    continue
                else:
                    line = l.strip().split('\t')
                    tax = line[1:7]
                    taxid = '_'.join(tax)
                    otudict[taxid] += 1
        return otudict

    def write_otumat(self):
        '''
        Example output: "taxstr \t count"
        Write otumat to file, incl. header line.
        '''
        with open(self.outpath + ".otumat", 'w') as f:
            f.write('#taxid' + '\t' + self.method + '_' + self.sample + '\n')
            for taxid in self.get_otumat_dict():
                count = self.get_otumat_dict()[taxid]
                taxlab = re.sub(r'[_NA]*_NA$', '', taxid)
                f.write(taxlab + '\t' + str(count) + '\n')


def main(taxlist, bblastlike, centrifugelike, krakenlike, dbtaxonomy, dbfasta, dbmap):
    m = ToMat(
        taxlist = taxlist,
        bblastlike = bblastlike,
        centrifugelike = centrifugelike,
        krakenlike = krakenlike,
        dbtaxonomy = dbtaxonomy,
        dbfasta = dbfasta,
        dbmap = dbmap
    )
    if any([bblastlike, centrifugelike, krakenlike]):
        m.write_taxlist()
    m.write_taxmat()
    m.write_otumat()
    return


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert output tables '
        'from classifiers in the NanoClass snakemake pipeline to a '
        'taxonomy and otu matrix. These serve to generate taxonomic '
        'barplots. If not given, the script also produces a taxlist, '
        'which serves as input to assign consensus and majority '
        'classifications.')
    parser.add_argument('-l', '--taxlist', default=None,
        help='Path to a file with read IDs in the first column, '
             'followed by at least 6 columns with tab-delimited 7-level '
             'taxonomic classifications sensu Silva.')
    parser.add_argument('-b', '--bblastlike', default=None,
        help='Path to a best-blast-like output file with the first '
             'column listing the read ID and the second column the '
             'taxonomic identifier.')
    parser.add_argument('-c', '--centrifugelike', default=None,
        help='Path to a centrifuge-like output file. The first column '
             'should contain the read ID and the second the taxonomic '
             'identifier.')
    parser.add_argument('-k', '--krakenlike', default=None,
        help='Path to a kraken-like output file. The second column '
             'should contain the read ID and the third the ncbi '
             'taxonomic identifier.')
    parser.add_argument('-t', '--dbtaxonomy', default=None,
        help='Path to the taxonomy of the database used to get '
             'taxonomic classifications. Only required in combination '
             'with --bblastlike.')
    parser.add_argument('-f', '--dbfasta', default=None,
        help='Path to the fasta of the database that was used to get '
             'taxonomic classifications. Fasta headers should contain '
             'the taxids and the taxonomy string. Only required in '
             'combination with --centrifugelike.')
    parser.add_argument('-m', '--dbmap', default=None,
        help='Path to the map of the database that was used to get '
             'taxonomic classifications. The map links the taxonomy in '
             'the third column of --krakenlike to the taxids in '
             '--dbfasta. Only required in combination with --krakenlike.')
    args = parser.parse_args()

    assert any([args.taxlist, args.bblastlike, args.centrifugelike,
                args.krakenlike]), '''
        \n Please provide the path to the taxlist or bblastlike.
        Read the help (--help) for further information.
        '''

    main(
        taxlist = args.taxlist,
        bblastlike = args.bblastlike,
        centrifugelike = args.centrifugelike,
        krakenlike = args.krakenlike,
        dbtaxonomy = args.dbtaxonomy,
        dbfasta = args.dbfasta,
        dbmap = args.dbmap
    )
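A hypothetical invocation on a taxlist (path illustrative):

    scripts/tomat.py -l results/run01/blastn/sampleA.blastn.taxlist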