gvclass: Taxonomic Classification of Giant Viruses and MAGs
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
version 0.9.4 March 7, 2023
Giant viruses are abundant and diverse and frequently found in environmental microbiomes. gvclass assigns taxonomy to putative giant virus contigs or metagenome assembled genomes ( GVMAGs ). It uses a conservative approach based on the consensus of single protein trees built from up to 9 giant virus orthologous groups ( GVOGs ). These 9 GVOGs are conserved across different lineages in the viral phylum Nucleocytoviricota.
Running gvclass
Requirements
-
Conda environment wih snakemake, check here: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
-
Input is a directory that contains single contigs or MAGs as nucleic acid (fna) or proteins (faa)
-
File extensions .fna or .faa
-
Recommended length for single contigs is 50kb, but at least 20kb
-
No special characters (".", ";", ":") in filebase name, "_" or "-" are okay
-
No whitespace in sequence headers
-
Recommended sequence header format if faa provided: |
-
Input will be checked and reformatted if necessary
IMG/VR
- Upload you metagenome assembled genome or single contig to IMG/VR using the gvclass feature
Snakemake workflow
- Clone the repository
git clone https://github.com/NeLLi-team/gvclass
- Test gvclass using the provided giant virus assemblies
snakemake -j 24 --use-conda --config querydir="example"
- If this completes successfully, run it using your own directory of query genomes
snakemake -j <number of processes> --use-conda --config querydir="<path to query dir>"
Docker / Shifter
- Will be provided soon
Interpretation of the results
- The classification result is summarized in a tab separated file <query name>.summary.tab
Taxonomy assignments
-
Taxonomy assignments are provided on different taxonomic levels.
-
To yield an assignments all nearest neighbors in GVOG phylogenetic tree have to be in agreement.
-
Depending on the number of identified unique GVOGs an assignment is provided if at least 1 GVOG (stringency "gte1"), 2 GVOGs (stringency "gte2") or 3 GVOGs (stringency "gte3") were found.
-
Less stringent is the "majority" approach, here more than 50% of identified markers have to yield the same taxonomy to enable and assignment.
-
If taxonomy assignments are not in agreement at a low taxonomy level (e.g. species, genus, or family) then the next higher taxonomy level will be evaluated, up to the domain level.
-
March 2023: Added experimental feature xgboost classifier based on kmers. Provides assignment of provided sequences (fna) to cellular domains, phages or NCLDV together with order level assignment (if NCLDV).
Contamination
-
Giant virus genomes typically have less than 10 out of a set of 56 universal cellular housekeeping genes (UNI56). Higher UNI56 counts indicate cellular contamination, or giant virus sequences that are located on host contigs.
- UNI56u (unique counts), UNI56t(total counts), UNI56df (duplication factor) are provided and can be used for further quality filtering
-
Giant virus genomes typically have a duplication factor of GVOG7 (as subset of GVOG9) of below 3. Higher GVOG7 duplication factors indicate the presence mixed viral populations.
- GVOG7u (unique counts), GVOG7t(total counts), GVOG7df (duplication factor) are provided and can be used for further quality filtering
-
Giant viruses may break any of these rules, thus, gvclass does not perform automatic quality filtering based on marker gene counts.
Benchmarking
- Will be provided soon
Citation
- Will be provided soon
References
Acknowledgements
GVClass was developed by the New Lineages of Life Group at the DOE Joint Genome Institute supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231.
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | import click from Bio import SeqIO import pandas as pd import os @click.command() @click.option('--input', '-i', type=click.Path(exists=True), help='Input FASTA file') @click.option('--output', '-o', type=click.Path(), help='Output FASTA file') @click.option('--stats', '-s', type=click.Path(), help='Genome stats outfile') def main(input, output, stats): # read in the input file records = list(SeqIO.parse(input, "fasta")) # modify the sequence IDs and descriptions protcount = 0 for record in records: protcount += 1 if "|" in record.id: record.id = record.id.split("|")[-1] record.id = f"{os.path.splitext(os.path.basename(input))[0]}|{record.id}" record.description = "" # write the reformatted sequences to the output file SeqIO.write(records, output, "fasta") # calculate statistics and write them to the stats file stats_dict = { 'query': os.path.splitext(os.path.basename(input))[0], 'contigs': 'no_fna', 'LENbp': 'no_fna', 'GCperc': 'no_fna', 'genecount': protcount, 'CODINGperc': 'no_fna', 'ttable': 'no_fna' } pd.DataFrame.from_dict(stats_dict, orient='index').T.to_csv(stats, sep="\t", index=False) if __name__ == '__main__': main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 | import click import os import sys import shutil import glob from subprocess import call from Bio import SeqIO import pandas as pd def calc_stats(fnain, faain): """ Calculate coding density, assembly size, GC% """ cumulative_gc = 0 cumulative_len = 0 cumulative_len_aa = 0 gene_count = 0 contigs = 0 for seq_record in SeqIO.parse(fnain, "fasta"): cumulative_gc += seq_record.seq.upper().count('G') + seq_record.seq.upper().count('C') cumulative_len += len(seq_record.seq) contigs += 1 for seq_record in SeqIO.parse(faain, "fasta"): cumulative_len_aa += len(seq_record.seq) gene_count += 1 return [contigs, cumulative_len, round(cumulative_gc / cumulative_len * 100, 2), gene_count, round(cumulative_len_aa * 3 / cumulative_len * 100, 2)] def check_filename(fnafile): """ Input has to be .fna with no additional . in filename """ if len(fnafile.split(".")) > 2: print("fnafile name contains additional '.'") sys.exit() if fnafile.split(".")[1] != "fna": print("fnafile ending needs to be .fna") sys.exit() def rename_header(bestcode, finalfaa, gffout): """ Rename sequence header to ><filename>|<proteinid> """ records = [] infilebase = os.path.splitext(os.path.basename(bestcode))[0] for seq_record in SeqIO.parse(bestcode, "fasta"): headerbase = seq_record.description.split("|")[0] if headerbase != infilebase: seq_record.id = f"{infilebase}|{str(seq_record.id).split()[0]}" seq_record.description = "" records.append(seq_record) elif headerbase == infilebase: seq_record.id = str(seq_record.id).split()[0] seq_record.description = "" records.append(seq_record) SeqIO.write(records, finalfaa, "fasta") # keep only best gff shutil.copy(gffout + "_code" + bestcode.split("_code")[-1], gffout) def run_genecalling_codes(fnafile, code, gffout): """ Genecalling with prodigal and defined tt """ faaout = gffout.replace(".gff", ".faa") if code > 0: run_command = f"prodigal -n -q -f gff -i {fnafile} -a {faaout}_code{code} -o {gffout}_code{code} -g {code}" call(run_command, shell=True) else: # output for snakemake with -p meta run_command = f"prodigal -n -q -f gff -p meta -i {fnafile} -a {faaout}_codemeta -o {gffout}" call(run_command, shell=True) shutil.copy(gffout, gffout + "_codemeta") @click.command() @click.option('--fnafile', '-f', type=click.Path(exists=True), help='Input FNA file') @click.option('--gffout', '-g', type=click.Path(), help='Output GFF file') @click.option('--genecalling_statsout', '-gs', '--genecalling-statsout', type=click.Path(), help='Genecalling statistics file') @click.option('--summary_statsout', '-ss', type=click.Path(), help='Genome stats outfile') @click.option('--finalfaa', '-fa', type=click.Path(), help='Output FASTA file') def main(fnafile, gffout, genecalling_statsout, summary_statsout, finalfaa): # check input check_filename(fnafile) # gene calling # create output for snakemake codes = [0, 1, 4, 6, 11] # pass these from snakemake config instead for code in codes: run_genecalling_codes(fnafile, code, gffout) stats_dict = {} faaoutdir = os.path.dirname(gffout) for faain in glob.glob(os.path.join(faaoutdir, "*.faa_code*")): genome_id = os.path.basename(faain) stats_dict[genome_id] = calc_stats(fnafile, faain) # gene calling stats to select ttable that yields highest coding density stats_df = pd.DataFrame.from_dict(stats_dict, columns=["contigs", "LENbp", "GCperc", "genecount", "CODINGperc"], orient="index") stats_df = stats_df.sort_values("CODINGperc", ascending=False) stats_df["ttable"] = stats_df.index.map(lambda x: x.split("_")[-1]) stats_df.insert(0, "query", stats_df.index.map(lambda x: os.path.splitext(x)[0])) bestcode = os.path.join(faaoutdir, stats_df.index[0]) # genome id with code appended that had highest coding density stats_df.to_csv(genecalling_statsout, sep="\t", index=None) stats_df.head(1).to_csv(summary_statsout, sep="\t", index=None) # rename headers and cleanup rename_header(bestcode, finalfaa, gffout) for tempout in glob.glob(f"{gffout}_code*"): os.remove(tempout) faaout = tempout.replace(".gff", ".faa") if os.path.isfile(faaout): os.remove(faaout) if __name__ == '__main__': main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 | import click import pandas as pd import pickle from Bio import SeqIO from Bio.Seq import Seq from sklearn.preprocessing import LabelEncoder def get_label_mapping(mapping_file): """Returns a dictionary mapping encoded labels to their original values""" with open(mapping_file) as f: return {int(line.split('\t')[0]): line.strip().split('\t')[1] for line in f} def get_best_kmers(kmers_file): """Returns a list of the best kmers""" with open(kmers_file) as f: return [line.strip() for line in f.readlines()] def count_kmers(kmers, fna): """Counts the occurrences of kmers in a given FASTA file""" kmer_counts_genome_dict = {x:0 for x in kmers} total_seqs = sum(1 for _ in SeqIO.parse(fna, "fasta")) with click.progressbar(SeqIO.parse(fna, "fasta"), label='Counting kmers', length=total_seqs) as records: for seq_record in records: for kmer in kmers: kmer_counts_genome_dict[kmer] += Seq(seq_record.seq).count_overlap(kmer) return kmer_counts_genome_dict def predict(model, kmers, labels, queryseq): # Load XGBoost model xgbmodel = pickle.load(open(model, 'rb')) # Load label mapping inv_mapping = get_label_mapping(labels) # Load and preprocess query sequence counts kmers_list = get_best_kmers(kmers) kmer_count_all_dict = {queryseq.split("/")[-1].split(".f")[0]: count_kmers(kmers_list, queryseq)} df_qcounts = pd.DataFrame.from_dict(kmer_count_all_dict).T.fillna(0) df_qcounts = df_qcounts.div(df_qcounts.sum(axis=1), axis=0) df_qcounts = df_qcounts.reset_index().drop(['index'], axis=1) # Ensure correct order of features cols_when_model_builds = xgbmodel.get_booster().feature_names df_qcounts = df_qcounts[cols_when_model_builds] # Make prediction and print result y_pred = xgbmodel.predict(df_qcounts.head(1)) prediction = inv_mapping[y_pred[0]] prob_val = xgbmodel.predict_proba(df_qcounts.head(1)) print (prob_val) proba_dict = {inv_mapping[x]:[prob_val[0,x]] for x,y in inv_mapping.items()} return prediction, proba_dict @click.command() @click.option('--queryseq', '-q', type=click.Path(exists=True), required=True, help='Path to query sequence') @click.option('--model', '-m', type=click.Path(exists=True), required=True, help='Path to prebuilt model') @click.option('--kmers', '-k', type=click.Path(exists=True), required=True, help='Path to kmer file') @click.option('--labels', '-l', type=click.Path(exists=True), required=True, help='Path to labels') @click.option('--out', '-o', type=click.Path(exists=False), required=True, help='Path to prediction output') def main(queryseq, model, kmers, labels, out): # get domain level prediction prediction_domain, proba_dict_domain = predict(model, kmers, labels, queryseq) # if NCDLV then also get order level prediction if prediction_domain == "NCLDV": model_order = model.replace("domain", "order") kmers_order = kmers.replace("domain", "order") labels_order = labels.replace("domain", "order") prediction_order, proba_dict_order = predict(model_order, kmers_order, labels_order, queryseq) else: prediction_order = "" proba_dict_order = {} with open(out, "w") as outfile: outfile.write(f"{queryseq.split('/')[-1].split('.f')[0]}\t{prediction_order}___{prediction_domain}") proba_dict = {**proba_dict_domain, **proba_dict_order} df_proba = pd.DataFrame(proba_dict) df_proba["query"] = queryseq.split('/')[-1].split('.f')[0] df_proba = df_proba.set_index("query") #print (df_proba) #df_proba.to_csv("test.tab", sep="\t", index=None) df_proba.to_csv(f"{out}.proba", sep="\t", index=True) if __name__ == "__main__": main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | import click import subprocess import sys import pandas as pd def run_cmd(cmd): """Runs a command in the shell and prints standard output and error""" sp = subprocess.Popen(cmd, shell=True) std_out, std_err = sp.communicate() print('std_err: ', std_err) print('std_out: ', std_out) def get_models(modelsin): """ Generate list of all model names Some models might have no hits in hmmout, but they should be displayed in count matrix """ models = [] with open(modelsin, "r") as f: models = [line.split()[-1].rstrip() for line in f if line.startswith("NAME")] return models def get_markercompleteness(models, hmmout, query): """Returns a dictionary of marker count for each model""" count_dict = {x: 0 for x in models} seen = [] with open(hmmout, "r") as f: lines = [line.rstrip() for line in f if not line.startswith("#")] for line in lines: if line.split()[0] not in seen: count_dict[line.split()[3]] += 1 seen.append(line.split()[0]) count_dict = {query: count_dict} return count_dict @click.command() @click.option('--queryfaa', '-q', type=click.Path(exists=True), help='Input query FASTA file') @click.option('--modelscombined', '-m', type=click.Path(exists=True), help='Combined HMM model file') @click.option('--hmmout', '-h', type=click.Path(), help='Output file for HMM hits') @click.option('--target', '-t', type=str, help='Target database (UNI56 or UNIREF90)') @click.option('--countout', '-c', type=click.Path(), help='Output file for marker gene counts') def main(queryfaa, modelscombined, hmmout, target, countout): ### hmmsearch ### cutoff = "-E 1e-10" if target == "UNI56": cutoff = "--cut_ga" hmmsearch = ["hmmsearch" + " --noali" + " " + cutoff + " " + " --domtblout " + hmmout + " --cpu 4 " + modelscombined + " " + queryfaa] run_cmd(hmmsearch) ### get counts ### query = queryfaa.split("/")[-1].split(".")[0] models = get_models(modelscombined) count_dict = get_markercompleteness(models, hmmout, query) pd.DataFrame.from_dict(count_dict).T.to_csv(countout, sep="\t") if __name__ == '__main__': main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | import os from Bio import SeqIO import shutil from collections import defaultdict import click @click.command() @click.option('--hmmout', '-h', required=True, help='Path to HMM output file.') @click.option('--queryfaa', '-q', required=True, help='Path to query protein FASTA file.') @click.option('--queryhitsfaa', '-o', required=True, help='Path to output file containing hits.') def main(hmmout, queryfaa, queryhitsfaa): outmodel = os.path.splitext(os.path.basename(queryhitsfaa))[0] hits_model_dict = get_qhits(hmmout, outmodel) export_hitsfaa(hits_model_dict, queryfaa, queryhitsfaa) def export_hitsfaa(hits_model_dict, queryfaa, queryhitsfaa): query_ids = set() with click.progressbar(hits_model_dict.values(), label='Exporting hits to file') as bar1: for hits in bar1: query_ids |= set(hits) with open(queryhitsfaa, "w") as outfile: with click.progressbar(SeqIO.parse(queryfaa, "fasta"), label='Writing hits to file') as bar2: for seq_record in bar2: if seq_record.description in query_ids: SeqIO.write(seq_record, outfile, "fasta") def get_qhits(hmmout, outmodel): hits_model_dict = defaultdict(set) with open(hmmout, "r") as infile: with click.progressbar(infile, label='Parsing HMM output') as bar: for line in bar: if not line.startswith("#") and outmodel in line: hits_model_dict[line.split()[3]].add(line.split()[0]) return hits_model_dict if __name__ == '__main__': main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | import click import subprocess from Bio import SeqIO from collections import defaultdict import os def run_cmd(cmd): sp = subprocess.Popen(cmd, #stderr=subprocess.PIPE, #stdout=subprocess.PIPE, shell=True) std_out, std_err = sp.communicate() print('std_err: ', std_err) print('std_out: ', std_out) def run_blastp(queryfaa, refdb, blastpout): # no evalue set, what is the default? dblastp = ["diamond blastp --threads 8 " "--quiet --outfmt 6 " "--more-sensitive --query-cover 30 " "--subject-cover 30 " "-e 1e-10 " "-d {} -q {} -o {}".format(refdb, queryfaa, blastpout)] run_cmd(dblastp) def parse_blastp(blastpout): queryids_hits_dict = defaultdict(list) seen = [] # there could be multiple queries try: num_queries = len(set([line.split()[0] for line in open(blastpout)])) with open(blastpout, "r") as infile: for line in infile: queryname = line.split()[0] subjectname = line.split()[1] pident = float(line.split()[2]) # Yield equal number of subject ids per query if subjectname not in seen and len(queryids_hits_dict[queryname]) <= 100/num_queries and pident!=100: queryids_hits_dict[queryname].append(subjectname) seen.append(subjectname) except: pass # single list with all 100 top subject ids besthits = [x for v in queryids_hits_dict.values() for x in v] return besthits def reduce_ref(queryfaa, reffaa, reduced_reffaa_merged, refdb, blastpout): run_blastp(queryfaa, refdb, blastpout) besthits = parse_blastp(blastpout) seenids = [] outseqs = [] for seq_record in SeqIO.parse(queryfaa, "fasta"): if seq_record.id not in seenids: outseqs.append(seq_record) seenids.append(seq_record.id) for seq_record in SeqIO.parse(reffaa, "fasta"): if seq_record.description in besthits and seq_record.id not in seenids: outseqs.append(seq_record) seenids.append(seq_record.id) SeqIO.write(outseqs, reduced_reffaa_merged, "fasta") @click.command() @click.option('--queryfaa', '-q', required=True, help='Query sequences from hmmsearch for a single marker') @click.option('--reffaa', '-r', required=True, help='Ref seqs for a single marker') @click.option('--refdb', '-d', required=True, help='Diamond formatted db for a single marker') @click.option('--blastpout', '-b', required=True, help='Output file name for diamond blastp') @click.option('--reduced_reffaa_merged', '-o', required=True, help='Output file name for reduced reference sequences') def main(queryfaa, reffaa, refdb, blastpout, reduced_reffaa_merged): if os.path.getsize(queryfaa) > 0: reduce_ref(queryfaa, reffaa, reduced_reffaa_merged, refdb, blastpout) else: print (f"{queryfaa} is empty, no hits, omitting blast") if __name__ == '__main__': main() |
1
of
scripts/04blastp_reduce_merge.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | import subprocess import click def run_cmd(cmd): sp = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE, shell=True) (std_out, std_err) = sp.communicate() print('std_err: ', std_err) print('std_out: ', std_out) @click.command() @click.option("--queryfaamerged", "-q", required=True, type=click.Path(exists=True), help="Input fasta file containing merged fasta sequences") @click.option("--alnout", "-a", required=True, type=click.Path(), help="Output fasta file for aligned sequences") @click.option("--trimout", "-t", required=True, type=click.Path(), help="Output fasta file for trimmed sequences") def main(queryfaamerged, alnout, trimout): mafftaln = f"mafft --quiet --thread 4 {queryfaamerged} > {alnout}" trimaln = f"trimal -gt 0.1 -in {alnout} -out {trimout}" for cmd in [mafftaln, trimaln]: run_cmd(cmd) if __name__ == "__main__": main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | import click import subprocess def run_cmd(cmd): sp = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE, shell=True) (std_out, std_err) = sp.communicate() print('std_err: ', std_err) print('std_out: ', std_out) @click.command() @click.option("--aln", "-a", type=click.Path(exists=True), help='Alignment file in FASTA format') @click.option("--treeout", "-t", type=click.Path(), help='Output file for the tree') def main(aln, treeout): runtree = f"fasttree -lg < {aln} > {treeout}" run_cmd(runtree) if __name__ == "__main__": main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 | import glob from collections import Counter from Bio import AlignIO import click def get_nn(aln, query): hd_dict_paralogs = {} for qseq in AlignIO.read(aln, "fasta"): if qseq.id.split("|")[0] == query: hd_dict = {} for rseq in AlignIO.read(aln, "fasta"): if rseq.id.split("|")[0] != query: hd = hamming_distance(qseq.seq, rseq.seq) if hd > 30: hd_dict[rseq.id] = hd nn = max(hd_dict, key=hd_dict.get) hd_dict_paralogs[nn + ";;;" + qseq.id] = hd_dict[nn] return hd_dict_paralogs def flattenkeys(dict1, labels_dict, taxlevel): if taxlevel == "order": level = 3 elif taxlevel == "family": level = 2 elif taxlevel == "genus": level = 1 dictlist = [dict1[model] for model in dict1] flattened = list({k: v for d in dictlist for k, v in d.items()}.keys()) return [labels_dict[tax.split("|")[0]][level] for tax in flattened] def hamming_distance(qseq, rseq): """ get hamming distance between two aligned sequences """ # return id% hd = (sum(aa1 != aa2 for aa1, aa2 in zip(qseq, rseq))) return hd @click.command() @click.option("--query", "-q", required=True) @click.option("--aln_outpath", "-a", required=True) @click.option("--outname", "-o", required=True) @click.option("--ncldv_labels", "-l", required=True) def main(query, aln_outpath, outname, ncldv_labels): hd_dict_all = {} hd_dict_all_best = {} for aln in glob.glob(aln_outpath + "*.mafft01"): modelname = aln.split("/")[-1].split(".")[0] try: hd_dict_all[modelname] = get_nn(aln, query) hd_dict_all_best[modelname] = {max(hd_dict_all[modelname], key=hd_dict_all[modelname].get) : hd_dict_all[modelname][max(hd_dict_all[modelname], key=hd_dict_all[modelname].get)]} except: pass labels_dict = {} with open(ncldv_labels, "r") as infile: for line in infile: taxid, lineage = line.strip().split("\t") labels_dict[taxid] = lineage.split("|") final_aln_dict = Counter(flattenkeys(hd_dict_all_best, labels_dict, "order")) aln_result = "|".join([f"{x}:{y}" for x, y in final_aln_dict.items()]) with open(outname, "w") as outfile: if len(aln_result) > 0: outfile.write(f"{query}\t{aln_result}\taln\n") else: outfile.write(f"{query}\tno_hits\taln\n") for model, value in hd_dict_all.items(): value = dict(sorted(value.items(), key=lambda item: item[1], reverse=True)) outfile.write("\n".join([f"{model}\t{k.split(';;;')[1]}\t{k.split(';;;')[0]}\t{'|'.join(labels_dict[k.split('|')[0]])}\t{v}" for k, v in value.items()]) + "\n") if __name__ == "__main__": main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 | import glob import click from ete3 import Tree from collections import defaultdict from collections import Counter import pandas as pd def flattenkeys_to_list(dict1, labels_dict, taxlevel): if taxlevel=="order": level = 3 elif taxlevel=="family": level = 2 elif taxlevel=="genus": level = 1 dictlist = [ dict1[model] for model in dict1.keys() ] flattened = list({k: v for d in dictlist for k, v in d.items()}.keys()) return [labels_dict[tax.split("|")[0]][level] for tax in flattened] def unpack_family(node, leaves, query, seen): """ get all leaves under a parental node """ children = node.get_children() for child in children: childname = str(child).replace("--","").replace("/-", "").replace("\-", "").replace("\n","") if child.is_leaf() and childname.split("|")[0] != query: seen.append(childname) leaves.append(childname) elif not child.is_leaf() and childname != query and childname not in seen: seen.append(childname) unpack_family(child, leaves, query, seen) elif child.is_leaf() and childname == query: seen.append(childname) else: pass if len(leaves)==0: parentnode = node.up unpack_family(parentnode, leaves, query, seen) return leaves @click.command() @click.option('-q', '--query', required=True, help='Query to search in the tree.') @click.option('-t', '--tree_outpath', required=True, help='Path to the tree directory.') @click.option('-o', '--outname', required=True, help='Output file name.') @click.option('-l', '--ncldv_labels', required=True, help='Path to the NCBI taxonomy labels file.') def main(query, tree_outpath, outname, ncldv_labels): def get_closestrelative(query, leaves, node, tree): distance = 10 closestrelative = "nd" for child in leaves: if child != query: newdist = t.get_distance(child, query, topology_only = False) if newdist < distance: distance = newdist closestrelative = child else: pass return closestrelative, distance def get_neighbor(t, modelname, query): tree_dict_paralogs = {} # include nearest neighbor for all paralogs for node in t.traverse(): if node.name.split("|")[0] == query: queryintree = node.name print (queryintree) # move up one node in the tree # collect all terminal leaves under the parent node leaves = [] seen = [] leaves = unpack_family(node, leaves, query, seen) print (leaves) closestrelative, distance = get_closestrelative(queryintree, leaves, node, t) print (closestrelative) tree_dict_paralogs[closestrelative + ";;;" + queryintree] = distance #closestrelativep = min(tree_dict_paralogs, key=tree_dict_paralogs.get) # get only ref that is most similar from all paralogs = shorted distance #return {closestrelativep : tree_dict_paralogs[closestrelativep]} # return only ref that is most similar return tree_dict_paralogs labels_dict = {} with open(ncldv_labels, "r") as infile: for line in infile: taxid = line.split("\t")[0] lineage = line.split("\t")[1].replace("\n","").split("|") labels_dict[taxid] = lineage # all distances including paralogs tree_dict_all = {} # dict with only single paralog that had shortest distance to reference tree_dict_all_best = {} for ftree in glob.glob(tree_outpath + "/*.FTWAG"): try: t = Tree(ftree) modelname = ftree.split("/")[-1].split(".")[0] print (f"{modelname}") tree_dict_all[modelname] = get_neighbor(t, modelname, query) print (f"{tree_dict_all}") tree_dict_all_best[modelname] = {min(tree_dict_all[modelname], key=tree_dict_all[modelname].get) : tree_dict_all[modelname][min(tree_dict_all[modelname], key=tree_dict_all[modelname].get)]} except: print (f"Something wrong with {ftree}") pass # this is the summary result, show number of best hits per tax string # cutoff for order: tree dist 2, aln 30 idperc --> needs further evaluation print (f"{tree_dict_all_best}@@@@@@@@{query}") final_tree_dict = Counter(flattenkeys_to_list(tree_dict_all_best, labels_dict, "order")) tree_result = "|".join([":".join([str(x) for x in pair]) for pair in list(final_tree_dict.items())]) with open(outname, "w") as outfile: if len(tree_result) > 0: outfile.write(f"{query}\t{tree_result}\ttree\n") else: outfile.write(f"{query}\tno_hits\ttree\n") for model, value in tree_dict_all.items(): # print lowest distance first value = dict(sorted(value.items(), key=lambda item: item[1], reverse=False)) #outfile.write("\n".join(f"{model}\t{k.split(';;;')[1]}\t{k.split(';;;')[0]}\t{'|'.join(labels_dict[k.split('|')[0]])}\t{v}" for k,v in value.items()) + "\n") outfile.write("\n".join([ f"{model}\t{k.split(';;;')[1]}\t{k.split(';;;')[0]}\t{'|'.join(labels_dict[k.split('|')[0]])}\t{v}" for k, v in value.items()]) + "\n") if __name__ == "__main__": main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 | import click import pandas as pd import glob from collections import Counter def parse_result(tabout): results = [] with open(tabout) as infile: for line in infile: line = line.strip().split() if line[0].startswith("GVOG"): if line[0] == "GVOGm0461" and float(line[-1]) < 1.8: results.append(line) elif line[0] != "GVOGm0461" and float(line[-1]) < 2.2: results.append(line) else: print(f"{line[0]} tree distance to neighbor above threshold, hit removed") return results def process_gvog9(gvog9_count): try: gvog9_models = ["GVOGm0003","GVOGm0013","GVOGm0022","GVOGm0023","GVOGm0054","GVOGm0172","GVOGm0461","GVOGm0760","GVOGm0890"] gvog7_models = ["GVOGm0013","GVOGm0023","GVOGm0054","GVOGm0172","GVOGm0461","GVOGm0760","GVOGm0890"] gvogsout_df = pd.read_csv(gvog9_count, sep="\t", index_col=0) gvogsout_df['GVOG7u'] = (gvogsout_df[gvog7_models] > 0).sum(axis=1) gvogsout_df['GVOG7t'] = (gvogsout_df[gvog7_models]).sum(axis=1) gvogsout_df['GVOG9u'] = (gvogsout_df[gvog9_models] > 0).sum(axis=1) gvogsout_df['GVOG9t'] = (gvogsout_df[gvog9_models]).sum(axis=1) gvogsout_df['GVOG7df'] = gvogsout_df['GVOG7t'] / gvogsout_df['GVOG7u'] if gvogsout_df['GVOG7t'].any() else 0 return list(gvogsout_df.GVOG9u)[0], list(gvogsout_df.GVOG9t)[0], list(gvogsout_df.GVOG7u)[0], list(gvogsout_df.GVOG7t)[0], list(gvogsout_df.GVOG7df)[0], list(gvogsout_df.GVOGm0003)[0] except: return 0,0,0,0,0,0 def process_uni56(uni56_count): try: UNI56out_df = pd.read_csv(uni56_count, sep="\t", index_col=0) UNI56out_models = list(UNI56out_df.columns) UNI56out_df['UNI56u'] = (UNI56out_df[UNI56out_models] > 0).sum(axis=1) UNI56out_df['UNI56t'] = (UNI56out_df[UNI56out_models]).sum(axis=1) UNI56out_df['UNI56df'] = UNI56out_df['UNI56t'] / UNI56out_df['UNI56u'] if UNI56out_df['UNI56t'].any() else 0 return list(UNI56out_df.UNI56u)[0], list(UNI56out_df.UNI56t)[0], list(UNI56out_df.UNI56df)[0] except: return 0, 0, 0 def most_frequent(taxstrings, taxlevel): freq = Counter(taxstrings) if len(freq) == 1: if len(taxstrings) >= 2: return taxstrings[0] else: return "_" most_common = freq.most_common(1)[0] if most_common[1] > len(taxstrings) // 2: return most_common[0] elif taxlevel == "domain": return "-".join(sorted(list(set(taxstrings)))) else: return "_" def tax_annotation(row, level_other, level_ncldv): if row["subject"].split("__")[0] in ["EUK", "ARC", "BAC", "PHAGE"]: return row["subject"].split("__")[0] + "__" + row["taxannot"].split("|")[level_other] else: return row["taxannot"].split("|")[level_ncldv] def tax_domains(row): if row["subject"].split("__")[0] in ["EUK", "ARC", "BAC", "PHAGE"]: return row["subject"].split("__")[0] elif row["taxannot"].split("|")[5] == "Nucleocytoviricota": return "NCLDV" else: return "CONFLICT" def get_final_tax(df, query, stringency_s): df_q = df[df.queryname == query] tax_levels = ["species", "genus", "family", "order", "class", "phylum", "domain"] tax_lists = [list(df_q[tax]) for tax in tax_levels] final_tax = [query] if stringency_s == "majority": for level, lst in zip(tax_levels, tax_lists): final_tax.append(f"{level[0]}_{most_frequent(lst, level)}") else: stringency = int(stringency_s.replace("gte", "")) for level, lst in zip(tax_levels, tax_lists): if len(set(lst)) == 1: final_tax.append(f"{level[0]}_{lst[0]}") elif len(lst) >= stringency: final_tax.append(f"{level[0]}__") else: final_tax.extend(["missing_markers"] * (7 - len(final_tax))) break else: final_tax.extend(["missing_markers"] * (7 - len(final_tax))) if len(set(tax_lists[-1])) == 1: final_tax[-1] = f"{tax_levels[-1][0]}_{tax_lists[-1][0]}" else: for lst, domain in zip(tax_lists[:-1], df_q["domain"]): if domain == "EUK" and "NCLDV" in lst: final_tax[-1] = "d_NCLDV-EUK" break elif domain == "BAC" and "NCLDV" in lst: final_tax[-1] = "d_NCLDV-BAC" break elif domain == "ARC" and "NCLDV" in lst: final_tax[-1] = "d_NCLDV-ARC" break elif domain == "ARC" and "BAC" in lst: final_tax[-1] = "d_ARC-BAC" break elif domain == "ARC" and "EUK" in lst: final_tax[-1] = "d_ARC-EUK" break elif domain == "BAC" and "EUK" in lst: final_tax[-1] = "d_BAC-EUK" break elif domain == "PHAGE" and "NCLDV" in lst: final_tax[-1] = "d_NCLDV-PHAGE" break else: final_tax[-1] = "d__" final_tax.append(stringency_s) final_tax.append(df_q["distance"].mean()) return final_tax def summarize(df): try: # Split the query column and extract the queryname df["queryname"] = df["query"].str.split("|", expand=True)[0] df["species"] = df.apply(lambda row: tax_annotation(row, 6, -1), axis=1) df["genus"] = df.apply(lambda row: tax_annotation(row, 5, 1), axis=1) df["family"] = df.apply(lambda row: tax_annotation(row, 4, 2), axis=1) df["order"] = df.apply(lambda row: tax_annotation(row, 3, 3), axis=1) df["class"] = df.apply(lambda row: tax_annotation(row, 2, 4), axis=1) df["phylum"] = df.apply(lambda row: tax_annotation(row, 1, 5), axis=1) df["domain"] = df.apply(tax_domains, axis=1) df = df[["queryname","species", "genus","family","order", "class", "phylum", "domain", "GVOG", "distance"]] # Get the top hit df_tophit = df.drop_duplicates(subset=['GVOG']) # Get results for different stringencies stringencies = ["gte1", "gte2", "gte3", "majority"] allresults = [get_final_tax(df_tophit, list(df.queryname)[0], stringency) for stringency in stringencies] # Create results dataframe df_results = pd.DataFrame(allresults, columns=["query", "species", "genus", "family", "order", "class", "phylum", "domain", "stringency", "avgdist"]) return df_results except: pass @click.command() @click.option('-n', '--nn_tree', required=True, help='Path to the nearest neighbor tree file.') @click.option('-g', '--gvog9_count', required=True, help='Path to the GVOG9 count file.') @click.option('-u', '--uni56_count', required=True, help='Path to the UNI56 count file.') @click.option('-q', '--querystats', required=True, help='Path to the query stats file.') @click.option('-c', '--xgb_out', required=True, help='Path to the classifier output file.') @click.option('-s', '--summary_out', required=True, help='Path to the output summary file.') def main(nn_tree, gvog9_count, uni56_count, querystats, xgb_out, summary_out): try: query = summary_out.split("/")[-1].split(".")[0] treehits = [] treehits.extend(parse_result(nn_tree)) querystats_df = pd.read_csv(querystats, sep="\t") with open(xgb_out) as f: prediction = [line.split('\t')[1].strip() for line in f if line.strip()] if len(treehits) > 0: df_tree = pd.DataFrame(treehits, columns=["GVOG", "query", "subject", "taxannot", "distance"]) df_tree["distance"] = df_tree["distance"].astype(float) df_results_tree = summarize(df_tree) GVOG9u, GVOG9t, GVOG7u, GVOG7t, GVOG7df, MCP = process_gvog9(gvog9_count) UNI56u, UNI56t, UNI56df = process_uni56(uni56_count) df_results_tree["GVOG9u"] = GVOG9u df_results_tree["GVOG9t"] = GVOG9t df_results_tree["GVOG7u"] = GVOG7u df_results_tree["GVOG7t"] = GVOG7t df_results_tree["GVOG7df"] = GVOG7df df_results_tree["MCP"] = MCP df_results_tree["UNI56u"] = UNI56u df_results_tree["UNI56t"] = UNI56t df_results_tree["UNI56df"] = UNI56df if len(prediction)>0: df_results_tree["xgb"] = prediction[0] else: df_results_tree["xgb"] = "no_fna" df_results_tree = pd.merge(df_results_tree, querystats_df, on='query') df_results_tree.to_csv(summary_out, sep="\t", index=False) elif len(prediction)>0: allresults = [[query, "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", prediction[0], "no_hits", "missing_markers", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"]] cols = ["query", "species", "genus", "family", "order", "class", "phylum", "domain", "xgb", "stringency", "avgdist", "GVOG9u", "GVOG9t", "GVOG7u", "GVOG7t", "GVOG7df", "MCP", "UNI56u", "UNI56t", "UNI56df"] df_results_tree = pd.DataFrame(allresults, columns=cols) df_results_tree = pd.merge(df_results_tree, querystats_df, on='query') df_results_tree.to_csv(summary_out, sep="\t", index=False) else: allresults = [[query, "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "no_fna", "no_hits", "missing_markers", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"]] cols = ["query", "species", "genus", "family", "order", "class", "phylum", "domain", "xgb", "stringency", "avgdist", "GVOG9u", "GVOG9t", "GVOG7u", "GVOG7t", "GVOG7df", "MCP", "UNI56u", "UNI56t", "UNI56df"] df_results_tree = pd.DataFrame(allresults, columns=cols) df_results_tree = pd.merge(df_results_tree, querystats_df, on='query') df_results_tree.to_csv(summary_out, sep="\t", index=False) except: print(f"Empty output for {summary_out.split('/')[-1].split('.')[0]}") with open(summary_out, "w") as outfile: pass if __name__ == '__main__': main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | import click import os.path import glob import pandas as pd @click.command() @click.option('-r', '--resultsdirparent', type=click.Path(exists=True), help='Parent directory of the summary files.') @click.option('-o', '--combinedout', type=click.Path(), help='Output file name.') def main(resultsdirparent, combinedout): """ Retain only annotation for d_NCLDV at highest available stringency. """ results = [] for sumout in glob.glob(resultsdirparent + "*/*.summary.tab"): try: df = pd.read_csv(sumout, index_col=None, header=0, sep="\t") results.append(df) except: print (sumout + " is empty") df_results_f = pd.concat(results, axis=0, ignore_index=True) df_results_f = df_results_f.fillna(0) # also output d_NCLDV-EUK, d_NCLDV-BAC, d_NCLDV-ARC, d_NCLDV-PHAGE df_results_f_gt3 = df_results_f[(df_results_f['domain'].str.startswith('d_NCLDV')) & (df_results_f['stringency'] == 'gte3')] df_results_f_gt2 = df_results_f[(df_results_f['domain'].str.startswith('d_NCLDV')) & (df_results_f['stringency'] == 'gte2')] df_results_f_gt1 = df_results_f[(df_results_f['domain'].str.startswith('d_NCLDV')) & (df_results_f['stringency'] == 'gte1')] df_results_f_majority = df_results_f[(df_results_f['domain'].str.startswith('d_NCLDV')) & (df_results_f['stringency'] == 'majority')] df_results_f_xgb = df_results_f[(df_results_f['xgb'].str.endswith('NCLDV'))] df_results_f_combined = pd.concat([df_results_f_gt3, df_results_f_gt2, df_results_f_gt1, df_results_f_majority, df_results_f_xgb]) df_results_f_combined = df_results_f_combined.drop_duplicates(subset=['query']) df_results_f_combined.to_csv(combinedout, sep='\t', index=None) if __name__ == '__main__': main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | import click import tarfile import os.path import shutil @click.command() @click.option('-r', '--resultsdir', type=click.Path(exists=True), help='Directory to be compressed.') @click.option('-o', '--tarout', type=click.Path(), help='Output tarfile name.') def main(resultsdir, tarout): """ Compress a directory to a tar.gz file and delete the original directory. """ with tarfile.open(tarout, "w:gz") as tar: tar.add(resultsdir, arcname=os.path.basename(resultsdir)) try: shutil.rmtree(resultsdir) except OSError as e: print(f"Error: {resultsdir} : {e.strerror}") if __name__ == '__main__': main() |
60 61 62 63 | shell: """ (python workflow/scripts/00reformat.py -i {input} -o {output[0]} -s {output[1]}) &> {log} """ |
83 84 85 86 87 | shell: """ python workflow/scripts/01genecalling_single.py -f {input} -g {output.gffout} -gs {output.genecalling_statsout} -ss {output.best_statsout} -fa {output.faafinalout} &> {log} cp {input} {output.fnafinalout} """ |
106 107 108 109 110 111 112 113 114 115 116 117 | shell: """ if [ -f {params.infilebase}.fna ]; then python workflow/scripts/01predict.py -q {params.infilebase}.fna -m {input.xgbD} -k {input.featuresD} -l {input.mappingsD} -o {output} &> {log}; else touch {output}; fi """ """ Step 1 Identify markers, extract, align """ |
133 134 135 136 | shell: """ (python workflow/scripts/02hmmsearch.py -q {input.queryfaa} -m {input.models} -h {output.GVOG9out} -t GVOG9 -c {output.hitcounts}) &> {log} """ |
154 155 156 157 | shell: """ (python workflow/scripts/02hmmsearch.py -q {input.queryfaa} -m {input.models} -h {output.UNI56out} -t UNI56 -c {output.hitcounts}) &> {log} """ |
171 172 173 174 175 | shell: """ python workflow/scripts/03extract_qhits.py -h {input.hmmout} -q {input.queryfaa} -o {output.queryhitsfaa} touch {output.queryhitsfaa} """ |
194 195 196 197 198 | shell: """ (python workflow/scripts/04blastp_reduce_merge.py -q {input.queryhitsfaa} -r {input.reffaa} -d {input.refdb} -b {output.blastpout} -o {output.mergedfaa}) >& {log} touch {output.blastpout} {output.mergedfaa} """ |
214 215 216 217 218 219 220 221 222 223 224 | shell: """ (python workflow/scripts/05align_trim.py -q {input.mergedfaa} -a {output.aln} -t {output.trimmedaln}) >& {log} touch {output.aln} {output.trimmedaln} """ """ Step 2 - Tree based Build protein trees, get nearest neighbor in trees """ |
238 239 240 241 242 | shell: """ (python workflow/scripts/06build_tree.py -a {input.trimmedaln} -t {output.tree}) >& {log} touch {output.tree} """ |
262 263 264 265 266 | shell: """ (python workflow/scripts/07get_nn_aln.py -q {params.queryname} -a {params.fdir}/queryrefs_aligned/ -o {output.aln_out} -l {labels} python workflow/scripts/08get_nn_tree.py -q {params.queryname} -t {params.fdir}/queryrefs_fasttree/ -o {output.tree_out} -l {labels}) &> {log} """ |
285 286 287 288 289 | shell: """ (python workflow/scripts/09summarize.py -n {input.nn_tree} -g {input.gvog9_count} -u {input.uni56_count} -q {input.querystats} -s {output} -c {input.xgbout} touch {output}) &> {log} """ |
304 305 306 307 308 | shell: """ python workflow/scripts/10combinedout.py -r {params.fdirp} -o {output.combined} touch {output.combined} """ |
323 324 325 326 327 | shell: """ #find {params.fdir} -type f -empty -print -delete python workflow/scripts/11cleanup.py -r {params.fdir} -o {output} """ |
Support
- Future updates
Related Workflows





