BGCFlow: Snakemake workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes)

BGCFlow is a systematic workflow for the analysis of biosynthetic gene clusters (BGCs) across large collections of genomes (pangenomes) from internal and public datasets.

At present, BGCFlow is only tested and confirmed to work on Linux systems with the conda/mamba package manager.
Publication
Matin Nuhamunada, Omkar S. Mohite, Patrick V. Phaneuf, Bernhard O. Palsson, and Tilmann Weber. (2023). BGCFlow: Systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets. bioRxiv 2023.06.14.545018; doi: https://doi.org/10.1101/2023.06.14.545018
Pre-requisites

BGCFlow requires gcc and the conda/mamba package manager. See the installation instructions for details.

Please use the latest available version of BGCFlow.
Quick Start

A quick and easy way to use BGCFlow is through the bgcflow_wrapper command line interface.

- Create a conda environment and install the BGCFlow Python wrapper:

```bash
# create and activate a new conda environment
conda create -n bgcflow pip -y
conda activate bgcflow

# install the `BGCFlow` wrapper
pip install git+https://github.com/NBChub/bgcflow_wrapper.git

# make sure to use bgcflow_wrapper version >= 0.2.7
bgcflow --version
```
- Additional pre-requisites: with the environment activated, install or set up the following configurations:

  - Relax the conda channel priority (set it to flexible or disabled):

    ```bash
    conda config --set channel_priority disabled
    conda config --describe channel_priority
    ```

  - Install Java (required to run metabase):

    ```bash
    conda install openjdk
    ```
- Deploy and run BGCFlow, changing <your_bgcflow_directory> accordingly:

```bash
# Deploy and run BGCFlow
bgcflow clone <your_bgcflow_directory>  # clone `BGCFlow` to <your_bgcflow_directory>
cd <your_bgcflow_directory>             # move to BGCFLOW_PATH
bgcflow init                            # initiate `BGCFlow` config and examples from template
bgcflow run -n                          # do a dry run; remove the "-n" flag to run the example dataset
```
- Build and serve the interactive report (after bgcflow run has finished). The report will be served at http://localhost:8001/:

```bash
# build a report
bgcflow build report

# show available projects
bgcflow serve

# serve the interactive report for a project
bgcflow serve --project Lactobacillus_delbrueckii
```
- For detailed usage and configuration, have a look at the BGCFlow WIKI (:warning: under development).

- Read more about bgcflow_wrapper for a detailed overview of the command line interface.
Workflow overview

The main Snakefile workflow comprises various pipelines for data selection, functional annotation, phylogenetic analysis, genome mining, and comparative genomics of prokaryotic datasets.

The pipelines available in the main Snakefile can be listed with:

```bash
bgcflow pipelines
```

List of Available Pipelines

The table below lists the pipeline keywords that can be run with the main Snakefile of BGCFlow; a sketch of how these keywords might be enabled in a project configuration is shown after the table.
No. | Keyword | Description | Links |
---|---|---|---|
0 | eggnog | Annotate samples with eggNOG database (http://eggnog5.embl.de) | eggnog-mapper |
1 | mash | Calculate distance estimation for all samples using MinHash. | Mash |
2 | fastani | Do pairwise Average Nucleotide Identity (ANI) calculation across all samples. | FastANI |
3 | automlst-wrapper | Simplified Tree building using autoMLST | automlst-simplified-wrapper |
4 | roary | Build pangenome using Roary. | Roary |
5 | eggnog-roary | Annotate Roary output using eggNOG mapper | eggnog-mapper |
6 | seqfu | Calculate sequence statistics using SeqFu. | seqfu2 |
7 | bigslice | Cluster BGCs using BiG-SLiCE (https://github.com/medema-group/bigslice) | bigslice |
8 | query-bigslice | Map BGCs to BiG-FAM database (https://bigfam.bioinformatics.nl/) | bigfam.bioinformatics.nl |
9 | checkm | Assess genome quality with CheckM. | CheckM |
10 | gtdbtk | Taxonomic placement with GTDB-Tk | GTDBTk |
11 | prokka-gbk | Copy annotated genbank results. | prokka |
12 | antismash | Summarizes antiSMASH result. | antismash |
13 | arts | Run Antibiotic Resistant Target Seeker (ARTS) on samples. | arts |
14 | deeptfactor | Use deep learning to find Transcription Factors. | deeptfactor |
15 | deeptfactor-roary | Use DeepTFactor on Roary outputs. | Roary |
16 | cblaster-genome | Build diamond database of genomes for cblaster search. | cblaster |
17 | cblaster-bgc | Build diamond database of BGCs for cblaster search. | cblaster |
18 | bigscape | Cluster BGCs using BiG-SCAPE | BiG-SCAPE |
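
Pipelines are typically switched on per project through the configuration files generated by `bgcflow init`. The snippet below is only a minimal sketch of what such a setup might look like; the file names, keys, and values shown here are assumptions, so consult the templates created by `bgcflow init` and the BGCFlow WIKI for the authoritative schema.

```yaml
# Hypothetical sketch of a BGCFlow project configuration -- key names, file paths,
# and defaults are assumptions; adapt them to the templates from `bgcflow init`.
projects:
  - name: Lactobacillus_delbrueckii      # example project from the Quick Start
    samples: config/samples.csv          # table of genome accessions / local assemblies
    rules: config/project_rules.yaml     # optional per-project pipeline switches

# Pipeline switches use the keywords from the table above, for example:
rules:
  seqfu: TRUE
  mash: TRUE
  antismash: TRUE
  bigscape: FALSE
```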
References
eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Carlos P. Cantalapiedra, Ana Hernandez-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas. 2021. Molecular Biology and Evolution, msab293
eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, Peer Bork Nucleic Acids Res. 2019 Jan 8; 47(Database issue): D309–D314. doi: 10.1093/nar/gky1085
Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.
Mash Screen: high-throughput sequence containment estimation for genome discovery. Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM. Genome Biol. 2019 Nov 5;20(1):232. doi: 10.1186/s13059-019-1841-x.
Jain, C., Rodriguez-R, L.M., Phillippy, A.M. et al. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 9, 5114 (2018). https://doi.org/10.1038/s41467-018-07641-9
Mohammad Alanjary, Katharina Steinke, Nadine Ziemert, AutoMLST: an automated web server for generating multi-locus species trees highlighting natural product potential, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W276–W282
Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill, 'Roary: Rapid large-scale prokaryote pan genome analysis', Bioinformatics, 2015;31(22):3691-3693 doi:10.1093/bioinformatics/btv421
Telatin, A., Birolo, G., & Fariselli, P. SeqFu [Computer software]. GITHUB: https://github.com/telatin/seqfu2
Satria A Kautsar, Justin J J van der Hooft, Dick de Ridder, Marnix H Medema, BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters, GigaScience, Volume 10, Issue 1, January 2021, giaa154
Satria A Kautsar, Kai Blin, Simon Shaw, Tilmann Weber, Marnix H Medema, BiG-FAM: the biosynthetic gene cluster families database, Nucleic Acids Research, gkaa812, https://doi.org/10.1093/nar/gkaa812
Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2014. Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25: 1043-1055.
Chaumeil PA, et al. 2019. GTDB-Tk: A toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics, btz848.
Parks DH, et al. 2020. A complete domain-to-species taxonomy for Bacteria and Archaea. Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8 .
Parks DH, et al. 2018. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology, http://dx.doi.org/10.1038/nbt.4229 .
Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063
antiSMASH 6.0: improving cluster detection and comparison capabilities. Kai Blin, Simon Shaw, Alexander M Kloosterman, Zach Charlop-Powers, Gilles P van Wezel, Marnix H Medema, & Tilmann Weber. Nucleic Acids Research (2021) doi: 10.1093/nar/gkab335.
Mungan,M.D., Alanjary,M., Blin,K., Weber,T., Medema,M.H. and Ziemert,N. (2020) ARTS 2.0: feature updates and expansion of the Antibiotic Resistant Target Seeker for comparative genome mining. Nucleic Acids Res.,10.1093/nar/gkaa374
Alanjary,M., Kronmiller,B., Adamek,M., Blin,K., Weber,T., Huson,D., Philmus,B. and Ziemert,N. (2017) The Antibiotic Resistant Target Seeker (ARTS), an exploration engine for antibiotic cluster prioritization and novel drug target discovery. Nucleic Acids Res.,10.1093/nar/gkx360
Kim G.B., Gao Y., Palsson B.O., Lee S.Y. 2020. DeepTFactor: A deep learning-based tool for the prediction of transcription factors. PNAS. doi: 10.1073/pnas.2021171118
Gilchrist, C., Booth, T. J., van Wersch, B., van Grieken, L., Medema, M. H., & Chooi, Y. (2021). cblaster: a remote search tool for rapid identification and visualisation of homologous gene clusters (Version 1.3.9) [Computer software]. https://doi.org/10.1101/2020.11.08.370601
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015) .
Navarro-Muñoz, J.C., Selem-Mojica, N., Mullowney, M.W. et al. A computational framework to explore large-scale biosynthetic diversity. Nat Chem Biol 16, 60–68 (2020)
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 | import json import logging import sys from pathlib import Path import pandas as pd log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def generate_arts_dict(df_arts): """ Convert arts table to dictionary """ arts_dict = {} arts_scaffold_count = df_arts.Source.value_counts().to_dict() for k in arts_scaffold_count.keys(): arts_dict[k] = {"counts": arts_scaffold_count[k]} arts_dict[k]["regions"] = [] for i in df_arts[df_arts.loc[:, "Source"] == k].index: regions = { "cluster_id": df_arts.loc[i, "#Cluster"], "products": df_arts.loc[i, "Type"], "location": df_arts.loc[i, "Location"], } arts_dict[k]["regions"].append(regions) return arts_dict def generate_arts_mapping(df_arts, as_json): arts_dict = generate_arts_dict(df_arts) # container for final result hit_mapper = {} contig_mapper = {} # iterate antismash json records for num, r in enumerate(as_json["records"]): # count number of detected regions per record region_count = len(r["areas"]) # if antismash detects region if region_count > 0: contig_id = r["id"] logging.info( f"Contig {contig_id} has {region_count} regions detected. Finding corresponding scaffold in ARTS2..." ) # find arts scaffold with the same region number detected arts_match = [ k for k in arts_dict.keys() if int(arts_dict[k]["counts"]) == int(region_count) ] logging.debug(f"Finding matches from: {arts_match}") for n, a in enumerate(r["areas"]): bgc_id = f"{contig_id}.region{str(n+1).zfill(3)}" location = f"{a['start']} - {a['end']}" products = ",".join(a["products"]) logging.debug(f"Finding match for: {bgc_id} | {location} | {products}") for m in arts_match: bgc_match = arts_dict[m]["regions"][n] if (bgc_match["location"] == location) and ( bgc_match["products"] == products ): logging.debug( f"Found match! {bgc_match['cluster_id']} | {bgc_match['location']} | {bgc_match['products']}" ) # logging.debug(f"Found match! 
{contig_id} == ARTS2 {m}") hit_mapper[bgc_match["cluster_id"]] = bgc_id contig_mapper[bgc_match["cluster_id"]] = contig_id return hit_mapper, arts_dict, contig_mapper def extract_arts_table(input_table, input_as_json, outfile, genome_id=None): """ Given an ARTS2 bgc table, map the corresponding cluster hits to antismash regions """ df_arts = pd.read_csv(input_table, sep="\t") with open(input_as_json, "r") as f: as_json = json.load(f) if genome_id is None: logging.info("Assuming genome_id from ARTS2 input") genome_id = Path(input_table).parents[1].stem logging.info(f"Extracting ARTS2 BGC hits from: {genome_id}") logging.info("Generating ARTS2 to antiSMASH region mapper...") hit_mapper, arts_dict, contig_mapper = generate_arts_mapping(df_arts, as_json) logging.info("Mapping ARTS2 cluster to antiSMASH regions...") df = pd.DataFrame(columns=df_arts.columns) for i in df_arts.index: # only extract hits if len(df_arts.loc[i, "Genelist"]) > 2: cluster_id = df_arts.loc[i, "#Cluster"] bgc_id = hit_mapper[cluster_id] contig_id = contig_mapper[cluster_id] df.loc[i, :] = df_arts.loc[i, :] df.loc[i, "#Cluster"] = bgc_id df.loc[i, "Source"] = contig_id df.loc[i, "genome_id"] = genome_id df = df.rename(columns={"#Cluster": "bgc_id"}).set_index("bgc_id") logging.info(f"Writing results to {outfile}") df.T.to_json(outfile) return if __name__ == "__main__": extract_arts_table(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 | import json import logging import sys from pathlib import Path import pandas as pd from alive_progress import alive_bar log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def generate_change_dict(change_log): logging.info("Generate BGC ids mapping dict...") change_log = Path(change_log) change_log = [i for i in change_log.glob("*/*.json")] change_dict = {} for i in change_log: with open(i, "r") as f: data = json.load(f) for k in data.keys(): change_dict[k] = data[k] return change_dict def correct_arts_bgc_ids(arts_json, change_dict, genome_id): logging.debug(f"Correcting BGC ids for {genome_id}") output = {} with open(arts_json, "r") as f: query = json.load(f) for k in query.keys(): correct_bgc_id = Path( change_dict[genome_id][f"{k}.gbk"]["symlink_path"] ).stem output[correct_bgc_id] = query[k] return output def combine_arts_json(input_json, change_log_path, table): logging.info("Combining and correcting ARTS output...") input_json = Path(input_json) logging.info(input_json) if input_json.is_file() and input_json.suffix == ".json": logging.info(f"Getting ARTS overview from a single file: {input_json}") input_json_files = input_json elif input_json.is_file() and input_json.suffix == ".txt": logging.info(f"Getting ARTS overview from a text file: {input_json}") with open(input_json, "r") as file: file_content = [i.strip("\n") for i in file.readlines()] if len(file_content) == 1: # Paths space-separated on a single line paths = file_content[0].split() else: # Paths written on separate lines paths = file_content input_json_files = [ Path(path) for path in paths if Path(path).suffix == ".json" ] else: input_json_files = [ Path(file) for file in str(input_json).split() if Path(file).suffix == ".json" ] logging.info( f"Getting ARTS overview from the given list of {len(input_json_files)} files..." ) container = {} change_dict = generate_change_dict(change_log_path) with alive_bar(len(input_json_files), title="Merging json:") as bar: for j in input_json_files: arts_json = Path(j) genome_id = arts_json.stem value = correct_arts_bgc_ids(arts_json, change_dict, genome_id) container.update(value) bar() df = pd.DataFrame.from_dict(container).T df.index.name = "bgc_id" logging.debug(f"Writing file to: {table}") # Save dataframes to csv tables df_table = Path(table) df_table.parent.mkdir(parents=True, exist_ok=True) df.to_csv(table) logging.info("Job done") return None if __name__ == "__main__": combine_arts_json(sys.argv[1], sys.argv[2], sys.argv[3]) |
```python
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.INFO)


def csv_to_parquet(project_folder, output_folder="."):
    """
    Given a list of csv files, convert into parquet files
    """
    project_folder = Path(project_folder)
    if output_folder == ".":
        output_folder = project_folder / "data_warehouse"
    else:
        pass
    logging.info(
        f"Grabbing all csv from folder {project_folder} and saving parquets in {output_folder}"
    )
    for i in project_folder.rglob("*.csv"):
        if "docs" in str(i):
            pass
        elif "dbt" in str(i):
            pass
        elif "assets" in str(i):
            pass
        elif "ipynb_checkpoints" in str(i):
            pass
        else:
            category = str(i).split("/")[3]
            output_subfolder = output_folder / category
            df = pd.read_csv(i)
            # df = df.fillna("")
            # df.columns = [c.replace(".", "_") for c in df.columns]
            output_parquet = output_subfolder / f"{i.stem}.parquet"
            logging.info(f"Converting {i} to {output_parquet}")
            output_subfolder.mkdir(parents=True, exist_ok=True)
            df.to_parquet(output_parquet)
    return


if __name__ == "__main__":
    csv_to_parquet(sys.argv[1])
```
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 | import json import logging import sys from pathlib import Path from Bio import SeqIO log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def bgc_downstream_prep(input_dir, output_dir, selected_bgcs=False): """ Given an antiSMASH directory, check for changed name """ logging.info(f"Reading input directory: {input_dir}") path = Path(input_dir) if not path.is_dir(): raise FileNotFoundError(f"No such file or directory: {path}") genome_id = path.name logging.debug(f"Deducting genome id as {genome_id}") change_log = {genome_id: {}} for gbk in path.glob("*.gbk"): logging.info(f"Parsing file: {selected_bgcs}") logging.info(f"Parsing file: {gbk.name}") region = SeqIO.parse(str(gbk), "genbank") for record in region: record_log = {} if "comment" in record.annotations: filename = gbk.name try: original_id = record.annotations["structured_comment"][ "antiSMASH-Data" ]["Original ID"].split()[0] except KeyError: original_id = record.id logging.warning( f"Found shortened record.id: {record.id} <- {original_id}." ) # generate symlink outpath = Path(output_dir) / genome_id outpath.mkdir(parents=True, exist_ok=True) new_filename = filename.replace(record.id, original_id) target_path = Path.cwd() / gbk # target for symlink link = outpath / new_filename logging.info(f"Generating symlink: {link}") try: link.symlink_to(target_path) except FileExistsError: logging.warning( f"Previous symlink exist, updating target: {link} -> {target_path}" ) link.unlink() link.symlink_to(target_path) record_log["record_id"] = record.id record_log["original_id"] = original_id record_log["target_path"] = str(gbk) record_log["symlink_path"] = str(link) change_log[genome_id][filename] = record_log with open( outpath / f"{genome_id}-change_log.json", "w", encoding="utf8" ) as json_file: json.dump(change_log, json_file, indent=4) logging.info(f"{genome_id}: Job done!\n") return if __name__ == "__main__": bgc_downstream_prep(sys.argv[1], sys.argv[2]) |
```python
import logging
import sys
from pathlib import Path

log_format = "%(levelname)-8s %(asctime)s %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def bigscape_copy(input_index, output_main):
    input = Path(input_index)
    output = Path(output_main)
    output.mkdir(parents=True, exist_ok=True)
    for item in input.parent.glob("*"):
        target = output / item.name
        logging.debug(f"Generating symlink for: {target} --> {item.resolve()}")
        try:
            target.symlink_to(item.resolve(), target_is_directory=True)
        except FileExistsError as e:
            logging.debug(f"Got error:\n{e}")
    return


if __name__ == "__main__":
    bigscape_copy(sys.argv[1], sys.argv[2])
```
```python
import sys

import pandas as pd


def prep_bigslice(df_gtdb_output, df_bigslice_output):
    df = pd.read_csv(df_gtdb_output)

    # Format to BiG-Slice
    df_bigslice = df.rename(columns={"genome_id": "#Genome Folder"})
    df_bigslice["#Genome Folder"] = df_bigslice["#Genome Folder"].apply(
        lambda x: f"{x}/"
    )

    df_bigslice.to_csv(df_bigslice_output, index=False, sep="\t")
    return None


if __name__ == "__main__":
    prep_bigslice(sys.argv[1], sys.argv[2])
```
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 | import logging import sys from pathlib import Path import pandas as pd log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) # create intermediate folder def prepare_bigslice(input_folder, taxonomy_file, project_name, output_folder): """ Preparing bigslice input folder for a single project """ logging.info(f"Preparing bigslice input folder for {project_name}") input_folder = Path(input_folder) output_folder = Path(output_folder) assert input_folder.is_dir() output_folder.mkdir(parents=True, exist_ok=True) # create symlink to antismash logging.info("Creating symlink to AntiSMASH BGCs...") antismash_dir = output_folder / project_name try: antismash_dir.symlink_to(input_folder.resolve(), target_is_directory=True) except FileExistsError: pass # create tax info logging.info("Getting taxonomic information...") taxonomy_dir = output_folder / "taxonomy" taxonomy_dir.mkdir(parents=True, exist_ok=True) df_tax = pd.read_csv(taxonomy_file, sep="\t") df_tax = df_tax.loc[ :, [ "#Genome Folder", "Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Organism", ], ] df_tax.to_csv(taxonomy_dir / f"{project_name}.tsv", sep="\t", index=False) # build metadata logging.info("Preparing metadata...") metadata = { "# Dataset name": project_name, "Path to folder": f"{output_folder.name}/", "Path to taxonomy": f"taxonomy/{project_name}.tsv", "Description": project_name, } df_metadata = pd.DataFrame.from_dict(metadata, orient="index").T df_metadata.to_csv(output_folder / "datasets.tsv", sep="\t", index=False) logging.info("Job done!") return if __name__ == "__main__": prepare_bigslice(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4]) |
```python
import logging
import sys

import numpy as np
import pandas as pd

log_format = "%(levelname)-8s %(asctime)s %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def triangle_convert(matrix, outfile):
    df_raw = pd.read_csv(matrix, index_col=0)
    genome_id_list = []
    for idx in df_raw.index:
        genome_id = idx.split("\t")[0].split("/")[-1].split(".fna")[0]
        genome_id_list.append(genome_id)

    df = pd.DataFrame(0, index=genome_id_list, columns=genome_id_list)
    for idx in df_raw.index:
        genome_id = idx.split("\t")[0].split("/")[-1].split(".fna")[0]
        for cntr in range(len(idx.split("\t"))):
            if cntr > 0:
                value = idx.split("\t")[cntr]
                # if value is NA, replace with numpy null
                if value == "NA":
                    value = np.nan
                else:
                    value = float(value)
                df.loc[genome_id, genome_id_list[cntr - 1]] = value
                df.loc[genome_id_list[cntr - 1], genome_id] = value

    df.index.name = "genome_id"
    df.to_csv(outfile)
    return


if __name__ == "__main__":
    triangle_convert(sys.argv[1], sys.argv[2])
```
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | import json import logging import sys from pathlib import Path import pandas as pd from alive_progress import alive_bar log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def combine_deeptfactor_prediction(input_json, filter_str="_deeptfactor"): container = {} logging.info("Reading json files...") with alive_bar(len(input_json), title="Merging DeepTFactor prediction:") as bar: for item in input_json: item = Path(item) genome_id = item.stem if filter_str in genome_id: genome_id = genome_id.replace(filter_str, "") logging.debug(f"Processing {genome_id}") with open(item, "r") as f: data = json.load(f) container.update(data) bar() return container def write_deeptf_table(input_json, deeptf_table): """ Write df_deeptfactor.csv table in processed data """ # Handle multiple json input_json = Path(input_json) logging.info(input_json) if input_json.is_file() and input_json.suffix == ".json": logging.info(f"Getting deepTFactor overview from a single file: {input_json}") input_json_files = input_json elif input_json.is_file() and input_json.suffix == ".txt": logging.info(f"Getting deepTFactor overview from a text file: {input_json}") with open(input_json, "r") as file: file_content = [i.strip("\n") for i in file.readlines()] if len(file_content) == 1: # Paths space-separated on a single line paths = file_content[0].split() else: # Paths written on separate lines paths = file_content input_json_files = [ Path(path) for path in paths if Path(path).suffix == ".json" ] else: input_json_files = [ Path(file) for file in str(input_json).split() if Path(file).suffix == ".json" ] logging.info( f"Getting deepTFactor overview from the given list of {len(input_json_files)} files..." ) df = combine_deeptfactor_prediction(input_json_files) df = pd.DataFrame.from_dict(df).T df.index.name = "locus_tag" logging.debug(f"Writing file to: {deeptf_table}") # Save dataframes to csv tables df_table = Path(deeptf_table) df_table.parent.mkdir(parents=True, exist_ok=True) df.to_csv(deeptf_table, index=True) logging.info("Job done") return None if __name__ == "__main__": write_deeptf_table(sys.argv[1], sys.argv[2]) |
```python
import sys
from pathlib import Path

import pandas as pd


def deeptfactor_to_json(prediction, outfile, keep_false=True):
    genome_id = Path(prediction).parent.stem
    df = pd.read_csv(prediction, sep="\t")
    df.loc[:, "genome_id"] = genome_id
    # rename columns
    df = df.rename(
        columns={"prediction": "deeptfactor_prediction", "score": "deeptfactor_score"}
    )
    df = df.set_index("sequence_ID", drop=True)
    if not keep_false:
        # keep only sequences predicted to be transcription factors
        df = df[df["deeptfactor_prediction"] == True]  # noqa: E712
    df.T.to_json(outfile)
    return


if __name__ == "__main__":
    deeptfactor_to_json(sys.argv[1], sys.argv[2])
```
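As a usage illustration for the converter above: the input is the tab-separated DeepTFactor prediction table with the `sequence_ID`, `prediction`, and `score` columns implied by the rename; the paths below are placeholders, not BGCFlow's actual layout.

```python
# Hypothetical direct call -- paths are placeholders; in the workflow the script is
# invoked as `python <script> <prediction_table> <output_json>` (see __main__ above).
deeptfactor_to_json(
    "deeptfactor/GCF_000000000.1/prediction_result.txt",  # placeholder prediction table
    "GCF_000000000.1_deeptfactor.json",                   # placeholder output path
    keep_false=True,                                      # keep non-TF predictions as well
)
```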
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 | import logging import sys from pathlib import Path import pandas as pd log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def extract_ncbi_information(assembly_report_path, outfile): """ Extract NCBI assembly json reports generated by get_assembly_information.py in a given directory into a csv file Parameters ---------- 1. assembly_report_path : str / path Three different input types can be used: - A directory containing assembly information json files generated by get_assembly_information.py - A single json file generated by get_assembly_information.py - A list of files as string separated by spaces (put the string inside '' in bash expression) - A text file (.txt) containing the paths of the json files (one per line, or space-separated on a single line) 2. outfile : str / path Location of the output csv file Returns ------- 1. outfile : .csv file A comma-separated table summarizing the NCBI assembly reports metadata """ assembly_report_path = Path(assembly_report_path) outfile = Path(outfile) if assembly_report_path.is_dir(): logging.info( f"Summarizing NCBI assembly information from {assembly_report_path}..." ) ncbi_json_files = [ file for file in assembly_report_path.iterdir() if file.is_file() and file.suffix == ".json" ] elif assembly_report_path.is_file() and assembly_report_path.suffix == ".json": logging.info( f"Getting NCBI assembly information from a single file: {assembly_report_path}" ) ncbi_json_files = [assembly_report_path] elif assembly_report_path.is_file() and assembly_report_path.suffix == ".txt": logging.info( f"Reading NCBI assembly information from a text file: {assembly_report_path}" ) with open(assembly_report_path, "r") as file: file_content = [i.strip("\n") for i in file.readlines()] if len(file_content) == 1: # Paths space-separated on a single line paths = file_content[0].split() else: # Paths written on separate lines paths = file_content ncbi_json_files = [ Path(path) for path in paths if Path(path).suffix == ".json" ] else: ncbi_json_files = [ Path(file) for file in str(assembly_report_path).split() if Path(file).suffix == ".json" ] logging.info( f"Summarizing NCBI assembly information from the given list of {len(ncbi_json_files)} files..." ) dfs = [] # an empty list to store the data frames for file in ncbi_json_files: data = pd.read_json(file).T # read data frame from json file dfs.append(data) # append the data frame to the list df_ncbi = pd.concat(dfs) # concatenate all the data frames in the list. df_ncbi.reset_index(inplace=True) df_ncbi = df_ncbi.rename(columns={"index": "genome_id"}) logging.info(f"Summarized {len(df_ncbi.index)} NCBI assembly information.") df_ncbi.to_csv(outfile, index=False) return None if __name__ == "__main__": extract_ncbi_information(sys.argv[1], sys.argv[2]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 | import sys import pandas as pd def extract_ncbi_information( ncbi_meta_path, patric_genome_summary, patric_genome_metadata, patric_meta_path ): """ Extract PATRIC assembly json reports generated by get_assembly_information.py in a given directory into a csv file Parameters ---------- 1. assembly_report_path : str / path Three different input types can be used: - A directory containing assembly information json files generated by get_assembly_information.py - A single json file generated by get_assembly_information.py - A list of files as string separated by spaces (put the string inside '' in bash expression) 2. outfile : str / path Location of the output csv file Returns ------- 1. outfile : .csv file A comma separated table summarizing the NCBI assembly reports metadata ''' """ df_ncbi = pd.read_csv(ncbi_meta_path, index_col="genome_id") df_patric_genome_summary = pd.read_csv( patric_genome_summary, sep="\t", dtype={"genome_id": str, "genome_name": str, "taxon_id": str, "strain": str}, ) df_patric_genome_summary.set_index("genome_id", inplace=True) df_patric_genome_metadata = pd.read_csv( patric_genome_metadata, sep="\t", dtype={"genome_id": str, "genome_name": str, "taxon_id": str}, ) df_patric_genome_metadata.set_index("genome_id", inplace=True) df_patric_genome_metadata = df_patric_genome_metadata[ df_patric_genome_metadata["genome_status"] != "Plasmid" ] df_patric_meta = pd.DataFrame() for genome_id in df_ncbi.index: refseq = df_ncbi.loc[genome_id, "refseq"] genbank = df_ncbi.loc[genome_id, "genbank"] if refseq in df_patric_genome_metadata["assembly_accession"].tolist(): patric_genome_ids_list = df_patric_genome_metadata[ df_patric_genome_metadata["assembly_accession"] == refseq ].index.tolist() elif genbank in df_patric_genome_metadata["assembly_accession"].tolist(): patric_genome_ids_list = df_patric_genome_metadata[ df_patric_genome_metadata["assembly_accession"] == genbank ].index.tolist() for patric_genome_id in patric_genome_ids_list: df_patric_meta.loc[patric_genome_id, "bgcflow_genome_id"] = genome_id df_patric_meta.loc[ patric_genome_id, df_patric_genome_metadata.columns ] = df_patric_genome_metadata.loc[ patric_genome_id, df_patric_genome_metadata.columns ] if patric_genome_id in df_patric_genome_summary.index: for col in df_patric_genome_summary.columns: if col not in df_patric_genome_metadata.columns: df_patric_meta.loc[ patric_genome_id, col ] = df_patric_genome_summary.loc[patric_genome_id, col] df_patric_meta.index.name = "patric_genome_id" df_patric_meta.to_csv(patric_meta_path) return None if __name__ == "__main__": extract_ncbi_information(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 | import json import logging import sys from pathlib import Path import pandas as pd log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def summarize_gtdb_json(accession_list, df_gtdb_output): """ Given a string of json filepaths generated by gtdb_prep.py, merge them together into one csv table """ # Reading input logging.info("Reading GTDB metadata .json files...") input_json = Path(accession_list) if input_json.is_file() and input_json.suffix == ".json": logging.info(f"Getting GTDB taxonomy from a single file: {input_json}") input_json_files = input_json elif input_json.is_file() and input_json.suffix == ".txt": logging.info(f"Getting GTDB taxonomy from a text file: {input_json}") with open(input_json, "r") as file: file_content = [i.strip("\n") for i in file.readlines()] if len(file_content) == 1: # Paths space-separated on a single line paths = file_content[0].split() else: # Paths written on separate lines paths = file_content input_json_files = [ Path(path) for path in paths if Path(path).suffix == ".json" ] else: input_json_files = [ Path(file) for file in str(input_json).split() if Path(file).suffix == ".json" ] logging.info( f"Getting GTDB_taxonomy from the given list of {len(input_json_files)} files..." ) accession = input_json_files out = [] for a in accession: logging.info(f"Reading {a}...") with open(a, "r") as f: out.append(json.load(f)) df = pd.DataFrame(out).set_index("genome_id", drop=False) # Getting taxonomic information logging.info("Getting taxonomic information...") df_taxonomy = pd.DataFrame.from_dict( {i: df.loc[i, "gtdb_taxonomy"] for i in df.index} ).T # Split between species (specific epithet) and organism df_taxonomy["organism"] = df_taxonomy["species"] for idx in df_taxonomy.index: try: df_taxonomy.loc[idx, "species"] = df_taxonomy.loc[idx, "species"].split( " " )[1] except IndexError: # leave blank for empty taxonomy pass # Tidying up df_taxonomy.columns = [c.title() for c in df_taxonomy.columns] # Getting other metadata try: logging.info("Getting metadata into table...") if "metadata" not in df.columns: logging.warning( "metadata is not in the column information. Adding default values..." ) df["metadata"] = [{"genome_id": genome_id} for genome_id in df.index] if "gtdb_release" not in df.columns: logging.warning( "gtdb_release is not in the column information. Adding default values..." 
) df["gtdb_release"] = "unknown" metadata = pd.DataFrame.from_dict( {i: df.loc[i, "metadata"] for i in df.index} ).T metadata_container = [] for c in metadata.columns: value = {i: metadata.loc[i, c] for i in metadata.index} try: metadata_container.append(pd.DataFrame.from_dict(value).T) except ValueError: metadata_container.append( pd.DataFrame([value]).T.rename(columns={0: c}) ) df_metadata = pd.concat(metadata_container, axis=1) logging.info("Finalizing table...") df_final = pd.concat( [df.loc[:, ["genome_id", "gtdb_release"]], df_taxonomy, df_metadata], axis=1 ) except KeyError: logging.info("No additional metadata found.") logging.info("Finalizing table...") df_final = pd.concat( [df.loc[:, ["genome_id", "gtdb_release"]], df_taxonomy], axis=1 ) # save to file logging.info(f"Writing to file: {df_gtdb_output}") df_final.to_csv(df_gtdb_output, index=False) return None if __name__ == "__main__": summarize_gtdb_json(sys.argv[1], sys.argv[2]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 | import json import logging import re import subprocess import sys from datetime import datetime from pathlib import Path import pandas as pd from Bio import SeqIO log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def get_git_version(): """ Get the sha1 of the current git version """ git_version = "" try: version_cmd = subprocess.run( ["git", "rev-parse", "--short", "HEAD"], universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, ) status_cmd = subprocess.run( ["git", "status", "--porcelain"], universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, ) git_version = str(version_cmd.stdout.strip()) changes = str(status_cmd.stdout).strip() if changes: git_version += "(changed)" except OSError: pass return git_version def get_version(version): """ Get the current version string """ git_version = get_git_version() if git_version: version += "-%s" % git_version return version def add_bgcflow_comments(gbk_in_path, version, json_path, genome_id, gbk_out_path): """ Add bgcflow meta-annotation to genbank output """ version = get_version(version) logging.info(f"Formatting genbank file: {gbk_in_path}") temp_gbk = Path(gbk_in_path).parent / f"{genome_id}-change_log.gbk" try: [record for record in SeqIO.parse(gbk_in_path, "genbank")] except ValueError as e: logging.warning(f"Parsing fail: {e}") logging.info("Attempting to fix genbank file...") change_log_path = Path(gbk_in_path).parent / f"{genome_id}-change_log.json" change_log = correct_collided_headers( gbk_in_path, genome_id, temp_gbk, json_dump=change_log_path ) logging.info("Retry parsing with Bio.SeqIO...") if temp_gbk.is_file(): records = SeqIO.parse(temp_gbk, "genbank") temp_gbk.unlink() else: records = SeqIO.parse(gbk_in_path, "genbank") df_gtdb_meta = pd.read_json(json_path, orient="index").T df_gtdb_meta.fillna("Unclassified", inplace=True) df_tax = pd.DataFrame([i for i in df_gtdb_meta.gtdb_taxonomy]) for col in df_tax.columns: df_tax[col] = df_tax[col].apply(lambda x: x.split("__")[1]) df_tax.columns = [c.title() for c in df_tax.columns] tax_levels = ["Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"] taxonomy_str = df_tax.loc[0, tax_levels].tolist() logging.debug(f"Taxonomy found: {taxonomy_str}") bgcflow_comment = ( "##BGCflow-Data-START##\n" "Version :: {version}\n" "Run date :: {date}\n" ).format( version=version, date=str(datetime.now().strftime("%Y-%m-%d 
%H:%M:%S")), ) new_records = [] ctr = 0 for record in records: logging.info(f"Formatting record {ctr}...") comment = bgcflow_comment if "comment" in record.annotations: record.annotations["comment"] += "\n" + comment else: record.annotations["comment"] = comment try: if not change_log[ctr]["accept_format"]: record.annotations[ "comment" ] += f"Original ID :: {change_log[ctr]['original_id']}\n" except UnboundLocalError: pass record.annotations["comment"] += "##BGCflow-Data-END##" if "organism" in record.annotations: organism = record.annotations["organism"] if "Unclassified" in organism: record.annotations["organism"] = organism.split(" Unclassified")[ 0 ].strip() record.annotations["taxonomy"] = taxonomy_str new_records.append(record) ctr = ctr + 1 logging.info(f"Writing final output: {gbk_out_path}") with open(gbk_out_path, "w") as output_handle: SeqIO.write(new_records, output_handle, "genbank") # -- Correcting IDs --# # 1. Traverse text, make a library of record header, length, and ids # 2. Check the library for collided locus name and length # 3. Propose changes # 4. Make sure changes is unique # 5. Write a new genbank file and recorded changes as json file def correct_collided_headers( gbk_in_path, accession_id, outfile, json_dump="header.json" ): record_headers = get_record_headers(gbk_in_path) record_headers = shorten_record_headers(record_headers, accession_id) with open(json_dump, "w", encoding="utf8") as json_file: json.dump(record_headers, json_file, indent=4) logging.info(f"Writing result to: {outfile}") with open(gbk_in_path) as f: s = f.read() with open(outfile, "w") as f: for k in record_headers.keys(): if not record_headers[k]["accept_format"]: s = s.replace( record_headers[k]["original_header"], record_headers[k]["new_header"], ) f.write(s) return record_headers def get_record_headers(gbk_in_path): """ Get the header information for each records in a genbank file. Returns a json-like python dictionary. """ record_headers = {} # Find all headers with open(gbk_in_path, "r") as file: logging.info(f"Reading file as text: {gbk_in_path}") ctr = 0 for line in file: if line.startswith("LOCUS"): record_headers[ctr] = {"original_header": line} if "source" in line: length = line.split()[-1].split("..")[-1] record_headers[ctr]["length"] = length ctr = ctr + 1 logging.debug(f"Found {len(record_headers)} records.") # Check for collided headers logging.info("Checking header format...") for k in record_headers.keys(): query = record_headers[k]["original_header"].split() if query != 7 and query[3] != "bp": logging.warning("Record name and length collide in the LOCUS line") query_length = record_headers[k]["length"] if query_length in query[1]: original_id = query[1].replace(query_length, "") record_headers[k]["original_id"] = original_id record_headers[k]["accept_format"] = False else: record_headers[k]["original_id"] = query[1] record_headers[k]["accept_format"] = True return record_headers def modify_id(accession_id, original_id, locus_index, record_counts, id_collection): """ Attempt to fix collided name and length in the locus name by shortening id """ logging.info(f"{original_id}: Attempting to change locus name...") # For refseq record, remove the accession prefix (first two digits) from string if accession_id.startswith("GCF"): logging.debug( f"{original_id}: Assuming locus came from Refseq. Removing refseq accession prefix." 
) refseq_type, refseq_number, refseq_version = re.split("_|[.]", original_id) new_id = f"{refseq_number}.{refseq_version}" elif accession_id.startswith("GCA"): logging.info( f"{original_id}: Assuming locus came from Genbank. Removing version from locus name..." ) new_id, genbank_version = re.split("[.]", original_id) # For unknown source remove last 4 digit, add the contig index in front. # example: NZAJABAQG010000001.1 --> c1|NZAJABAQG010000 else: logging.info( f"{original_id}: Cannot determine source. Shortening locus name..." ) digit = len(str(record_counts)) contig_number = str(locus_index + 1) new_id = f"c{contig_number.zfill(digit)}_{original_id[:-(digit + 4)]}" # make sure new_id is unique logging.info(f"{original_id}: Making sure id is unique...") if new_id in id_collection: raise else: logging.debug(f"{original_id}: shortened to {new_id}") return new_id def shorten_record_headers(record_headers, accession_id): """ Shorten record headers """ record_counts = len(record_headers.keys()) id_collection = [record_headers[k]["original_id"] for k in record_headers.keys()] for k in record_headers.keys(): if record_headers[k]["accept_format"]: pass else: original_id = record_headers[k]["original_id"] old_header = record_headers[k]["original_header"] # Correct header name new_id = modify_id( accession_id, original_id, k, record_counts, id_collection ) new_id_replacement_value = ( f"{new_id}{(len(original_id) - len(new_id)) * ' '}" ) new_header = old_header.replace(original_id, new_id_replacement_value) # Add value to dict record_headers[k]["new_id"] = new_id record_headers[k]["new_header"] = new_header return record_headers if __name__ == "__main__": add_bgcflow_comments( sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5] ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 | import json import logging import sys from pathlib import Path import pandas as pd log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def correct_bgc_id_overview(overview_file, mapping_file, genome_id=False): """ Use a mapping file to correct bgc_ids Parameters ---------- overview_file : str / path Path to the BGC overview file in JSON format mapping_file : str / path Path to the mapping file in JSON format genome_id : str, optional Genome ID to correct BGC IDs for, if not provided, it is extracted from the overview file name Returns ------- new_dict : dict Corrected BGC overview dictionary with updated BGC IDs """ logging.info(f"Correcting shortened bgc ids for {genome_id}...") overview_path = Path(overview_file) mapping_path = Path(mapping_file) with open(overview_path, "r") as f: overview = json.load(f) with open(mapping_path, "r") as f: mapping = json.load(f) if not genome_id: genome_id = overview_path.stem.strip("_bgc_overview") else: pass new_dict = {} for bgc_id in overview.keys(): for m in mapping[genome_id].keys(): if bgc_id in m: log_change = mapping[genome_id][m] correct_bgc_id = Path(log_change["symlink_path"]).stem if log_change["record_id"] != log_change["original_id"]: logging.debug(f"Replacing {bgc_id} to {correct_bgc_id}") new_dict[correct_bgc_id] = overview[bgc_id] new_dict[correct_bgc_id]["accession"] = log_change["original_id"] else: new_dict[correct_bgc_id] = overview[bgc_id] pass new_dict[correct_bgc_id]["source"] = "bgcflow" new_dict[correct_bgc_id]["gbk_path"] = log_change["target_path"] logging.info("Done!") return new_dict def gather_bgc_overview(input_json, mapping_dir, table): """ Gather BGC overview data from multiple JSON files and create a merged table Parameters ---------- input_json : str Two different input types can be used: - Space-separated paths to BGC overview JSON files (put the string inside '' in bash expression) - A text file (.txt) containing the paths of the json files (one per line, or space-separated on a single line) mapping_dir : str / path Directory containing mapping files table : str / path Path to the output merged table Returns ------- None """ input_json = Path(input_json) logging.info(input_json) if input_json.is_file() and input_json.suffix == ".json": logging.info(f"Getting BGC overview from a single file: {input_json}") input_json_files = input_json elif input_json.is_file() and input_json.suffix == ".txt": logging.info(f"Getting BGC overview from a text file: {input_json}") with open(input_json, "r") as file: file_content = [i.strip("\n") for i in file.readlines()] if len(file_content) == 1: # Paths space-separated on a single line paths = file_content[0].split() else: # Paths written on separate lines paths = file_content input_json_files = [ Path(path) for path in paths if Path(path).suffix == ".json" ] else: input_json_files = [ Path(file) for file in str(input_json).split() if Path(file).suffix == ".json" ] logging.info( f"Getting BGC overview from the given list of 
{len(input_json_files)} files..." ) merged_dict = {} for j in input_json_files: mapping_file = Path(j) genome_id = mapping_file.name.replace("_bgc_overview.json", "") mapping_path = Path(mapping_dir) / f"{genome_id}/{genome_id}-change_log.json" corrected = correct_bgc_id_overview(mapping_file, mapping_path, genome_id) merged_dict.update(corrected) df = pd.DataFrame.from_dict(merged_dict).T df.index.name = "bgc_id" logging.debug(f"Writing file to: {table}") # Save dataframes to csv tables df_table = Path(table) df_table.parent.mkdir(parents=True, exist_ok=True) df.to_csv(table) logging.info("Job done") return None if __name__ == "__main__": gather_bgc_overview(sys.argv[1], sys.argv[2], sys.argv[3]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 | import json import logging import sys from pathlib import Path log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def region_table_builder(f, accession): """ Given a feature of a record, return value to build database table """ # grab region values region_number = f["qualifiers"]["region_number"][0] region_id = f"{accession}.region{str(region_number).zfill(3)}" location = f["location"].strip("[").strip("]") start_pos, end_pos = location.split(":") contig_edge = f["qualifiers"]["contig_edge"][0] # fill values value = { region_id: { "accession": accession, "region_number": region_number, "location": location, "start_pos": start_pos, "end_pos": end_pos, "contig_edge": contig_edge, "product": f["qualifiers"]["product"], "rules": f["qualifiers"]["rules"], } } return value def get_antismash_overview(json_path, outfile, genome_id=False, n_hits=1): """ Read antismash json output and get the summary page into a json format """ path = Path(json_path) with open(path, "r") as f: data = json.load(f) logging.info(f"Processing: {json_path}, custom genome_id: {genome_id}") if not genome_id: genome_id = data["input_file"].strip(".gbk") else: pass logging.debug(f"Genome id: {genome_id}") # iterating over record output = {} for r, record in enumerate(data["records"]): logging.info(f"Getting antismash regions from record: {record['id']}") region_db = {} region_feat = [i for i in record["features"] if i["type"] == "region"] for f in region_feat: region_db.update(region_table_builder(f, record["id"])) for c, area in enumerate(record["areas"]): cluster_id = f"{r+1}.{c+1}" output_cluster = {} logging.info(f"Grabbing information from region {cluster_id}") # _from = area['start'] # _to = area['end'] # products = area['products'] knownclusterblast = record["modules"]["antismash.modules.clusterblast"][ "knowncluster" ]["results"][c] assert n_hits > 0 output_hits = [] for n, hits in enumerate(knownclusterblast["ranking"]): if n + 1 <= (n_hits): most_similar_mibig_id = hits[0]["accession"] most_similar_mibig_description = hits[0]["description"] most_similar_mibig_clustertype = hits[0]["cluster_type"] n_genes_in_target = len(hits[0]["tags"]) n_genes_hits = hits[1]["hits"] hit_similarity = n_genes_hits / n_genes_in_target output_hits.append( { "most_similar_known_cluster_id": most_similar_mibig_id, "most_similar_known_cluster_description": most_similar_mibig_description, "most_similar_known_cluster_type": most_similar_mibig_clustertype, "similarity": hit_similarity, } ) else: pass bgc_id = f"{record['id']}.region{str(c+1).zfill(3)}" output_cluster = { "genome_id": genome_id, "region": cluster_id, } for column in [ "accession", "start_pos", "end_pos", "contig_edge", "product", ]: output_cluster[column] = region_db[bgc_id][column] try: output_cluster["region_length"] = int(output_cluster["end_pos"]) - int( output_cluster["start_pos"] ) except ValueError: logging.warning( f'Error calculating region length. 
Region might be incomplete: {output_cluster["start_pos"]}:{output_cluster["end_pos"]}' ) start_pos = "".join( [s for s in output_cluster["start_pos"] if s.isdigit()] ) logging.warning( f'Correcting start position from {output_cluster["start_pos"]} to {start_pos}' ) output_cluster["start_pos"] = start_pos output_cluster["region_length"] = int(output_cluster["end_pos"]) - int( output_cluster["start_pos"] ) if len(output_hits) == 1: for k in output_hits[0].keys(): output_cluster[k] = output_hits[0][k] output[bgc_id] = output_cluster with open(outfile, "w") as f: logging.info(f"Writing results to {outfile}") json.dump(output, f, indent=2) return output if __name__ == "__main__": get_antismash_overview(sys.argv[1], sys.argv[2], sys.argv[3]) |
```python
import os
import sys

import pandas as pd


def get_ncbi_meta(assembly_report_path, outfile=None, genome_id=None):
    """
    Read NCBI assembly reports downloaded using ncbi-genome-download
    https://github.com/kblin/ncbi-genome-download and convert it to a json format.
    The assembly report can be downloaded by using assembly_report as --formats
    while running ncbi-genome-download.

    Parameters
    ----------
    1. assembly_report_path : str / path
        Location of the NCBI assembly report (assembly_report.txt)
    2. outfile : str / path
        Location of the output json file
    3. genome_id : str
        Genome accession id

    Returns
    -------
    1. outfile : .json file
        A json file summarizing the NCBI assembly report metadata.

    If genome_id is not provided, the script will determine the genome id from
    the file basename. If outfile is not defined, it will generate a json file
    in the current directory with this format: {genome_id}.json.
    """
    if genome_id is None:
        genome_id = os.path.splitext(os.path.basename(assembly_report_path))[0]
    if outfile is None:
        outfile = f"{genome_id}.json"

    # List of columns in df_ncbi_meta
    ncbi_meta_columns = [
        "assembly",
        "organism",
        "genus",
        "species",
        "strain",
        "tax_id",
        "refseq_category",
        "refseq",
        "genbank",
        "assembly_type",
        "release_type",
        "assembly_level",
        "genome_representation",
        "refseq_genbank_identity",
        "biosample",
        "submitter",
        "date",
    ]

    df_ncbi_meta = pd.DataFrame(index=[genome_id], columns=ncbi_meta_columns)
    df_ncbi_meta.index.name = "genome_id"

    with open(assembly_report_path, "r") as report_file:
        lines = report_file.readlines()
        strain_found = False
        for line in lines:
            if line.startswith("# Assembly name:"):
                assembly_accn = line.split("Assembly name:")[1].strip()
                df_ncbi_meta.loc[genome_id, "assembly"] = assembly_accn
            elif line.startswith("# Organism name"):
                organism = line.split("Organism name:")[1].strip()
                df_ncbi_meta.loc[genome_id, "organism"] = organism
                df_ncbi_meta.loc[genome_id, "genus"] = organism.strip().split(" ")[0]
                df_ncbi_meta.loc[genome_id, "species"] = organism.strip().split(" ")[1]
            elif line.startswith("# Infraspecific name:"):
                strain = line.split("Infraspecific name:")[1].strip()
                if "strain=" in strain:
                    df_ncbi_meta.loc[genome_id, "strain"] = strain.split("strain=")[1]
                    strain_found = True
            elif line.startswith("# Isolate:"):
                if strain_found is False:
                    strain = line.split("Isolate:")[1].strip()
                    df_ncbi_meta.loc[genome_id, "strain"] = strain
                    strain_found = True
            elif line.startswith("# Taxid"):
                tax_id = line.split("Taxid:")[1].strip()
                df_ncbi_meta.loc[genome_id, "tax_id"] = tax_id
            elif line.startswith("# BioSample"):
                biosample = line.split("BioSample:")[1].strip()
                df_ncbi_meta.loc[genome_id, "biosample"] = biosample
            elif line.startswith("# BioProject"):
                bioproject = line.split("# BioProject:")[1].strip()
                df_ncbi_meta.loc[genome_id, "BioProject"] = bioproject
            elif line.startswith("# Submitter"):
                submitter = line.split("Submitter:")[1].strip()
                df_ncbi_meta.loc[genome_id, "submitter"] = submitter
            elif line.startswith("# Date"):
                date = line.split("Date:")[1].strip()
                df_ncbi_meta.loc[genome_id, "date"] = date
            elif line.startswith("# Assembly type:"):
                assembly_type = line.split("Assembly type:")[1].strip()
                df_ncbi_meta.loc[genome_id, "assembly_type"] = assembly_type
            elif line.startswith("# Release type:"):
                release_type = line.split("Release type:")[1].strip()
                df_ncbi_meta.loc[genome_id, "release_type"] = release_type
            elif line.startswith("# Assembly level:"):
                assembly_level = line.split("Assembly level:")[1].strip()
                df_ncbi_meta.loc[genome_id, "assembly_level"] = assembly_level
            elif line.startswith("# Genome representation:"):
                genome_representation = line.split("Genome representation:")[1].strip()
                df_ncbi_meta.loc[
                    genome_id, "genome_representation"
                ] = genome_representation
            elif line.startswith("# RefSeq category"):
                refseq_category = line.split("RefSeq category:")[1].strip()
                df_ncbi_meta.loc[genome_id, "refseq_category"] = refseq_category
            elif line.startswith("# RefSeq assembly accession"):
                refseq = line.split("RefSeq assembly accession:")[1].strip()
                df_ncbi_meta.loc[genome_id, "refseq"] = refseq
            elif line.startswith("# GenBank assembly accession"):
                genbank = line.split("GenBank assembly accession:")[1].strip()
                df_ncbi_meta.loc[genome_id, "genbank"] = genbank
            elif line.startswith("# RefSeq assembly and GenBank assemblies identical"):
                refseq_genbank_identity = line.split(
                    "RefSeq assembly and GenBank assemblies identical:"
                )[1].strip()
                df_ncbi_meta.loc[
                    genome_id, "refseq_genbank_identity"
                ] = refseq_genbank_identity

    if strain_found is False:
        df_ncbi_meta.loc[genome_id, "strain"] = genome_id

    df_ncbi_meta.to_json(outfile, orient="index", indent=4)

    return None


if __name__ == "__main__":
    get_ncbi_meta(sys.argv[1], sys.argv[2], sys.argv[3])
```
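To inspect the metadata written by `get_ncbi_meta`, the per-genome JSON can be read back with pandas. The accession below is only an example:

```python
import pandas as pd

# example accession; any {genome_id}.json written by get_ncbi_meta works here
genome_id = "GCF_000000000.1"
df_meta = pd.read_json(f"{genome_id}.json", orient="index")
print(df_meta.loc[genome_id, ["organism", "genus", "species", "strain", "assembly_level"]])
```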
```python
import json
import logging
import sys
from pathlib import Path

from Bio import SeqIO

log_format = "%(levelname)-8s %(asctime)s %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def count_bgcs(gbk_file_path, genome_id=False, outfile=False):
    gbk_file_path = Path(gbk_file_path)
    if not type(genome_id) == str:
        genome_id = gbk_file_path.stem
    logging.debug(f"Summarize BGCs info for {gbk_file_path}")
    logging.debug(f"Genome id is: {genome_id}")
    records_list = SeqIO.parse(gbk_file_path, "genbank")
    bgc_stats = {}

    # statistics counter
    bgc_cntr = 0
    protoclusters_cntr = 0
    cand_clusters_cntr = 0
    contig_edge_cntr = 0

    # product capture
    bgc_type_dict = {}

    for rec in records_list:
        # Information on the number of BGCs, protoclusters and candidate clusters
        for feat in rec.features:
            if feat.type == "region":
                # get region counts
                bgc_cntr = bgc_cntr + 1
                if feat.qualifiers["contig_edge"][0] == "True":
                    contig_edge_cntr = contig_edge_cntr + 1
                # get product counts
                bgc_type = ".".join(sorted(feat.qualifiers["product"]))
                if bgc_type in bgc_type_dict.keys():
                    bgc_type_dict[bgc_type] = bgc_type_dict[bgc_type] + 1
                else:
                    bgc_type_dict[bgc_type] = 1
            if feat.type == "protocluster":
                protoclusters_cntr = protoclusters_cntr + 1
            if feat.type == "cand_cluster":
                cand_clusters_cntr = cand_clusters_cntr + 1

    bgc_stats["bgcs_count"] = bgc_cntr
    bgc_stats["bgcs_on_contig_edge"] = contig_edge_cntr
    bgc_stats["protoclusters_count"] = protoclusters_cntr
    bgc_stats["cand_clusters_count"] = cand_clusters_cntr

    result = {genome_id: bgc_stats | {"products": bgc_type_dict}}

    if not type(outfile) == str:
        return result
    else:
        logging.debug(f"Writing output to: {outfile}")
        with open(outfile, "w") as f:
            json.dump(result, f, indent=2)
        return


if __name__ == "__main__":
    count_bgcs(sys.argv[1], sys.argv[2], sys.argv[3])
```
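The dictionary returned (or written as JSON) by `count_bgcs` has one entry per genome. The values below are made up and only illustrate the shape; the keys follow the script above:

```python
import json

# illustrative values only; keys follow count_bgcs() above
example = {
    "GCF_000000000.1": {
        "bgcs_count": 12,
        "bgcs_on_contig_edge": 3,
        "protoclusters_count": 14,
        "cand_clusters_count": 13,
        "products": {"NRPS": 4, "terpene": 3, "NRPS.T1PKS": 1, "T3PKS": 2, "RiPP-like": 2},
    }
}
print(json.dumps(example, indent=2))
```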
```python
import sys
from pathlib import Path

import pandas as pd


def get_bigscape_mapping(path, outfile):
    """
    Given a path containing directories of antiSMASH result,
    return a table which maps BGC regions with its parent folder (aka accession ids)
    """
    container = {}
    for x in Path(path).iterdir():
        if x.is_dir():
            genome_id = x.name
            bgcs = {item.stem: genome_id for item in list(x.glob("*.region*.gbk"))}
            container.update(bgcs)

    df = pd.DataFrame.from_dict(container, orient="index", columns=["genome_id"])
    df.index.name = "bgc_id"
    df.to_csv(outfile)
    return None


if __name__ == "__main__":
    get_bigscape_mapping(sys.argv[1], sys.argv[2])
```
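The resulting mapping is a simple two-column CSV (`bgc_id`, `genome_id`). A minimal sketch for reading it back; the file name is a placeholder:

```python
import pandas as pd

# placeholder file name for the CSV written by get_bigscape_mapping()
df_map = pd.read_csv("bgc_to_genome_mapping.csv", index_col="bgc_id")
print(df_map.head())  # e.g. <contig>.region001 -> <genome_id>
print(df_map["genome_id"].value_counts().head())  # BGC regions per genome
```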
```python
import logging
import shutil
import sqlite3
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def get_bigslice_query(
    project_name, output_folder, result_folder="resources/bigslice/full_run_result/"
):
    bigslice_full_run = Path(result_folder)
    query_id = find_query_result(project_name, bigslice_full_run)
    fetch_query_tables(query_id, output_folder, bigslice_full_run)
    logging.info("Copying SQL database of the run...")
    shutil.copy(
        bigslice_full_run / f"reports/{query_id}/data.db",
        Path(output_folder) / f"{query_id}.db",
    )
    logging.info("Job done!")
    return


def find_query_result(project_name, bigslice_full_run):
    logging.info("Connecting to BiG-SLICE SQL Reports...")
    conn = sqlite3.connect(bigslice_full_run / "reports/reports.db")
    df = pd.read_sql_query("select * from reports;", conn)

    # select only run result of the project
    match = []
    for i in df.index:
        if project_name in df.loc[i, "name"]:
            match.append(i)
    df = df.loc[match, :]
    logging.debug(f"Found resulting match to project {project_name}:\n{df}")

    # select the latest run for the project
    df["creation_date"] = pd.to_datetime(df["creation_date"])
    filter_time = df.creation_date == df.creation_date.max()
    df = df.where(filter_time).dropna()
    logging.debug(f"Extracting information from the latest run:{df.name.values}")
    return int(df.id.values)


def fetch_query_tables(query_id, output_folder, bigslice_full_run):
    # Connect to database
    conn = sqlite3.connect(bigslice_full_run / f"reports/{query_id}/data.db")

    # get all table name from sql
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    sql_table_names = [i[0] for i in cursor.fetchall()]

    output_folder = Path(output_folder)
    try:
        output_folder.mkdir(parents=True, exist_ok=False)
    except FileExistsError:
        logging.debug(f"Directory {output_folder} already exist")
    else:
        logging.debug(f"Creating directory: {output_folder}...")

    logging.debug(f"Extracting tables to {output_folder}...")
    # convert table to pandas df
    df_tables = {}
    for t in sql_table_names:
        df = pd.read_sql_query(f"select * from {t};", conn)
        outfile = output_folder / f"{t}.csv"
        df.to_csv(outfile, index=False)
        df_tables.update({t: df})
    return


if __name__ == "__main__":
    get_bigslice_query(sys.argv[1], sys.argv[2], sys.argv[3])
```
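After `get_bigslice_query` finishes, every table of the BiG-SLiCE query database is exported as a CSV in the output folder. Table names depend on the BiG-SLiCE version, so the sketch below simply lists what was exported; the folder path is a placeholder:

```python
from pathlib import Path

import pandas as pd

# placeholder path; use the output_folder passed to get_bigslice_query()
out = Path("data/processed/my_project/bigslice/query_as/")
for csv in sorted(out.glob("*.csv")):
    df = pd.read_csv(csv)
    print(f"{csv.name}: {df.shape[0]} rows x {df.shape[1]} columns")
```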
```python
import json
import os
import sys

import pandas as pd


def get_checkm_data(checkm_input_stats, checkm_json_folder, checkm_output_stats):
    """
    Read checkm output stats.tsv file, return json output for individual genomes
    and output stats table

    Parameters
    ----------
    1. checkm_input_stats : str / path
        Location of the output from checkm software 'storage/bin_stats_ext.tsv'
    2. checkm_json_folder : str / path
        Location of the json folder to store output for each genome
    3. checkm_output_stats : str / path
        Location of the output csv file in the processed/tables directory

    Returns
    -------
    1. json : .json file
        A json file for each genome summarizing the checkm output
    2. checkm_output_stats : str / path
        Output csv file in the processed/tables directory with checkm information for all genomes
    """
    # List of columns in df_checkm_stats.csv
    checkm_columns = [
        "Completeness",
        "Contamination",
        "GC",
        "GC std",
        "Genome size",
        "# ambiguous bases",
        "# scaffolds",
        "# contigs",
        "Longest scaffold",
        "Longest contig",
        "N50 (scaffolds)",
        "N50 (contigs)",
        "Mean scaffold length",
        "Mean contig length",
        "Coding density",
        "Translation table",
        "# predicted genes",
    ]

    df_checkm_input = pd.read_csv(
        checkm_input_stats, sep="\t", header=None, index_col=0
    )
    df_checkm_out = pd.DataFrame(index=df_checkm_input.index, columns=checkm_columns)
    df_checkm_out.index.name = "genome_id"

    for genome_id in df_checkm_out.index:
        json_genome_id = json.loads(df_checkm_input.loc[genome_id, 1].replace("'", '"'))
        json_path = os.path.join(checkm_json_folder, genome_id + ".json")
        with open(json_path, "w") as outfile:
            json.dump(json_genome_id, outfile)
        for col in df_checkm_out.columns:
            df_checkm_out.loc[genome_id, col] = json_genome_id[col]

    df_checkm_out.to_csv(checkm_output_stats)

    return df_checkm_out


if __name__ == "__main__":
    get_checkm_data(sys.argv[1], sys.argv[2], sys.argv[3])
```
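A common follow-up is to filter genomes on the CheckM columns collected above. The thresholds below are illustrative, not BGCFlow defaults, and the file name is a placeholder:

```python
import pandas as pd

# placeholder file name for the table written by get_checkm_data()
df = pd.read_csv("df_checkm_stats.csv", index_col="genome_id")

# illustrative thresholds, not BGCFlow defaults
good = df[(df["Completeness"] >= 90) & (df["Contamination"] <= 5)]
print(f"{len(good)}/{len(df)} genomes pass the illustrative quality filter")
```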
```python
import json
import sys

import yaml

# list of the main dependencies used in the workflow
dependencies = {
    "antismash": r"workflow/envs/antismash.yaml",
    "prokka": r"workflow/envs/prokka.yaml",
    "mlst": r"workflow/envs/mlst.yaml",
    "eggnog-mapper": r"workflow/envs/eggnog.yaml",
    "roary": r"workflow/envs/roary.yaml",
    "refseq_masher": r"workflow/envs/refseq_masher.yaml",
    "seqfu": r"workflow/envs/seqfu.yaml",
}


def get_dependency_version(dep, dep_key):
    """
    return dependency version tags given a dictionary (dep) and its key (dep_key)
    """
    with open(dep[dep_key]) as file:
        result = []
        documents = yaml.full_load(file)
        for i in documents["dependencies"]:
            if type(i) == str:
                if i.startswith(dep_key):
                    result = i.split("=")[-1]
            elif type(i) == dict:
                assert list(i.keys()) == ["pip"], i.keys()
                for p in i["pip"]:
                    if dep_key in p:
                        if p.startswith("git+"):
                            result = p.split("@")[-1]
                            result = result.replace("-", ".")
                        else:
                            result = p.split("=")[-1]
    return str(result)


def write_dependecies_to_json(outfile, dep=dependencies):
    """
    write dependency version to a json file
    """
    with open(outfile, "w") as file:
        dv = {}
        for ky in dep.keys():
            vr = get_dependency_version(dep, ky)
            dv[ky] = vr
        json.dump(
            dv,
            file,
            indent=2,
        )
        file.close()
    return dv


if __name__ == "__main__":
    write_dependecies_to_json(sys.argv[1])
```
Script: `data/get_dependencies.py`
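The JSON written by `write_dependecies_to_json` maps each tool in the `dependencies` dict to the version pinned in its conda environment file. A minimal sketch for reading it back; the output path is a placeholder:

```python
import json

# placeholder path; use the outfile passed to write_dependecies_to_json()
with open("dependency_versions.json") as f:
    versions = json.load(f)

# one entry per tool defined in the `dependencies` dict above
for tool, version in versions.items():
    print(f"{tool}\t{version}")
```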
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 | import json import logging import os import sys import pandas as pd log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def extract_mibig_info(mibig_json_path, mibig_bgc_table): """ Returns mibig BGC information critical for the known bgc table from JSON files Parameters ---------- 1. mibig_json_path : str / path Location of the resources folder with all MIBIG JSON files extracted Returns ------- 1. mibig_bgc_table: str / path Dataframe with MIBIG BGCs information 2. mibig_compound_table: str / path Dataframe with compounds information found at MIBIG """ logging.info("Reading MIBIG files...") if not os.path.isdir(mibig_json_path): raise FileNotFoundError(f"No such file or directory: {mibig_json_path}") df_mibig_bgcs = pd.DataFrame(columns=["biosyn_class", "compounds"]) # df_mibig_compounds = pd.DataFrame(columns=[]) mibig_list = [ mibig_file.split(".")[0] for mibig_file in os.listdir(mibig_json_path) if ".json" in mibig_file ] logging.debug(f"MIBIG entry has {len(mibig_list)} BGCs") for mibig_id in mibig_list: mibig_id_file = os.path.join(mibig_json_path, mibig_id + ".json") with open(mibig_id_file, "r") as json_obj: mibig_data = json.load(json_obj) df_mibig_bgcs.loc[mibig_id, "biosyn_class"] = ";".join( mibig_data.get("cluster").get("biosyn_class") ) compounds_list = mibig_data.get("cluster").get("compounds") df_mibig_bgcs.loc[mibig_id, "compounds"] = ";".join( [compound.get("compound") for compound in compounds_list] ) chem_acts_list = [] for compound in mibig_data.get("cluster").get("compounds"): if "chem_acts" in compound.keys(): chem_acts = compound.get("chem_acts") for activity in chem_acts: if activity not in chem_acts_list: chem_acts_list.append(activity) # logging.debug(f"{chem_acts_list}") # handle MIBIG 2.0 --> MIBIG 3.1 format difference chem_acts_list_clean = [] for item in chem_acts_list: if type(item) is dict: action = list(item.values()) for a in action: chem_acts_list_clean.append(a) elif type(item) is str: chem_acts_list_clean.append(item) # logging.debug(f"{chem_acts_list_clean}") df_mibig_bgcs.loc[mibig_id, "chem_acts"] = ";".join(chem_acts_list_clean) if "accession" in mibig_data.get("cluster").get("loci").keys(): df_mibig_bgcs.loc[mibig_id, "accession"] = ( mibig_data.get("cluster").get("loci").get("accession") ) if "completeness" in mibig_data.get("cluster").get("loci").keys(): df_mibig_bgcs.loc[mibig_id, "completeness"] = ( mibig_data.get("cluster").get("loci").get("completeness") ) if "evidence" in mibig_data.get("cluster").get("loci").keys(): df_mibig_bgcs.loc[mibig_id, "evidence"] = ";".join( mibig_data.get("cluster").get("loci").get("evidence") ) if "organism_name" in mibig_data.get("cluster").keys(): df_mibig_bgcs.loc[mibig_id, "organism_name"] = mibig_data.get( "cluster" ).get("organism_name") if "ncbi_tax_id" in mibig_data.get("cluster").keys(): df_mibig_bgcs.loc[mibig_id, "ncbi_tax_id"] = mibig_data.get( "cluster" ).get("ncbi_tax_id") if "publications" in mibig_data.get("cluster").keys(): df_mibig_bgcs.loc[mibig_id, "publications"] = ";".join( mibig_data.get("cluster").get("publications") ) 
df_mibig_bgcs.sort_index(inplace=True) df_mibig_bgcs.index.name = "mibig_id" df_mibig_bgcs.to_csv(mibig_bgc_table) return df_mibig_bgcs if __name__ == "__main__": extract_mibig_info(sys.argv[1], sys.argv[2]) |
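The MIBiG summary table can then be queried directly, for example to count biosynthetic classes. The path below follows the `df_mibig_bgcs.csv` naming used elsewhere in the workflow but should be treated as a placeholder:

```python
import pandas as pd

# placeholder path following the df_mibig_bgcs.csv naming used above
df_mibig = pd.read_csv("resources/mibig/df_mibig_bgcs.csv", index_col="mibig_id")

# biosyn_class is ";"-separated, so split before counting
print(df_mibig["biosyn_class"].str.split(";").explode().value_counts().head())
```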
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 | import os import sys import pandas as pd def extract_org_info(genome_id, samples_path, assembly_report_path, prokka_dir): """ Returns organism_info.txt with genus, species, and strain information required for prokka run inputs. This function returns values from provided sample.csv or extracted NCBI assembly reports in json format generated by get_assembly_information.py for "ncbi" source type. Parameters ---------- 1. genome_id : str Genome accession id 2. samples_path : str / path (separated by space for multiple paths) Location of the csv file (s) containing sample information as defined in samples.schema.yaml. For multiple samples csv, write the paths within double tick ("") and separate each paths with space. 3. assembly_report_path : str / path Location of the assembly information json file generated by get_assembly_information.py 4. prokka_dir : str / path Location of the prokka directory from the given NCBI assembly id Returns ------- 1. {prokka_dir}/organism_info.txt A one lined text file with genus, species, and strain information of a given genome_id """ # wrap single or multiple inputs & generate dataframe shell_input = samples_path.split() dfList = [pd.read_csv(s).set_index("genome_id", drop=False) for s in shell_input] df_samples = pd.concat(dfList, axis=0).fillna("") if df_samples.loc[genome_id, "source"] == "ncbi": # assembly_file = os.path.join(assembly_report_path, f"{genome_id}.txt") extract_ncbi_org_info(prokka_dir, genome_id, assembly_report_path) else: extract_samples_org_info(prokka_dir, genome_id, df_samples) return None def extract_ncbi_org_info(prokka_dir, genome_id, assembly_report_path): """ Returns organism_info.txt with genus, species, and strain information required for prokka run inputs. This function returns values from extracted NCBI assembly reports in json format generated by get_assembly_information.py. Parameters ---------- 1. prokka_dir : str / path Location of the prokka directory from the given NCBI assembly id 2. genome_id : str NCBI assembly id (GCA... / GCF...) 3. assembly_report_path : str / path Location of the assembly information json file generated by get_assembly_information.py Returns ------- 1. {prokka_dir}/organism_info.txt A one lined text file with genus, species, and strain information of a given NCBI assembly id """ assembly_report_json = os.path.join(assembly_report_path, f"{genome_id}.json") df_ncbi_meta = pd.read_json(assembly_report_json).T df_ncbi_meta = df_ncbi_meta.fillna("") for idx in df_ncbi_meta.index: GENUS = str(df_ncbi_meta.loc[idx, "genus"]) SPECIES = str(df_ncbi_meta.loc[idx, "species"]) STRAIN_ID = str(df_ncbi_meta.loc[idx, "strain"]) if not os.path.isdir(os.path.join(prokka_dir, idx)): os.mkdir(os.path.join(prokka_dir, idx)) org_info_path = os.path.join(prokka_dir, idx, "organism_info.txt") with open(org_info_path, "w") as file_obj: file_obj.write(",".join([GENUS, SPECIES, STRAIN_ID])) return None def extract_samples_org_info(prokka_dir, genome_id, df_samples): """ Returns organism_info.txt with genus, species, and strain information required for prokka run inputs. This function returns values from provided sample.csv excluding "ncbi" source type. Parameters ---------- 1. 
prokka_dir : str / path Location of the prokka directory from the given NCBI assembly id 2. genome_id : str Genome accession id 3. df_sample : pd.DataFrame object A dataframe generated by the csv file containing sample information as defined in samples.schema.yaml Returns ------- 1. {prokka_dir}/organism_info.txt A one lined text file with genus, species, and strain information of a given genome_id """ GENUS = str(df_samples.loc[genome_id, "genus"]) SPECIES = str(df_samples.loc[genome_id, "species"]) STRAIN_ID = str(df_samples.loc[genome_id, "strain"]) if not os.path.isdir(os.path.join(prokka_dir, genome_id)): os.mkdir(os.path.join(prokka_dir, genome_id)) org_info_path = os.path.join(prokka_dir, genome_id, "organism_info.txt") with open(org_info_path, "w") as file_obj: file_obj.write(",".join([GENUS, SPECIES, STRAIN_ID])) return None if __name__ == "__main__": extract_org_info(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4]) |
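`organism_info.txt` is a single comma-separated line (genus, species, strain) that downstream rules pass to Prokka. A minimal sketch for reading it; the path is a placeholder:

```python
# placeholder path to an organism_info.txt written by extract_org_info()
with open("organism_info.txt") as fh:
    genus, species, strain = fh.read().strip().split(",")

print(f"Genus: {genus} | Species: {species} | Strain: {strain}")
```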
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 | import json import logging import sys from pathlib import Path import peppy import yaml log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def refine_bgcflow_project(p_bgcflow, p): """ Refine a pep bgcflow project created from sample table. Exist for back compatibility with bgcflow=<0.3.3 Arguments: p_bgcflow: p: Returns: p_bgcflow: """ for k in p.keys(): if k == "samples": p_bgcflow.config["sample_table"] = p[k] p_bgcflow["_config_file"] = str(Path(p[k]).parent / "project_config.yaml") elif k in p_bgcflow.keys(): p_bgcflow[k] = p[k] elif k == "rules": with open(p["rules"], "r") as file: rules = yaml.safe_load(file) p_bgcflow.config[k] = rules["rules"] else: p_bgcflow.config[k] = Path(p[k]).name return p_bgcflow def get_bgcflow_metadata(bgcflow_path="."): # BGCFlow config BGCFlow_path = Path(bgcflow_path).resolve() logging.info("Getting config metadata...") config_path = BGCFlow_path / "config/config.yaml" with open(config_path, "r") as f: config = yaml.safe_load(f) logging.info("Getting rules information...") # rules_dict rules_dict_path = BGCFlow_path / "workflow/rules.yaml" with open(rules_dict_path, "r") as f: rules_dict = yaml.safe_load(f) return BGCFlow_path, config, rules_dict def get_all_metadata(config, rules_dict, BGCFlow_path, bgcflow_version): # Get Metadata logging.info("Getting metadata from projects...") project_metadata = {} for p in config["projects"]: # mask pep and name as alias if "pep" in p.keys(): p["name"] = p.pop("pep") # process only pep projets if p["name"].endswith(".yaml"): project = peppy.Project( str(BGCFlow_path / p["name"]), sample_table_index="genome_id" ) if "description" in project.config.keys(): name = project["name"] description = project.config["description"] project_metadata[name] = {"description": description} else: # non peppy input name = p["name"] project = peppy.Project( str(BGCFlow_path / p["samples"]), sample_table_index="genome_id" ) p = refine_bgcflow_project(project, p) project_metadata[name] = {"description": "No description provided."} # get what rules are being used rule_used = {} if "rules" in project.config.keys(): rules = project.config["rules"] else: rules = config["rules"] for r in rules.keys(): if rules[r]: if r in rules_dict.keys(): bgcflow_rules = rules_dict[r] rule_used[r] = bgcflow_rules project_metadata[name].update({"rule_used": rule_used}) # get sample size project_metadata[name]["sample_size"] = len(project.sample_table) # get citations citation_all = [] for r in rule_used: citations = rules_dict[r]["references"] citation_all.extend(citations) citation_all.sort() project_metadata[name].update({"references": citation_all}) project_metadata[name]["references"] = list( set(project_metadata[name]["references"]) ) # get bgcflow_version project_metadata[name]["bgcflow_version"] = bgcflow_version return project_metadata def get_project_metadata( project_name, outfile, bgcflow_path=".", bgcflow_version="unknown" ): BGCFlow_path, config, rules_dict = get_bgcflow_metadata(bgcflow_path) all_metadata = 
get_all_metadata(config, rules_dict, BGCFlow_path, bgcflow_version) logging.info(f"Extracting project {project_name} metadata to {outfile}") with open(outfile, "w") as f: json.dump({project_name: all_metadata[project_name]}, f, indent=2) return if __name__ == "__main__": get_project_metadata(sys.argv[1], sys.argv[2], bgcflow_version=sys.argv[3]) |
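The per-project metadata JSON produced here has the shape sketched below. The project name matches the Quick Start example, but all values are made up:

```python
import json

# made-up values; keys follow get_all_metadata() above
example = {
    "Lactobacillus_delbrueckii": {
        "description": "No description provided.",
        "rule_used": {"seqfu": {"...": "..."}, "mash": {"...": "..."}},
        "sample_size": 4,
        "references": ["..."],
        "bgcflow_version": "unknown",
    }
}
print(json.dumps(example, indent=2))
```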
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 | import json import logging import os import sys import pandas as pd import requests log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def gtdb_prep( genome_id, outfile, samples_table, tax_path, release="R207" ): # what happen if it does not find? """ Given a genome id and the samples Pandas dataframe, write a JSON file containing taxonomic information from GTDB API. The script will first search using the closest taxonomic placement (NCBI accession id), then using the genus information provided by user. If no information is provided, return an empty taxonomic information. """ class EmptyTaxError(Exception): """Raised when this script returns empty dict""" pass class PlacementError(Exception): """Raised when this script returns empty dict""" pass def empty_result(genome_id): """ Helper script for creating empty result """ logging.info( f"No taxonomic information found, returning empty values for {genome_id}." ) gtdb_tax = {} gtdb_tax["genome_id"] = genome_id gtdb_tax["gtdb_taxonomy"] = { "domain": "d__", "phylum": "p__", "class": "c__", "order": "o__", "family": "f__", "genus": "g__", "species": "s__", } return gtdb_tax def find_taxonomy(query, genome_id, gtdb_tax): """ Helper script to decide taxonomic placement for a given query """ # If closest placement reference is provided, try finding taxonomic information from GTDB API if query.closest_placement_reference.values[0] != "": try: logging.info( "Inferring taxonomic placement from provided closest reference...." ) gtdb_tax = get_ncbi_taxon_GTDB( query.closest_placement_reference.values[0], release ) gtdb_tax["genome_id"] = genome_id except KeyError as e: raise PlacementError( f"Cannot infer taxonomic placement from provided closest reference. 
Make sure the accession id: {query.closest_placement_reference.values[0]} is part of GTDB release: {e}" ) # If NCBI accession is provided, try to find taxonomic information from GTDB API elif query.source.values[0] == "ncbi": try: logging.info("Inferring taxonomic placement from NCBI accession....") gtdb_tax = get_ncbi_taxon_GTDB(query.genome_id.values[0], release) except KeyError: if query.genus.values[0] != "": logging.info( "Inferring taxonomic placement from provided genus information...." ) gtdb_tax["genome_id"] = genome_id gtdb_tax.update( get_parent_taxon_GTDB(query.genus.values[0], "genus", release) ) gtdb_tax["gtdb_taxonomy"][ "species" ] = f"s__{gtdb_tax['gtdb_taxonomy']['genus'].split('__')[-1]} sp." else: gtdb_tax = empty_result(genome_id) # Try to get taxonomic information from genus information elif query.genus.values[0] != "": logging.info( "Inferring taxonomic placement from provided genus information...." ) gtdb_tax["genome_id"] = genome_id gtdb_tax.update( get_parent_taxon_GTDB(query.genus.values[0], "genus", release) ) try: gtdb_tax["gtdb_taxonomy"][ "species" ] = f"s__{gtdb_tax['gtdb_taxonomy']['genus'].split('__')[-1]} sp." except KeyError: gtdb_tax = empty_result(genome_id) # If no information is found, return an empty dict else: gtdb_tax = empty_result(genome_id) return gtdb_tax # get query by subsetting samples df with genome id shell_input = samples_table.split() dfList = [pd.read_csv(s).set_index("genome_id", drop=False) for s in shell_input] df_samples = pd.concat(dfList, axis=0) query = df_samples[df_samples.loc[:, "genome_id"] == genome_id].fillna("") # create empty container gtdb_tax = {} # Starting process logging.info(f"Fetching taxonomic information for {genome_id}...") # Go through user provided taxonomic placement if any(os.path.isfile(t) for t in tax_path.split()): try: gtdb_tax = get_user_defined_classification(genome_id, tax_path) except KeyError: logging.warning( f"{genome_id}: Not found in user provided taxonomic placement..." ) gtdb_tax = find_taxonomy(query, genome_id, gtdb_tax) else: gtdb_tax = find_taxonomy(query, genome_id, gtdb_tax) if gtdb_tax == {}: raise EmptyTaxError( "Oops, this shouldn't happen. It returns an empty dict. Something is wrong with the script." ) logging.info(f"Writing results to {outfile}...") with open(outfile, "w") as file: json.dump(gtdb_tax, file, indent=2) file.close logging.info("Job finished!") return def get_user_defined_classification(genome_id, tax_path): """ Get taxonomic information from user provided GTDB-like output """ shell_input = tax_path.split() dfList = [ pd.read_csv(s, sep="\t").set_index("user_genome", drop=False) for s in shell_input ] df_tax_raw = pd.concat(dfList, axis=0) # drop duplicates! causes error logging.debug(f"Checking user provided taxonomy table from: {shell_input}") logging.info( "Checking if user provided taxonomy table contains duplicated genome ids.." ) df_tax_dup = df_tax_raw["user_genome"].duplicated() if df_tax_dup.any(): logging.warning( f"Found duplicated genome ids: {list(df_tax_raw[df_tax_dup]['user_genome'].unique())}" ) duplicates_mask = df_tax_raw.duplicated(keep="first") df_tax = df_tax_raw[~duplicates_mask].copy() logging.debug("Making sure duplicated genome ids values are identical...") assert ( not df_tax.duplicated().any() ), "Two or more genome ids have more than one taxonomic placement! Please check your taxonomy files!" 
else: df_tax = df_tax_raw.copy() level_dict = { "d": "domain", "p": "phylum", "c": "class", "o": "order", "f": "family", "g": "genus", "s": "species", } query = df_tax.loc[genome_id, "classification"].split(";") result = {} result["genome_id"] = genome_id result["gtdb_url"] = "user provided classification" result["gtdb_release"] = "unknown" result["gtdb_taxonomy"] = {level_dict[q.split("__")[0]]: q for q in query} logging.info("Using user provided GTDB classification.") return result def get_ncbi_taxon_GTDB(accession, release="R207"): """ Given an NCBI accession, return a json object of taxonomic information from GTDB API """ def gtdb_api_request(accession, api_type): if api_type == "taxonomy": api_url = ( f"https://api.gtdb.ecogenomic.org/genome/{accession}/taxon-history" ) elif api_type == "summary": api_url = f"https://api.gtdb.ecogenomic.org/genome/{accession}/card" response = requests.get(api_url) try: js = response.json() except json.JSONDecodeError: logging.critical( f"Cannot decode response from GTDB API. Make sure this is a valid url: {api_url}" ) raise return js, api_url # Mapping to bgcflow format level_dict = { "d": "domain", "p": "phylum", "c": "class", "o": "order", "f": "family", "g": "genus", "s": "species", } # get taxonomic information js_tax, api_url_tax = gtdb_api_request(accession, "taxonomy") result = {} result["genome_id"] = accession result["gtdb_url"] = api_url_tax result["gtdb_release"] = release card_detail = "Genome found" try: if js_tax == []: logging.warning( f"Genome id: {accession} is in GTDB but has no taxonomic placement. Returning empty values." ) result["gtdb_taxonomy"] = { level_dict[k]: f"{k}__" for k in level_dict.keys() } card_detail = ( f"Genome not found - no taxonomic assignment in release {release}" ) else: logging.info(js_tax) tax = [tax for tax in js_tax if tax["release"] == release] if len(tax) == 1: result["gtdb_taxonomy"] = tax[0] elif len(tax) == 0: logging.warning( f"Genome id: {accession} is in GTDB but has no taxonomic placement in release {release}. Returning empty values." ) result["gtdb_taxonomy"] = { level_dict[k]: f"{k}__" for k in level_dict.keys() } card_detail = ( f"Genome not found - no taxonomic assignment in release {release}" ) else: raise result["gtdb_taxonomy"].pop("release", None) result["gtdb_taxonomy"] = { level_dict[k]: result["gtdb_taxonomy"][k] for k in result["gtdb_taxonomy"].keys() } except KeyError as err: if err.args[0] == "gtdb_taxonomy": logging.critical( f"Malformed genome id: {accession}. Make sure to use the right NCBI genome accession format." ) raise elif err.args[0] == release: logging.critical(f"Cannot find genome id: {accession} in GTDB API.") raise # get other metadata from genome summary js_sum, api_url_sum = gtdb_api_request(accession, "summary") result["metadata_url"] = api_url_sum result["metadata"] = js_sum if "detail" in result["metadata"].keys(): pass else: result["metadata"]["detail"] = card_detail return result def get_parent_taxon_GTDB(taxon, level, release="R207"): """ Given a taxon and its level, return a json object of parent taxons from GTDB API """ level_dict = { "domain": "d", "phylum": "p", "class": "c", "order": "o", "family": "f", "genus": "g", "species": "s", } try: query = f"{level_dict[level]}__{taxon}" except KeyError: logging.critical( f"Incorrect taxon level format. Please choose from available format: {list(level_dict.keys())}." 
) raise api_url = f"https://api.gtdb.ecogenomic.org/taxonomy/partial/{query}/all-releases" response = requests.get(api_url) js = response.json() result = {} result["gtdb_url"] = api_url result["gtdb_release"] = release result["gtdb_taxonomy"] = {} level_dict_rev = { "d": "domain", "p": "phylum", "c": "class", "o": "order", "f": "family", "g": "genus", "s": "species", } try: for item in js: if release in item["release"]: t = item["taxonomy"] taxonomy = {level_dict_rev[k]: t[k] for k in t.keys()} result["gtdb_taxonomy"] = taxonomy except TypeError: logging.critical(f"{js['message']}") raise return result if __name__ == "__main__": gtdb_prep(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5]) |
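The same GTDB API endpoint used by `gtdb_prep` can be queried interactively. The accession is only an example, and the field names are inferred from what the script above expects, so treat this as a sketch:

```python
import requests

accession = "GCF_000196035.1"  # example accession
url = f"https://api.gtdb.ecogenomic.org/genome/{accession}/taxon-history"
history = requests.get(url).json()

# each entry is assumed to carry a GTDB release plus the d/p/c/o/f/g/s ranks used above
for entry in history:
    print(entry.get("release"), entry.get("g"), entry.get("s"))
```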
```python
import json
import logging
import sys
from pathlib import Path

log_format = "%(levelname)-8s %(asctime)s %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def assess_gtdb_json_file(item):
    """
    Given a json file, assess whether the accession can be found via GTDB-API or not.
    Return a genome_id when accession cannot be found.
    """
    logging.info(f"Assessing {item}")
    with open(item, "r") as json_file:
        data = json.load(json_file)
        genome_id = data["genome_id"]
        try:
            gtdb_release = data["gtdb_release"]
            metadata = data["metadata"]
            if "Genome not found" in metadata["detail"]:
                logging.debug(
                    f" - {genome_id} : {metadata['detail']} in GTDB-API release {gtdb_release}"
                )
                return genome_id
            elif type(metadata["genome"]["accession"]) == str:
                logging.debug(
                    f" - {genome_id} can be found via GTDB-API release {gtdb_release}"
                )
                return None
        except KeyError:
            logging.debug(f" - {genome_id} does not have metadata")
            return genome_id


def generate_symlink_gtdbtk(input_fna, gtdb_json, outdir):
    """
    Given a gtdb_json file and an input_fna file, generate a symlink to a desired
    location if genome_id cannot be found via GTDB API
    """
    input_fna = Path(input_fna).resolve()
    gtdb_json = Path(gtdb_json).resolve()
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    assert input_fna.is_file() and gtdb_json.is_file()
    genome_id = assess_gtdb_json_file(gtdb_json)
    if genome_id is not None:
        outfile = outdir / f"{genome_id}.fna"
        logging.info(f"Generating input files for GTDB-tk: {outfile}")
        outfile.symlink_to(input_fna)
    return None


def input_handling(input_list, category, suffix=".json"):
    input_list = Path(input_list)
    if input_list.is_file() and input_list.suffix == suffix:
        logging.info(f"Getting {category} from a single file: {input_list}")
        input_list_files = input_list
    elif input_list.is_file() and input_list.suffix == ".txt":
        logging.info(f"Getting {category} from a text file: {input_list}")
        with open(input_list, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                logging.info(
                    " - Detecting space-separated input in a single line format."
                )
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                logging.info(" - Detecting input in a multi-line format.")
                paths = file_content
            input_list_files = [
                Path(path) for path in paths if Path(path).suffix == suffix
            ]
    else:
        input_list_files = [
            Path(file)
            for file in str(input_list).split()
            if Path(file).suffix == suffix
        ]
        logging.info(
            f"Getting {category} from the given list of {len(input_list_files)} files..."
        )
    return input_list_files


def gtdbtk_prep(fna_list, json_list, outdir):
    """
    Given a list of gtdb_json file and an list of fna, generate a symlinks to a desired
    location if genome_id cannot be found via GTDB API
    """
    shell_json_input = input_handling(json_list, "taxonomy json")
    shell_fna_input = input_handling(fna_list, "fna files", suffix=".fna")

    for gtdb_json in shell_json_input:
        gid = Path(gtdb_json).stem
        input_fna = [fna for fna in shell_fna_input if gid in Path(fna).stem]
        logging.info(
            f"Found {gid} in {[str(i) for i in input_fna]}. Generating symlink..."
        )
        generate_symlink_gtdbtk(str(input_fna[0]), str(gtdb_json), str(outdir))
    return


if __name__ == "__main__":
    gtdbtk_prep(sys.argv[1], sys.argv[2], sys.argv[3])
```
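To see which genomes were handed over to GTDB-Tk (i.e. those without a GTDB API placement), list the symlinks created in the output directory. The path below is a placeholder:

```python
from pathlib import Path

# placeholder path; use the outdir passed to gtdbtk_prep()
outdir = Path("data/interim/gtdbtk/fasta/")
for fna in sorted(outdir.glob("*.fna")):
    print(fna.name, "->", fna.resolve())
```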
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 | import logging import os import sys from collections import OrderedDict from datetime import datetime from pathlib import Path import networkx as nx import pandas as pd from alive_progress import alive_bar log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def get_cluster_dataframes( df_genomes, df_nodes, mibig_bgc_table, as_dir="../data/antismash/" ): """ Returns two dataframes of clusters with information from genomes and MIBIG """ df_clusters = pd.DataFrame( columns=["product", "bigscape_class", "genome_id", "accn_id"] ) df_known = pd.DataFrame( columns=[ "product", "bigscape_class", "biosyn_class", "compounds", "chem_acts", "accession", "completeness", "organism_name", "ncbi_tax_id", "publications", "evidence", ] ) df_mibig_bgcs = pd.read_csv(mibig_bgc_table, index_col="mibig_id") # Generate bgcs dataframe with metadata from df_nodes and df_genomes logging.info( "Generating bgcs dataframe with metadata from df_nodes and df_genomes..." 
) with alive_bar(len(df_genomes.index)) as bar: for genome_id in df_genomes.index: logging.info(f"Processing BGCs in the genome: {genome_id}") if genome_id in os.listdir(as_dir): genome_dir = os.path.join(as_dir, genome_id) bgc_id_list = [ region[:-4] for region in os.listdir(genome_dir) if "region0" in region ] for bgc_id in bgc_id_list: logging.debug(f"Processing {bgc_id}") if bgc_id in df_nodes.index: df_clusters.loc[bgc_id, "genome_id"] = genome_id df_clusters.loc[bgc_id, "product"] = df_nodes.loc[ bgc_id, "Product Prediction" ] df_clusters.loc[bgc_id, "bigscape_class"] = df_nodes.loc[ bgc_id, "BiG-SCAPE class" ] df_clusters.loc[bgc_id, "accn_id"] = df_nodes.loc[ bgc_id, "Accession ID" ] else: logging.debug(f"{bgc_id} not in df_nodes") else: logging.warning(f"{genome_id} not in directory!") bar() # Generate separate table for known BGCs from MIBIG logging.info("Generating separate table for known BGCs from MIBIG") with alive_bar(len(df_nodes.index)) as bar: for bgc_id in df_nodes.index: if "BGC0" in bgc_id: df_known.loc[bgc_id, "product"] = df_nodes.loc[ bgc_id, "Product Prediction" ] df_known.loc[bgc_id, "bigscape_class"] = df_nodes.loc[ bgc_id, "BiG-SCAPE class" ] if "." in bgc_id: mibig_id = bgc_id.split(".")[0] else: mibig_id = bgc_id if mibig_id in df_mibig_bgcs.index: for col in df_mibig_bgcs.columns: df_known.loc[bgc_id, col] = df_mibig_bgcs.loc[mibig_id, col] else: logging.debug(f"{bgc_id} not in df_mibig_bgcs dataframe index") bar() return df_known, df_clusters def add_bigscape_families(df_clusters, df_known, net_data_path): """ Adds GCC and GCF numbers detected by BiG-SCAPE clustering for different cut-offs """ cluster_class_set = [ cluster_class for cluster_class in os.listdir(net_data_path) if ".tsv" not in cluster_class ] for cluster_class in cluster_class_set: logging.info(f"Processing all BGCs from {cluster_class}") class_dir = os.path.join(net_data_path, cluster_class) gcf_cutoffs_files = [ file for file in os.listdir(class_dir) if "_clustering_" in file ] for gcf_file in gcf_cutoffs_files: cutoff = gcf_file[-8:-4] gcf_path = os.path.join(class_dir, gcf_file) df_clusters = read_gcf_df(df_clusters, gcf_path, cutoff) df_known = read_gcf_df(df_known, gcf_path, cutoff) clan_files = [file for file in os.listdir(class_dir) if "_clans_" in file] if len(clan_files) == 1: clan_select = clan_files[0] clan_path = os.path.join(class_dir, clan_select) df_clusters = read_gcc_df(df_clusters, clan_path) df_known = read_gcc_df(df_known, clan_path) return df_clusters, df_known def read_gcf_df(df_clusters, gcf_path, cutoff): """ Adds GCF (Gene Cluster Family) number for each BGC """ df_gcf = pd.read_csv(gcf_path, sep="\t", index_col="#BGC Name", dtype=str) col_name = "gcf_" + cutoff for bgc in df_clusters.index: if bgc in df_gcf.index: df_clusters.loc[bgc, col_name] = df_gcf.loc[bgc, "Family Number"] return df_clusters def read_gcc_df(df_clusters, clan_path): """ Adds GCC (Gene Cluster Clan) number for each BGC """ df_gcc = pd.read_csv(clan_path, sep="\t", index_col="#BGC Name", dtype=str) col_name = "Clan Number" for bgc in df_clusters.index: if bgc in df_gcc.index: df_clusters.loc[bgc, col_name] = df_gcc.loc[bgc, "Clan Number"] return df_clusters def get_bigscape_network(net_data_path, cutoff="0.30"): """ Reads similarity network for a particular to a dataframe """ cluster_class_set = [ cluster_class for cluster_class in os.listdir(net_data_path) if ".tsv" not in cluster_class ] df_network = pd.DataFrame() for cluster_class in cluster_class_set: class_dir = os.path.join(net_data_path, 
cluster_class) net_file = cluster_class + "_c" + cutoff + ".network" net_path = os.path.join(class_dir, net_file) if os.path.isfile(net_path): df_class_net = pd.read_csv(net_path, sep="\t") df_network = pd.concat([df_network, df_class_net], ignore_index=True) return df_network def get_network_graph(df_network, weight="Raw distance"): """ Returns networkX graph for a given network """ G_clusters = nx.from_pandas_edgelist( df_network, "Clustername 1", "Clustername 2", weight ) G_families = nx.connected_components(G_clusters) family_nodes = [c for c in sorted(G_families, key=len, reverse=True)] return G_clusters, family_nodes def remove_single_mibig(df_network, df_known, family_nodes): """ Removes singleton MIBIG BGCs from the network """ logging.info("Removing singleton MIBIG BGCs from the network") nodes_to_remove = [] for fam in family_nodes: single_mibig = True for node in fam: if "BGC" not in node: single_mibig = False if single_mibig: for node in fam: if node not in nodes_to_remove: nodes_to_remove.append(node) logging.debug( f"{len(nodes_to_remove)} number of MIBIG nodes will be removed from analysis due to no similarity" ) df_network = df_network[~df_network["Clustername 1"].isin(nodes_to_remove)] df_network = df_network[~df_network["Clustername 2"].isin(nodes_to_remove)] for node in nodes_to_remove: if node in df_known.index: df_known = df_known.drop(node) return df_network, df_known def get_family_graph(G_clusters): """ Returns families list as networkx graph """ # Find connected components or cluster families Families_list = list(nx.connected_components(G_clusters)) # Sort families in decreasing order of occurrence family_size = [len(family) for family in Families_list] orig_index = list(range(len(family_size))) orig_index_fam_size_dict = dict(zip(orig_index, family_size)) sorted_index_fam_size_dict = OrderedDict( sorted(orig_index_fam_size_dict.items(), key=lambda x: x[1]) ) new_index = list(range(len(family_size) - 1, -1, -1)) orig_index_sorted = list(sorted_index_fam_size_dict.keys()) new_orig_index_dict = dict(zip(new_index, orig_index_sorted)) # Ordered family graphs family_graphs = [ Families_list[new_orig_index_dict[fam_id]] for fam_id in range(len(Families_list)) ] return family_graphs def update_cluster_family( df_clusters, df_known, family_nodes, mibig_bgc_table, cutoff="0.30" ): """ Updates df_clusters with family ids (connected components) """ df_mibig_bgcs = pd.read_csv(mibig_bgc_table, index_col="mibig_id") df_families = pd.DataFrame( columns=["fam_type", "fam_name", "clusters_in_fam", "mibig_ids"] ) for cntr in range(len(family_nodes)): fam_id = cntr + 1 family = family_nodes[cntr] known_bgcs = [bgc for bgc in family if bgc in df_mibig_bgcs.index] # if len(known_bgcs) > 0: df_families.loc[fam_id, "fam_type"] = "known_family" logging.debug( f'Family {fam_id} has {len(known_bgcs)} known BGCs: {", ".join(known_bgcs)}' ) # handle missing entry from MIBIG release try: known_compounds = ";".join( df_known.loc[known_bgcs, "compounds"].tolist() ) except TypeError: logging.warning( f"MIBIG entries {known_bgcs} returned no compounds: {df_known.loc[known_bgcs, 'compounds'].values}" ) logging.warning( "This is likely because of missing JSON entry in the downloaded MIBIG release." 
) logging.warning( "Check if this is the case in the resources/mibig/df_mibig_bgcs.csv folder" ) logging.warning("Replacing compound value with link to MIBIG...") for h in known_bgcs: link_mibig = f"https://mibig.secondarymetabolites.org/repository/{h.split('.')[0]}/index.html" logging.warning( f"{h}: Replacing compound {df_known.loc[h, 'compounds']} with {link_mibig}" ) df_known.loc[h, "compounds"] = link_mibig df_families.loc[fam_id, "fam_name"] = known_compounds df_families.loc[fam_id, "clusters_in_fam"] = len(family) df_families.loc[fam_id, "mibig_ids"] = ";".join(known_bgcs) else: df_families.loc[fam_id, "fam_type"] = "unknown_family" bgc_class = ",".join( df_clusters.loc[list(family), "bigscape_class"].unique().tolist() ) df_families.loc[fam_id, "fam_name"] = "u_" + bgc_class + "_" + str(fam_id) df_families.loc[fam_id, "clusters_in_fam"] = len(family) for bgc in family: if bgc in df_clusters.index: df_clusters.loc[bgc, "fam_id_" + cutoff] = str(fam_id) if len(known_bgcs) > 0: df_clusters.loc[bgc, "fam_type_" + cutoff] = "known_family" known_compounds = ";".join( df_known.loc[known_bgcs, "compounds"].tolist() ) df_clusters.loc[ bgc, "fam_known_compounds_" + cutoff ] = known_compounds else: df_clusters.loc[bgc, "fam_type_" + cutoff] = "unknown_family" df_clusters.loc[bgc, "fam_known_compounds_" + cutoff] = ( "u_" + bgc_class + "_" + str(fam_id) ) elif bgc in df_known.index: df_known.loc[bgc, "fam_id_" + cutoff] = str(fam_id) df_known.loc[bgc, "fam_type_" + cutoff] = "known_family" known_compounds = ";".join( df_known.loc[known_bgcs, "compounds"].tolist() ) df_known.loc[bgc, "fam_known_compounds_" + cutoff] = known_compounds return df_clusters, df_known, df_families def get_family_presence(df_clusters, df_genomes, df_families, cutoff): """ Returns BGC family presence absence matrix across genomes """ df_family_presence = pd.DataFrame( 0, index=df_genomes.index, columns=df_families.index.astype(str) ) for bgc_id in df_clusters.index: fam_id = str(df_clusters.loc[bgc_id, "fam_id_" + cutoff]) genome_id = df_clusters.loc[bgc_id, "genome_id"] df_family_presence.loc[genome_id, fam_id] = 1 return df_family_presence def run_family_analysis( cutoff, net_data_path, df_clusters, df_genomes, df_known_all, output_dir, mibig_table, query_name, ): logging.info(f"Processing data from BiG-SCAPE with cutoff {cutoff}") df_network = get_bigscape_network(net_data_path, cutoff=cutoff) G_clusters, family_nodes = get_network_graph(df_network, weight="Raw distance") df_network, df_known = remove_single_mibig(df_network, df_known_all, family_nodes) G_clusters, family_nodes = get_network_graph(df_network, weight="Raw distance") singleton_bgc = [list(fam)[0] for fam in family_nodes if len(fam) == 1] family_graphs = get_family_graph(G_clusters) df_clusters, df_known, df_families = update_cluster_family( df_clusters, df_known, family_nodes, mibig_table, cutoff=cutoff ) df_family_presence = get_family_presence( df_clusters, df_genomes, df_families, cutoff=cutoff ) logging.debug(f"Number of genomes: {df_genomes.shape[0]}") logging.debug(f"Number of BGCs: {df_clusters.shape[0]}") logging.debug(f"Number of edges in the network: {df_network.shape[0]}") logging.debug(f"Number of total families: {len(family_nodes)}") logging.debug( f"Number of total non-single families: {len(family_nodes) - len(singleton_bgc)}" ) logging.debug(f"Number of singleton families in the network: {len(singleton_bgc)}") logging.debug( f"Number of families with known BGCs: {df_families[df_families.fam_type == 'known_family'].shape[0]}" ) 
logging.debug(f"Number of known BGCs in the network: {df_known.shape[0]}") logging.debug( f"Number of BGCs in top 10 families {[len(fam) for fam in family_graphs[:10]]}" ) df_known_families = df_families[df_families.fam_type == "known_family"] logging.debug( f"Some of the common known BGCs{chr(10)}{chr(10)}{df_known_families.iloc[:20,1:3]}{chr(10)}" ) df_unknown_families = df_families[df_families.fam_type == "unknown_family"] logging.debug( f"Some of the common unknown BGCs:{chr(10)}{chr(10)}{df_unknown_families.iloc[:20,1:3]}{chr(10)}" ) # Assign index name df_network.index.name = "bigscape_edge_id" df_known.index.name = "bgc_id" df_families.index.name = f"fam_id_{cutoff}" df_clusters.index.name = "bgc_id" df_family_presence.index.name = "genome_id" # Save all the dataframes df_network.to_csv(f"{output_dir}/{query_name}_df_network_" + cutoff + ".csv") df_known.to_csv(f"{output_dir}/{query_name}_df_known_" + cutoff + ".csv") df_families.to_csv(f"{output_dir}/{query_name}_df_families_" + cutoff + ".csv") df_clusters.to_csv(f"{output_dir}/{query_name}_df_clusters_" + cutoff + ".csv") df_family_presence.to_csv( f"{output_dir}/{query_name}_df_family_presence_" + cutoff + ".csv" ) return df_clusters, df_families, df_network def process_bigscape_output( bigscape_directory, as_dir, df_genomes_path, mibig_bgc_table, output_dir ): bigscape_directory = Path(bigscape_directory) output_dir = Path(output_dir) output_dir.mkdir(parents=True, exist_ok=True) # get the latest run bigscape_runs = {} logging.info("Getting BiG-SCAPE runs...") for net_data_path in bigscape_directory.glob("network_files/*"): logging.debug(f"Found network data: {net_data_path}") selected_run_folder = net_data_path.name selected_run_time = selected_run_folder.split("_glocal")[0] selected_run_time = datetime.strptime(selected_run_time, "%Y-%m-%d_%H-%M-%S") bigscape_runs[selected_run_time] = net_data_path run_times = list(bigscape_runs.keys()) run_times.sort(reverse=True) # make sure run times has values assert len(run_times) > 0 selected_run = run_times[0] net_data_path = bigscape_runs[selected_run] logging.info(f"Processing {selected_run}") node_annot_path = ( net_data_path / "Network_Annotations_Full.tsv" ) # Read the BGC table df_nodes = pd.read_csv(node_annot_path, index_col="BGC", sep="\t") # Generate df_clusters and df_known dataframe df_genomes = pd.read_csv(df_genomes_path).set_index("genome_id", drop=False) df_genomes.to_csv(f"{output_dir}/df_genome_antismash_summary.csv") df_known_all, df_clusters = get_cluster_dataframes( df_genomes, df_nodes, mibig_bgc_table, as_dir ) assert ( len(df_clusters) > 0 ), f"df_cluster is empty {df_clusters.shape}. Check folder data/interim/bgcs to have proper antismash region genbank files." # Enrich dataframes with BiGSCAPE information on GCCs and GCFs with cutoffs df_clusters, df_known_all = add_bigscape_families( df_clusters, df_known_all, net_data_path ) # Get GCF data as per the cutoff for cutoff in ["0.30", "0.40", "0.50"]: df_clusters, df_families, df_network = run_family_analysis( cutoff, net_data_path, df_clusters, df_genomes, df_known_all, output_dir, mibig_bgc_table, str(selected_run).replace(":", "_"), ) return if __name__ == "__main__": process_bigscape_output( sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5] ) |
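The family tables written by `run_family_analysis` are keyed by the BiG-SCAPE run timestamp and cutoff (0.30, 0.40, 0.50). A minimal sketch for summarizing known vs. unknown families at one cutoff; the output directory is a placeholder:

```python
from pathlib import Path

import pandas as pd

# placeholder path; use the output_dir passed to process_bigscape_output()
out = Path("data/processed/my_project/bigscape/")
clusters_csv = sorted(out.glob("*_df_clusters_0.30.csv"))[-1]

df_clusters = pd.read_csv(clusters_csv, index_col="bgc_id")
print(df_clusters["fam_type_0.30"].value_counts())
```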
```python
import json
import logging
import sys
from pathlib import Path

import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def combine_bgc_counts(input_json, filter_str="_bgc_counts"):
    container = {}
    logging.info("Reading json files...")
    with alive_bar(len(input_json), title="Merging BGC counts:") as bar:
        for item in input_json:
            item = Path(item)
            genome_id = item.stem
            if filter_str in genome_id:
                genome_id = genome_id.replace(filter_str, "")
            logging.debug(f"Processing {genome_id}")
            with open(item, "r") as f:
                data = json.load(f)
                products = data[genome_id].pop("products")
                data[genome_id] = data[genome_id] | products
                container.update(data)
            bar()
    return container


def write_genome_table(input_json, samples_table, genome_table):
    """
    Write df_genomes.csv table in processed data
    """
    # Accommodate multiple inputs to generate dataframe
    shell_input = samples_table.split()
    logging.info(f"Reading samples table: {shell_input}")
    dfList = [pd.read_csv(s).set_index("genome_id", drop=False) for s in shell_input]
    df_samples = pd.concat(dfList, axis=0)

    # Handle multiple json
    input_json = Path(input_json)
    logging.info(input_json)
    if input_json.is_file() and input_json.suffix == ".json":
        logging.info(f"Getting BGC overview from a single file: {input_json}")
        input_json_files = input_json
    elif input_json.is_file() and input_json.suffix == ".txt":
        logging.info(f"Getting BGC overview from a text file: {input_json}")
        with open(input_json, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            input_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        input_json_files = [
            Path(file)
            for file in str(input_json).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting BGC overview from the given list of {len(input_json_files)} files..."
        )

    bgc_counts = combine_bgc_counts(input_json_files)
    bgc_counts = pd.DataFrame.from_dict(bgc_counts).T

    logging.debug(f"Writing file to: {genome_table}")

    # Generate dataframe
    df_genomes = pd.concat([df_samples, bgc_counts], axis=1)

    # Save dataframes to csv tables
    genome_table = Path(genome_table)
    genome_table.parent.mkdir(parents=True, exist_ok=True)
    df_genomes.to_csv(genome_table, index=False)

    logging.info("Job done")
    return None


if __name__ == "__main__":
    write_genome_table(sys.argv[1], sys.argv[2], sys.argv[3])
```
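The merged genome table combines the sample sheet with the per-genome BGC counts, so it can be ranked directly. The file name follows the `df_genomes.csv` mentioned in the docstring above, but the actual location depends on your project output layout:

```python
import pandas as pd

# placeholder path following the df_genomes.csv name used in the docstring above
df_genomes = pd.read_csv("df_genomes.csv")
top = df_genomes.sort_values("bgcs_count", ascending=False)
print(top[["genome_id", "bgcs_count", "bgcs_on_contig_edge"]].head())
```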
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 | import logging import os import sys from shutil import copyfile import matplotlib.pyplot as plt import pandas as pd import plotly.express as px import plotly.graph_objects as go from Bio import Phylo, SeqIO log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def get_roary_data( roary_interim_folder, roary_processed_folder, core=0.99, softcore=0.95, shell=0.15 ): """ Copy important files from Roary interim to processed directory Parameters ---------- 1. roary_folder : str / path Location of the output from Roary in the interim directory 2. roary_processed_folder : str / path Location of the processed output directory for important Roary results 3. core: int (0 to 1) default = 0.99 Fraction of genomes the gene must be present to be group in core (99% <= strains <= 100%) 4. softcore: int (0 to 1) default = 0.95 Fraction of genomes the gene must be present to be group in softcore (95% <= strains < 99%) 5. 
shell: int (0 to 1) default = 0.15 Fraction of genomes the gene must be present to be group in softcore (15% <= strains < 95%) Remaining genes will be in cloud Returns ------- 1. roary_processed_folder : str / path Location of the processed output directory for important Roary results """ gene_presence_path = os.path.join(roary_interim_folder, "gene_presence_absence.csv") df_gene_presence_summary = pd.read_csv( gene_presence_path, index_col="Gene", low_memory=False ) # Extract gene annotation columns to separate dataframe logging.info("Extract gene annotation columns to separate dataframe") gene_summary_columns = [ "Non-unique Gene name", "Annotation", "No. isolates", "No. sequences", "Avg sequences per isolate", "Genome Fragment", "Order within Fragment", "Accessory Fragment", "Accessory Order with Fragment", "QC", "Min group size nuc", "Max group size nuc", "Avg group size nuc", ] gene_summary_out_interim = os.path.join( roary_interim_folder, "df_pangene_summary.csv" ) gene_summary_out_processed = os.path.join( roary_processed_folder, "df_pangene_summary.csv" ) df_gene_summary = df_gene_presence_summary[gene_summary_columns].fillna("") # Extract locus tags logging.info("Extract locus tags") df_gene_presence = df_gene_presence_summary.drop( columns=gene_summary_columns ).fillna("") gene_presence_locustag_out_interim = os.path.join( roary_interim_folder, "df_gene_presence_locustag.csv" ) gene_presence_locustag_out_processed = os.path.join( roary_processed_folder, "df_gene_presence_locustag.csv" ) df_gene_presence.to_csv(gene_presence_locustag_out_interim) df_gene_presence.to_csv(gene_presence_locustag_out_processed) # Save the gene presence absence binary matrix logging.info("Save the gene presence absence binary matrix") gene_presence_binary_path = os.path.join( roary_interim_folder, "gene_presence_absence.Rtab" ) gene_presence_binary_out_interim = os.path.join( roary_interim_folder, "df_gene_presence_binary.csv" ) gene_presence_binary_out_processed = os.path.join( roary_processed_folder, "df_gene_presence_binary.csv" ) df_gene_presence_binary = pd.read_csv( gene_presence_binary_path, index_col="Gene", sep="\t" ) df_gene_presence_binary.to_csv(gene_presence_binary_out_interim) df_gene_presence_binary.to_csv(gene_presence_binary_out_processed) # Add locus_tag and pangenome_id from pan_genome_reference.fa file logging.info("Add locus_tag and pangenome_id from pan_genome_reference.fa file") pan_fasta_path = os.path.join(roary_interim_folder, "pan_genome_reference.fa") records = SeqIO.parse(pan_fasta_path, format="fasta") for rec in records: locus_tag = rec.id pan_gene_desc = rec.description.split(" ") pan_gene_id = " ".join([pan_gene_id for pan_gene_id in pan_gene_desc[1:]]) if pan_gene_id in df_gene_summary.index: df_gene_summary.loc[pan_gene_id, "locus_tag"] = locus_tag # Add pangenome class and save gene summary table logging.info("Add pangenome class and save gene summary table") total_genomes = df_gene_presence.shape[1] for pan_gene_id in df_gene_summary.index: no_isolates = df_gene_summary.loc[pan_gene_id, "No. 
isolates"] if no_isolates >= core * total_genomes: df_gene_summary.loc[pan_gene_id, "pangenome_class"] = "core" elif no_isolates >= softcore * total_genomes: df_gene_summary.loc[pan_gene_id, "pangenome_class"] = "softcore" elif no_isolates >= shell * total_genomes: df_gene_summary.loc[pan_gene_id, "pangenome_class"] = "shell" else: df_gene_summary.loc[pan_gene_id, "pangenome_class"] = "cloud" df_gene_summary.to_csv(gene_summary_out_interim) df_gene_summary.to_csv(gene_summary_out_processed) # Copy other output files to processed directory logging.info("Copy other output files to processed directory") copyfile( os.path.join(roary_interim_folder, "conserved_vs_total_genes.png"), os.path.join(roary_processed_folder, "conserved_vs_total_genes.png"), ) copyfile( os.path.join(roary_interim_folder, "Rplots.pdf"), os.path.join(roary_processed_folder, "Rplots.pdf"), ) copyfile( os.path.join(roary_interim_folder, "pan_genome_reference.fa"), os.path.join(roary_processed_folder, "pan_genome_reference.fa"), ) copyfile( os.path.join(roary_interim_folder, "summary_statistics.txt"), os.path.join(roary_processed_folder, "summary_statistics.txt"), ) copyfile( os.path.join(roary_interim_folder, "number_of_conserved_genes.Rtab"), os.path.join(roary_processed_folder, "number_of_conserved_genes.Rtab"), ) copyfile( os.path.join(roary_interim_folder, "number_of_genes_in_pan_genome.Rtab"), os.path.join(roary_processed_folder, "number_of_genes_in_pan_genome.Rtab"), ) return df_gene_presence, df_gene_presence_binary, df_gene_summary def plot_core_pan_curve(roary_interim_folder, roary_processed_folder): """ Plots pangenome curves for core, pan, new and unique gene against number of genomes Parameters ---------- 1. roary_folder : str / path Location of the output from Roary in the interim directory Returns ------- 1. 
roary_processed_folder : str / path Location of the processed output directory where two plots will be saved """ core_path = os.path.join(roary_interim_folder, "number_of_conserved_genes.Rtab") pan_path = os.path.join(roary_interim_folder, "number_of_genes_in_pan_genome.Rtab") new_path = os.path.join(roary_interim_folder, "number_of_new_genes.Rtab") uniq_path = os.path.join(roary_interim_folder, "number_of_unique_genes.Rtab") df_core_cnt = pd.read_table(core_path, sep="\t") df_pan_cnt = pd.read_table(pan_path, sep="\t") df_new_cnt = pd.read_csv(new_path, sep="\t") df_uniq_cnt = pd.read_csv(uniq_path, sep="\t") avg_core = df_core_cnt.sum(0) / df_core_cnt.shape[0] max_core = df_core_cnt.max(0) min_core = df_core_cnt.min(0) avg_pan = df_pan_cnt.sum(0) / df_core_cnt.shape[0] max_pan = df_pan_cnt.max(0) min_pan = df_pan_cnt.min(0) avg_new = df_new_cnt.sum(0) / df_uniq_cnt.shape[0] max_new = df_new_cnt.max(0) min_new = df_new_cnt.min(0) avg_uniq = df_uniq_cnt.sum(0) / df_uniq_cnt.shape[0] max_uniq = df_uniq_cnt.max(0) min_uniq = df_uniq_cnt.min(0) genome_numbers = list(range(1, df_core_cnt.shape[1] + 1)) # Plot pan and core genome curves in JPEG format fig_pan_core = go.Figure() fig_pan_core.add_trace( go.Scatter( x=genome_numbers, y=avg_core, fill="tonexty", mode="lines+markers", name="Core", hoverinfo="name+x+y", error_y=dict( type="data", symmetric=False, array=max_core - avg_core, arrayminus=avg_core - min_core, ), ) ) # fill down to xaxis fig_pan_core.add_trace( go.Scatter( x=genome_numbers, y=avg_pan, fill="tonexty", mode="lines+markers", name="Pan", hoverinfo="name+x+y", error_y=dict( type="data", symmetric=False, array=max_pan - avg_pan, arrayminus=avg_pan - min_pan, ), ) ) # fill to trace0 y fig_pan_core.update_layout( autosize=False, width=800, height=500, margin=dict(l=20, r=20, b=30, t=30, pad=4), title_text="Pangenome and coregenome curve", xaxis_title_text="#Genomes", yaxis_title_text="#Genes", paper_bgcolor="White", ) fig_pan_core_path = os.path.join(roary_processed_folder, "pan_core_curve.jpeg") if not os.path.isfile(fig_pan_core_path): fig_pan_core.write_image(fig_pan_core_path) # Plot new and unique genome curves in JPEG format fig_new_uniq = go.Figure() fig_new_uniq.add_trace( go.Scatter( x=genome_numbers, y=avg_new, fill="tonexty", mode="lines+markers", name="New", hoverinfo="name+x+y", error_y=dict( type="data", symmetric=False, array=max_new - avg_new, arrayminus=avg_new - min_new, ), ) ) # fill down to xaxis fig_new_uniq.add_trace( go.Scatter( x=genome_numbers, y=avg_uniq, fill="tonexty", mode="lines+markers", name="Unique", hoverinfo="name+x+y", error_y=dict( type="data", symmetric=False, array=max_uniq - avg_uniq, arrayminus=avg_uniq - min_uniq, ), ) ) # fill to trace0 y fig_new_uniq.update_layout( autosize=False, width=800, height=500, margin=dict(l=20, r=20, b=30, t=30, pad=4), title_text="New and unique genes curve", xaxis_title_text="#Genomes", yaxis_title_text="#Genes", paper_bgcolor="White", ) fig_new_uniq_path = os.path.join(roary_processed_folder, "new_unique_curve.jpeg") if not os.path.isfile(fig_new_uniq_path): fig_new_uniq.write_image(fig_new_uniq_path) return def plot_pan_freq_plot(df_gene_presence_binary, roary_processed_folder): """ Plots pangenome frequence plot for number of genes present in number of genomes Parameters ---------- 1. df_gene_presence_binary : pd.DataFrame Dataframe with presence absence matrix of genes in the pangenome Returns ------- 1. 
roary_processed_folder : str / path Location of the processed output directory where gene frequency plot will be saved """ df_freq = pd.DataFrame(index=df_gene_presence_binary.index, columns=["#Genomes"]) df_freq["#Genomes"] = df_gene_presence_binary.sum(axis=1) nbins = df_gene_presence_binary.shape[1] fig = px.histogram(df_freq, x="#Genomes", nbins=nbins) fig.update_layout( autosize=False, width=800, height=500, margin=dict(l=20, r=20, b=30, t=30, pad=4), title_text="Pangenome frequency plot", # title of plot xaxis_title_text="#Genomes", # xaxis label yaxis_title_text="#Genes", # yaxis label ) fig_out_path = os.path.join(roary_processed_folder, "gene_frequency.jpeg") if not os.path.isfile(fig_out_path): fig.write_image(fig_out_path) return def plot_pan_pie_chart( df_gene_summary, total_genomes, roary_processed_folder, core=0.99, softcore=0.95, shell=0.15, ): """ Plots pangenome frequence plot for number of genes present in number of genomes Parameters ---------- 1. df_gene_summary : pd.DataFrame Dataframe with gene summary including pangenome_class 2. total_genomes : int Number of total genomes in the pangenome 3. roary_processed_folder : str / path Location of the processed output directory for Roary 4. core: int (0 to 1) default = 0.99 Fraction of genomes the gene must be present to be group in core (99% <= strains <= 100%) 5. softcore: int (0 to 1) default = 0.95 Fraction of genomes the gene must be present to be group in softcore (95% <= strains < 99%) 6. shell: int (0 to 1) default = 0.15 Fraction of genomes the gene must be present to be group in softcore (15% <= strains < 95%) Remaining genes will be in cloud Returns ------- 1. roary_processed_folder : str / path Location of the processed output directory where pangenome pie chart plot will be saved """ # Plot the pangenome pie chart plt.figure(figsize=(10, 10)) core_genes = df_gene_summary[df_gene_summary["pangenome_class"] == "core"].shape[0] softcore_genes = df_gene_summary[ df_gene_summary["pangenome_class"] == "softcore" ].shape[0] shell_genes = df_gene_summary[df_gene_summary["pangenome_class"] == "shell"].shape[ 0 ] cloud_genes = df_gene_summary[df_gene_summary["pangenome_class"] == "cloud"].shape[ 0 ] total_genes = df_gene_summary.shape[0] def my_autopct(pct): val = int(pct * total_genes / 100.0) return "{v:d}".format(v=val) fig = plt.pie( [core_genes, softcore_genes, shell_genes, cloud_genes], labels=[ "core (<= %d strains)" % (total_genomes * core), "soft-core (%d <= strains < %d)" % (total_genomes * softcore, total_genomes * core), "shell\n(%d <= strains < %d)" % (total_genomes * softcore, total_genomes * shell), "cloud\n(strains < %d)" % (total_genomes * shell), ], explode=[0.1, 0.05, 0.02, 0], radius=0.9, colors=[ (0, 0, 1, float(x) / total_genes) for x in (core_genes, softcore_genes, shell_genes, cloud_genes) ], autopct=my_autopct, textprops={"fontsize": 10, "fontweight": "bold"}, ) fig_out_path = os.path.join(roary_processed_folder, "pangenome_pie.jpeg") if not os.path.isfile(fig_out_path): plt.savefig( fig_out_path, facecolor="w", edgecolor="w", orientation="portrait", format="jpeg", transparent=False, bbox_inches="tight", pad_inches=0.1, ) return def plot_tree_presence_map( t, df_gene_presence_binary, df_genomes_tree, df_roary_sorted, roary_processed_folder ): """ Plots gene presence heatmap with phylogenetic tree Parameters ---------- 1. tree : Phlyo.tree Phylogenetic tree object of Bio.Phylo from autoMLST 1. df_gene_presence_binary : pd.DataFrame Dataframe with presence absence matrix of genes in the pangenome 2. 
df_genomes_tree : pd.DataFrame Dataframe with genomes ordered phylogenetically 3. df_roary_sorted : pd.DataFrame Sorted matrix with gene presence across the genomes 4. roary_processed_folder : str / path Location of the processed output directory for Roary Returns ------- 1. heatmap : jpeg Heatmap of gene presence with genomes in phylogenetic order """ # Max distance to create better plot mdist = max([t.distance(t.root, x) for x in t.get_terminals()]) # Considere clustering methods for sorting the matrix # PLot presence/absence matrix against the tree fig = plt.figure(figsize=(30, df_roary_sorted.shape[1] / 6)) ax1 = plt.subplot2grid((1, 40), (0, 10), colspan=30) a = ax1.imshow( df_roary_sorted.T, cmap=plt.cm.Blues, vmin=0, vmax=1, aspect="auto", interpolation="none", ) ax1.set_yticks([]) ax1.set_xticks([]) ax1.axis("off") ax = fig.add_subplot(1, 2, 1) ax = plt.subplot2grid((1, 40), (0, 0), colspan=10) fig.subplots_adjust(wspace=1, hspace=0) ax1.set_title("Roary matrix\n(%d gene clusters)" % df_roary_sorted.shape[0]) strain_dict = dict() for x in t.get_terminals(): if x.name in df_genomes_tree.index: strain_dict[x.name] = df_genomes_tree.loc[x.name, "strain"] else: genome_id = x.name[:-1] + "." + x.name[-1] strain_dict[x.name] = df_genomes_tree.loc[genome_id, "strain"] # Tree labels are extracted from the column 'strain' of df_genmomes_tree Phylo.draw( t, axes=ax, show_confidence=False, label_func=lambda x: strain_dict[x.name] if x.is_terminal() else "", xticks=([],), yticks=([],), ylabel=("",), xlabel=("",), xlim=(-0.01, mdist + 0.01), axis=("off",), title=("autoMLST tree\n(%d strains)" % df_roary_sorted.shape[1],), ) fig_out_path = os.path.join(roary_processed_folder, "phylo_presence_heatmap.jpeg") if not os.path.isfile(fig_out_path): plt.savefig( fig_out_path, facecolor="w", edgecolor="w", orientation="portrait", format="jpeg", transparent=False, bbox_inches="tight", pad_inches=0.1, ) return plt.show() def sort_roary_matrix( df_gene_presence_binary, df_genomes_tree, roary_interim_folder, roary_processed_folder, ): """ Sort roary presence matric- genomes in phylogenetic order and gene in order of the sum of strains present Parameters ---------- 1. df_gene_presence_binary : pd.DataFrame Dataframe with presence absence matrix of genes in the pangenome 2. df_genomes_tree : pd.DataFrame Dataframe with genomes ordered phylogenetically Returns ------- 1. 
df_roary_sorted : pd.DataFrame Sorted matrix with gene presence across the genomes Saves sorted matrix as csv table """ # Gene presence matrix df_roary = df_gene_presence_binary.copy() # Sort the matrix by the sum of strains presence sorted_idx = df_roary.sum(axis=1).sort_values(ascending=False).index df_roary_sorted = df_roary.loc[sorted_idx] # Sort the matrix according to phylogeny df_roary_sorted = df_roary_sorted[df_genomes_tree.index] roary_sorted_interim = os.path.join( roary_interim_folder, "df_roary_tree_sum_sorted.csv" ) roary_sorted_processed = os.path.join( roary_processed_folder, "df_roary_tree_sum_sorted.csv" ) df_roary_sorted.to_csv(roary_sorted_interim) df_roary_sorted.to_csv(roary_sorted_processed) # TO DO: Considere clustering methods for sorting the matrix return df_roary_sorted if __name__ == "__main__": roary_interim_folder = sys.argv[1] roary_processed_folder = sys.argv[2] automlst_interim_folder = sys.argv[3] df_gene_presence, df_gene_presence_binary, df_gene_summary = get_roary_data( roary_interim_folder, roary_processed_folder, core=0.99, softcore=0.95, shell=0.15, ) plot_core_pan_curve(roary_interim_folder, roary_processed_folder) plot_pan_freq_plot(df_gene_presence_binary, roary_processed_folder) total_genomes = float(df_gene_presence_binary.shape[1]) plot_pan_pie_chart( df_gene_summary, total_genomes, roary_processed_folder, core=0.99, softcore=0.95, shell=0.15, ) # Read tree file newick_path = os.path.join(automlst_interim_folder, "final.newick") t = Phylo.read(newick_path, "newick") genome_tree_table_path = os.path.join( automlst_interim_folder, "df_genomes_tree.csv" ) df_genomes_tree = pd.read_csv(genome_tree_table_path, index_col="genome_id") # Sort the matrix according to phylogeny df_roary_sorted = sort_roary_matrix( df_gene_presence_binary, df_genomes_tree, roary_interim_folder, roary_processed_folder, ) plot_tree_presence_map( t, df_gene_presence_binary, df_genomes_tree, df_roary_sorted, roary_processed_folder, ) |
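The pangenome post-processing above is driven by the `roary` pipeline; the corresponding rule (see the `make_pangenome_dataset.py` snippet below) passes the Roary interim folder, the processed Roary folder, and the autoMLST results folder containing `final.newick` and `df_genomes_tree.csv`. A sketch of a standalone call, using a placeholder project name and the usual data/interim and data/processed layout:

```bash
# placeholder project name and paths, for illustration only
python workflow/bgcflow/bgcflow/data/make_pangenome_dataset.py \
    data/interim/roary/Lactobacillus_delbrueckii \
    data/processed/Lactobacillus_delbrueckii/roary \
    data/processed/Lactobacillus_delbrueckii/automlst_wrapper
```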
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 | import json import os import sys from shutil import copyfile import pandas as pd from Bio import Phylo def copy_automlst_out(automlst_interim_folder, automlst_processed_folder): """ Copy important files from autoMLST interim to proecessed directory Parameters ---------- 1. automlst_interim_folder : str / path Location of the output from autoMLST in the interim directory Returns ------- 1. automlst_processed_folder : str / path Location of the processed output directory for important autoMLST results """ files_to_copy = [ "raxmlpart.txt", "raxmlpart.txt.bionj", "raxmlpart.txt.contree", "raxmlpart.txt.iqtree", "raxmlpart.txt.log", "raxmlpart.txt.mldist", "raxmlpart.txt.treefile", ] for tree_file in files_to_copy: from_path = os.path.join(automlst_interim_folder, tree_file) to_path = os.path.join(automlst_processed_folder, tree_file) if os.path.isfile(from_path): copyfile(from_path, to_path) # Selected tree newick file for downstream analysis by default automlst_wrapper treefile = os.path.join(automlst_interim_folder, "raxmlpart.txt.treefile") out_newick_processed = os.path.join(automlst_processed_folder, "final.newick") out_newick_interim = os.path.join(automlst_interim_folder, "final.newick") if os.path.isfile(treefile): copyfile(treefile, out_newick_interim) copyfile(treefile, out_newick_processed) # Save list of MLST genes in pandas table (currently with only index with TIGR IDs) aligned_dir = os.path.join(automlst_interim_folder, "aligned") mlst_out_path_processed = os.path.join( automlst_processed_folder, "df_mlst_genes.csv" ) mlst_out_path_interim = os.path.join(automlst_interim_folder, "df_mlst_genes.csv") mlst_gene_list = os.listdir(aligned_dir) df_mlst = pd.DataFrame(index=mlst_gene_list, columns=["aligned_path"]) for mlst_gene in df_mlst.index: df_mlst.loc[mlst_gene, "aligned_path"] = os.path.join( aligned_dir, mlst_gene + ".faa" ) df_mlst.index.name = "mlst_gene" df_mlst.to_csv(mlst_out_path_interim) df_mlst.to_csv(mlst_out_path_processed) return None def get_genome_tree_table( automlst_processed_folder, prokka_interim_folder, gtdb_interim_folder ): """ Copy important files from autoMLST interim to proecessed directory Parameters ---------- 1. automlst_processed_folder : str / path Location of the processed output directory for important autoMLST results 2. prokka_interim_folder : str/ path Path to get organism information for each genome as used in prokka 3. gtdb_interim_folder : str/ path Path to get json files with gtdb information Returns ------- 1. 
df_genomes_tree : pd.DataFrame Combined datadrame of organism info ordered in phylogenetic tree (final.newick file) """ newick_path = os.path.join(automlst_processed_folder, "final.newick") t = Phylo.read(newick_path, "newick") # Max distance to create better plot # mdist = max([t.distance(t.root, x) for x in t.get_terminals()]) # Sort the matrix according to tip labels in the tree genome_ids_list = [ x.name if x.name in os.listdir(prokka_interim_folder) else x.name[:-1] + "." + x.name[-1] for x in t.get_terminals() ] # The above is required in the default version of autmlst_wrapper # To be fixed in the orginal repo columns_list = [ "genus_original", "species_original", "strain", "domain", "phylum", "class", "order", "family", "genus", "species", ] df_genomes_tree = pd.DataFrame(index=genome_ids_list, columns=columns_list) df_genomes_tree.index.name = "genome_id" for genome_id in df_genomes_tree.index: # Reading organism infor used for prokka run including strain ID org_info_path = os.path.join( prokka_interim_folder, genome_id, "organism_info.txt" ) with open(org_info_path, "r") as file_obj: org_info = file_obj.readlines()[0] df_genomes_tree.loc[genome_id, "genus_original"] = org_info.split(",")[0] df_genomes_tree.loc[genome_id, "species_original"] = org_info.split(",")[1] df_genomes_tree.loc[genome_id, "strain"] = org_info.split(",")[2] # Reading gtdb metadata from JSON files gtdb_json_path = os.path.join(gtdb_interim_folder, genome_id + ".json") with open(gtdb_json_path, "r") as f: gtdb_dict = json.load(f) df_genomes_tree.loc[genome_id, "domain"] = gtdb_dict["gtdb_taxonomy"][ "domain" ] df_genomes_tree.loc[genome_id, "phylum"] = gtdb_dict["gtdb_taxonomy"][ "phylum" ] df_genomes_tree.loc[genome_id, "class"] = gtdb_dict["gtdb_taxonomy"][ "class" ] df_genomes_tree.loc[genome_id, "order"] = gtdb_dict["gtdb_taxonomy"][ "order" ] df_genomes_tree.loc[genome_id, "family"] = gtdb_dict["gtdb_taxonomy"][ "family" ] df_genomes_tree.loc[genome_id, "genus"] = gtdb_dict["gtdb_taxonomy"][ "genus" ] df_genomes_tree.loc[genome_id, "organism"] = gtdb_dict["gtdb_taxonomy"][ "species" ] try: df_genomes_tree.loc[genome_id, "species"] = gtdb_dict["gtdb_taxonomy"][ "species" ].split(" ")[1] except IndexError: # leave blank for empty taxonomy df_genomes_tree.loc[genome_id, "species"] = "" genomes_tree_path_interim = os.path.join( automlst_processed_folder, "df_genomes_tree.csv" ) genomes_tree_path_processed = os.path.join( automlst_processed_folder, "df_genomes_tree.csv" ) df_genomes_tree.to_csv(genomes_tree_path_interim) df_genomes_tree.to_csv(genomes_tree_path_processed) return df_genomes_tree if __name__ == "__main__": automlst_interim_folder = sys.argv[1] automlst_processed_folder = sys.argv[2] prokka_interim_folder = sys.argv[3] gtdb_interim_folder = sys.argv[4] copy_automlst_out(automlst_interim_folder, automlst_processed_folder) df_genomes_tree = get_genome_tree_table( automlst_processed_folder, prokka_interim_folder, gtdb_interim_folder ) |
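The matching rule (the `make_phylo_tree.py` snippet below) supplies four folders: the autoMLST interim folder, the processed output folder, the Prokka interim folder, and the GTDB interim folder. A sketch with placeholder paths:

```bash
# placeholder paths for illustration only
python workflow/bgcflow/bgcflow/data/make_phylo_tree.py \
    data/interim/automlst_wrapper/Lactobacillus_delbrueckii \
    data/processed/Lactobacillus_delbrueckii/automlst_wrapper \
    data/interim/prokka \
    data/interim/gtdb
```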
```python
import glob
import gzip
import io
import logging
import sys
from pathlib import Path

import ncbi_genome_download as ngd
import pandas as pd

log_format = "%(levelname)-8s %(asctime)s %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.INFO)


def build_prokka_refgbff(name, prokka_db_table, outfile):
    """
    Given a project name and a csv table path, download all NCBI accession ids
    and return a concatenated gbk file for the prokka --proteins parameter.
    """
    download_dir = Path(f"resources/prokka_db/{name}")
    prokka_db_table = Path(prokka_db_table).resolve()

    # Generate download directory
    download_dir.mkdir(parents=True, exist_ok=True)

    # Get lists of accession ids and their sources
    logging.info(f"Reading input file: {prokka_db_table}...")
    if prokka_db_table.suffix == ".csv":
        df = pd.read_csv(prokka_db_table)
    elif prokka_db_table.suffix == ".json":
        df = pd.read_json(prokka_db_table)

    ngd_input = {"refseq": [], "genbank": []}
    logging.info("Downloading genomes from NCBI...")
    for acc in df.Accession:
        if acc.startswith("GCF"):
            ngd_input["refseq"].append(acc)
        elif acc.startswith("GCA"):
            ngd_input["genbank"].append(acc)
        else:
            raise ValueError(f"Accession {acc} must start with GCF or GCA")

    # Download gbff with NCBI genome download
    for s in ngd_input.keys():
        if ngd_input[s]:
            acc_list = ",".join(ngd_input[s])
            logging.debug(f"Downloading {s} genomes: {acc_list}")
            ngd.download(
                section=s,
                file_formats="genbank",
                assembly_accessions=acc_list,
                output=download_dir,
                groups="bacteria",
            )

    # Concatenate gbff
    logging.info("Concatenating genomes into a single file...")
    reference_files = glob.glob(f"{download_dir}/*/*/*/*.gbff.gz")
    with open(outfile, "w") as f_out:
        for names in reference_files:
            with io.TextIOWrapper(gzip.open(names, "r")) as infile:
                f = infile.read()
                f_out.write(f)
                f_out.write("\n")
    logging.info(f"Reference genomes saved as: {outfile}")
    return


if __name__ == "__main__":
    build_prokka_refgbff(sys.argv[1], sys.argv[2], sys.argv[3])
```
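The reference table consumed by this script only needs an `Accession` column holding RefSeq (`GCF_*`) or GenBank (`GCA_*`) assembly accessions; the rule calling `make_prokka_db.py` (further below) runs it once per configured prokka-db entry. A small hypothetical example, with made-up table contents and output path:

```bash
# hypothetical prokka-db table, for illustration only
cat > prokka_db_example.csv <<'EOF'
Accession
GCF_000196035.1
GCA_000005845.2
EOF
python workflow/bgcflow/bgcflow/data/make_prokka_db.py my_refs prokka_db_example.csv resources/prokka_db/my_refs.gbff
```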
```python
import json
import logging
import sys
from pathlib import Path

import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def gather_seqfu_json(seqfu_input, output_file):
    """
    Merge seqfu json input into one csv table
    """
    # Accommodate multiple inputs to generate dataframe
    shell_input = Path(seqfu_input)
    logging.info(shell_input)
    if shell_input.is_file() and shell_input.suffix == ".json":
        logging.info(f"Getting SeqFu overview from a single file: {shell_input}")
        # wrap the single file in a list so it can be iterated below
        shell_input_files = [shell_input]
    elif shell_input.is_file() and shell_input.suffix == ".txt":
        logging.info(f"Getting SeqFu overview from a text file: {shell_input}")
        with open(shell_input, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            shell_input_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        shell_input_files = [
            Path(file)
            for file in str(shell_input).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting SeqFu overview from the given list of {len(shell_input_files)} files..."
        )

    logging.info("Merging SeqFu results...")
    container = {}
    with alive_bar(len(shell_input_files), title="Updating SeqFu information:") as bar:
        for d in shell_input_files:
            logging.debug(f"Reading input: {Path(d).name}")
            with open(d, "r") as f:
                read = json.load(f)
                genome_id = read[0]["Filename"]
                read[0].pop("Filename")
                container[genome_id] = read[0]
            bar()

    logging.info("Converting dictionary to table...")
    df = pd.DataFrame.from_dict(container).T
    df.index.name = "genome_id"
    df.to_csv(output_file)
    logging.info("Job done!")
    return


if __name__ == "__main__":
    gather_seqfu_json(sys.argv[1], sys.argv[2])
```
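This gatherer follows the same convention as the BGC-count script above: the SeqFu rule writes the list of per-genome JSON files to a text file and passes it together with the output CSV (see the `make_seqfu_table.py` snippet below). Sketch with placeholder paths:

```bash
# placeholder paths for illustration only
echo 'data/interim/seqfu/genome_A.json data/interim/seqfu/genome_B.json' > df_seqfu.txt
python workflow/bgcflow/bgcflow/data/make_seqfu_table.py df_seqfu.txt df_seqfu_stats.csv
```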
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 | import json import logging import sqlite3 import sys from pathlib import Path import pandas as pd from alive_progress import alive_bar log_format = "%(levelname)-8s %(asctime)s %(message)s" date_format = "%d/%m %H:%M:%S" logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG) def gcf_hits(df_gcf, cutoff=900): """ Filter bigslice result based on distance threshold to model and generate data. """ mask = df_gcf.loc[:, "membership_value"] <= cutoff df_gcf_filtered = df_gcf[mask] bgcs = df_gcf_filtered.bgc_id.unique() gcfs = df_gcf_filtered.gcf_id.unique() logging.debug(f"BiG-SLICE query with BiG-FAM run 6, distance cutoff {cutoff}") logging.debug(f"Number of bgc hits : {len(bgcs)}/{len(df_gcf.bgc_id.unique())}") logging.debug(f"Number of GCF hits : {len(gcfs)}") return df_gcf_filtered def grab_gcf_model_summary(gcf_query, database): """ Summarize gcf model from a give list of bigfam gcf ids gcf_query = list of bigfam gfc id database = the full run database of bigslice """ def grab_mibig_members(q, mibig_bigfam): """ Summarize information of mibig members in a model """ mibig_members = list(set(q.bgc_id) & set(mibig_bigfam.id)) mibig_members = [mibig_bigfam.loc[i, "name"] for i in mibig_members] return mibig_members # initiate connection logging.info(f"Initiating SQL connection to {database}...") conn = sqlite3.connect(database) # load main tables logging.info("Reading gcf_membership table...") df_bigfam = pd.read_sql_query("select * from gcf_membership;", conn) logging.info("Reading bgc table...") df_bgc_bigfam = pd.read_sql_query("select * from bgc;", conn) # Load BiG-FAM dataset mibig_bigfam = df_bgc_bigfam[df_bgc_bigfam.loc[:, "type"] == "mibig"].set_index( "id", drop=False ) # Filter for hit queries # df_bigfam_query = df_bigfam[df_bigfam.loc[:, "gcf_id"].isin(gcf_query)] # return information of each model logging.info(f"Summarizing information for {len(gcf_query)} GCF models...") summary = {} with alive_bar(len(gcf_query)) as bar: for g in gcf_query: values = {} q = df_bigfam[df_bigfam.loc[:, "gcf_id"] == g] q = q[q.loc[:, "rank"] == 0] # select only first hits # get core member info q_core = q[q.loc[:, "membership_value"] <= 900] q_core_mibig = grab_mibig_members(q_core, mibig_bigfam) # get putative member info q_putative = q[q.loc[:, "membership_value"] > 900] q_putative_mibig = grab_mibig_members(q_putative, mibig_bigfam) values["core_member"] = len(q_core) values["putative_member"] = len(q_putative) values["core_member_mibig"] = q_core_mibig values["putative_member_mibig"] = q_putative_mibig values["core_member_mibig_count"] = len(q_core_mibig) values["core_member_mibig_bool"] = len(q_core_mibig) > 0 values["putative_member_mibig_count"] = len(q_putative_mibig) values["putative_member_mibig_bool"] = len(q_putative_mibig) > 0 values[ "link to BiG-FAM" ] = f"https://bigfam.bioinformatics.nl/run/6/gcf/{g}" summary[str(g)] = values bar() return summary def summarize_bigslice_query( bigslice_query_path, output_path, 
database_path="resources/bigslice/full_run_result/result/data.db", cutoff=900, ): """ Summarize bigslice query result. Input: - bigslice_query_path Output: - A folder to store two output files: 1. query_network.csv to build Cytoscape Network 2. JSON file summarizing GCF model hits """ bigslice_query = Path(bigslice_query_path) # database = Path(database_path) output = Path(output_path) output.mkdir(parents=True, exist_ok=True) df_gcf_membership = pd.read_csv(bigslice_query / "gcf_membership.csv") df_bgc = pd.read_csv(bigslice_query / "bgc.csv") # Create edge table network for cytoscape enriched with ids for mapping metadata (bgc_name, genome_id) logging.info("Generating network table of BiG-SLICE hits...") data = gcf_hits(df_gcf_membership, cutoff) bgc_info = df_bgc.set_index("id", drop=True) bgc_info["bgc_id"] = [str(i).replace(".gbk", "") for i in bgc_info["orig_filename"]] for i in data.index: bgc_id = data.loc[i, "bgc_id"] data.loc[i, "bgc_id"] = str(bgc_info.loc[bgc_id, "bgc_id"]) data.to_csv(output / "query_network.csv", index=False) # Summarizing GCF model hits logging.info("Summarizing GCF model hits...") gcf_query = list(data.gcf_id.unique()) gcf_summary = grab_gcf_model_summary(gcf_query, database_path) # outputting as json with open(output / "gcf_summary.json", "w") as outfile: json.dump(gcf_summary, outfile, indent=4) # outputting as csv table as_table = pd.DataFrame.from_dict(gcf_summary).T as_table.index.name = "gcf_id" as_table.to_csv(output / "gcf_summary.csv") logging.info("Job done") return if __name__ == "__main__": summarize_bigslice_query(sys.argv[1], sys.argv[2]) |
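The query-bigslice rule (below) passes four values: the BiG-SLiCE query folder, the output folder, the BiG-FAM database path, and the distance cutoff. Note that, as written, the `__main__` block above only forwards the first two arguments, so the database path and cutoff fall back to their defaults. A sketch of the two-argument call with placeholder paths:

```bash
# placeholder paths for illustration only; the BiG-FAM database defaults to
# resources/bigslice/full_run_result/result/data.db
python workflow/bgcflow/bgcflow/data/summarize_bigslice_query.py \
    data/interim/bigslice/query/Lactobacillus_delbrueckii \
    data/processed/Lactobacillus_delbrueckii/bigslice/query_against_bigfam
```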
```python
import sys

from Bio import SeqIO


def update_gbk_automlst(input_gbk, out_gbk, genome_id):
    """
    Update organism field with accession ID for autoMLST run
    """
    input_handle = open(input_gbk, "r")
    new_seq_records = []
    for seq_record in SeqIO.parse(input_handle, "genbank"):
        seq_record.annotations["organism"] = genome_id
        new_seq_records.append(seq_record)
    SeqIO.write(new_seq_records, out_gbk, "genbank")
    input_handle.close()
    return None


if __name__ == "__main__":
    update_gbk_automlst(sys.argv[1], sys.argv[2], sys.argv[3])
```
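The corresponding rule (the `prep_automlst.py` snippet below) uses this to overwrite the `organism` annotation of each GenBank record with the genome ID before the autoMLST run. Standalone sketch with placeholder file names:

```bash
# placeholder paths for illustration only
python workflow/bgcflow/bgcflow/features/prep_automlst.py \
    data/interim/prokka/genome_A/genome_A.gbk \
    data/interim/automlst_wrapper/Lactobacillus_delbrueckii/genome_A.gbk \
    genome_A
```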
```
shell:
    """
    download-antismash-databases --database-dir resources/antismash_db 2>> {log}
    antismash --version >> {log}
    antismash --check-prereqs >> {log}
    """
```
```
shell:
    """
    antismash --genefinding-tool {params.genefinding} --output-dir {params.folder} --cb-general --cb-subclusters --cb-knownclusters -c {threads} {input.gbk} --logfile {log} 2>> {log}
    """
```
46 47 48 49 50 51 | shell: """ download-antismash-databases --database-dir {output} &>> {log} antismash --version >> {log} antismash --database {output} --prepare-data &>> {log} """ |
70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 | shell: """ set +e # Find the latest existing JSON output for this strain latest_version=$(ls -d data/interim/antismash/*/{wildcards.strains}/{wildcards.strains}.json | grep {wildcards.strains} | sort -r | head -n 1 | cut -d '/' -f 4) 2>> {log} if [ -n "$latest_version" ]; then # Use existing JSON result as starting point old_json="data/interim/antismash/$latest_version/{wildcards.strains}/{wildcards.strains}.json" echo "Using existing JSON from $old_json as starting point..." >> {log} antismash_input="--reuse-result $old_json" else # No existing JSON result found, use genbank input echo "No existing JSON result found, starting AntiSMASH from scratch..." >> {log} antismash_input="{input.gbk}" fi # Run AntiSMASH antismash --genefinding-tool {params.genefinding} --output-dir {params.folder} \ --database {input.resources} \ --cb-general --cb-subclusters --cb-knownclusters -c {threads} $antismash_input --logfile {log} 2>> {log} # Check if the run failed due to changed detection results if grep -q "ValueError: Detection results have changed. No results can be reused" {log}; then # Use genbank input instead echo "Previous JSON result is invalid, starting AntiSMASH from scratch..." >> {log} antismash --genefinding-tool {params.genefinding} --output-dir {params.folder} \ --database {input.resources} \ --cb-general --cb-subclusters --cb-knownclusters -c {threads} {input.gbk} --logfile {log} 2>> {log} fi """ |
112 113 114 115 116 117 | shell: """ base_dir=$PWD mkdir {output.dir} (cd {output.dir} && for item in $(ls $base_dir/{input.dir}); do ln -s $base_dir/{input.dir}/$item $(basename $item); done) 2>> {log} """ |
15 16 17 18 19 | shell: """ mkdir -p data/interim/arts/antismash-{wildcards.version}/{wildcards.strains} python {params.resources}/artspipeline1.py {input.antismash} {params.ref} -rd data/interim/arts/antismash-{wildcards.version}/{wildcards.strains} -cpu {threads} -opt kres,phyl,duf -khmms {params.khmms} 2>> {log} """ |
31 32 33 34 | shell: """ python workflow/bgcflow/bgcflow/data/arts_extract.py {input.arts}/tables/bgctable.tsv {input.json} {output.json} {wildcards.strains} 2>> {log} """ |
50 51 52 53 54 55 56 57 58 | shell: """ TMPDIR="data/interim/tmp/{wildcards.name}/{wildcards.version}" mkdir -p $TMPDIR INPUT_JSON="$TMPDIR/df_arts.txt" echo '{input.json}' > $INPUT_JSON python workflow/bgcflow/bgcflow/data/arts_gather.py $INPUT_JSON {input.changelog} {output.table} 2>> {log} rm $INPUT_JSON """ |
12 13 14 15 16 17 18 19 20 21 | shell: """ set -e mkdir -p resources 2>> {log} wget {params.source}/archive/refs/tags/v{params.version}.zip -O resources/automlst-simplified-wrapper-v{params.version}.zip 2>> {log} (cd resources && unzip -o automlst-simplified-wrapper-v{params.version}.zip && rm automlst-simplified-wrapper-v{params.version}.zip) &>> {log} (cd resources/automlst-simplified-wrapper-{params.version} && unzip -o reducedcore.zip) &>> {log} cp -r resources/automlst-simplified-wrapper-{params.version}/* {output.folder}/. 2>> {log} rm -rf resources/automlst-simplified-wrapper-{params.version} 2>> {log} """ |
```
shell:
    """
    python workflow/bgcflow/bgcflow/features/prep_automlst.py {input.gbk} {output.auto_gbk} {wildcards.strains} 2>> {log}
    """
```
52 53 54 55 56 | shell: """ mkdir -p "data/interim/automlst_wrapper/{wildcards.name}/singles" 2>> {log} python resources/automlst-simplified-wrapper-main/simplified_wrapper.py data/interim/automlst_wrapper/{wildcards.name} {threads} &>> {log} """ |
```
shell:
    """
    python workflow/bgcflow/bgcflow/data/make_phylo_tree.py {params.automlst_interim} {output.automlst_processed} {params.prokka_interim} {params.gtdb_interim} 2>> {log}
    """
```
```
shell:
    """
    python workflow/bgcflow/bgcflow/data/get_bgc_counts.py {input.antismash} {wildcards.strains} {output.bgc_count} 2>> {log}
    """
```
```
shell:
    """
    python workflow/bgcflow/bgcflow/data/get_antismash_overview.py {input.antismash} {output.bgc_table} {wildcards.strains} 2>> {log}
    """
```
43 44 45 46 47 48 49 50 51 52 | shell: """ TMPDIR="data/interim/tmp/{wildcards.name}/{wildcards.version}" mkdir -p $TMPDIR INPUT_JSON="$TMPDIR/df_regions_antismash.txt" echo '{input.bgc_overview}' > $INPUT_JSON python workflow/bgcflow/bgcflow/data/get_antismash_overview_gather.py \ $INPUT_JSON {input.mapping_dir} {output.df_bgc} 2>> {log} rm $INPUT_JSON """ |
62 63 64 65 66 | shell: """ mkdir -p {output.mapping_dir} 2>> {log} cp {input.mapping_dir}/*/*-change_log.json {output.mapping_dir}/. 2>> {log} """ |
```
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}/{wildcards.version}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_bgc_counts.txt"
    echo '{input.bgc_count}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/make_genome_dataset.py $INPUT_JSON '{params.df_samples}' {output.df_antismash_summary} 2>> {log}
    rm $INPUT_JSON
    """
```
115 116 117 118 | shell: """ python workflow/bgcflow/bgcflow/data/get_dependencies.py {output} 2> {log} """ |
129 130 131 132 | shell: """ python workflow/bgcflow/bgcflow/data/get_project_metadata.py {wildcards.name} {output} {params.bgcflow_version} 2> {log} """ |
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 | shell: """ echo "Preparing BGCs for {wildcards.name} downstream analysis..." 2>> {log.general} #mkdir -p {output.outdir} 2>> {log.general} # Generate symlink for each regions in genomes in dataset for i in $(dirname {input.gbk}) do echo Processing $i 2>> {log.symlink} python workflow/bgcflow/bgcflow/data/bgc_downstream_prep.py $i {output.outdir} 2>> {log.symlink} done # generate taxonomic information for dataset python workflow/bgcflow/bgcflow/data/bigslice_prep.py {input.table} {output.taxonomy} 2>> {log.taxonomy} # append new dataset information ## check if previous dataset exists if [[ -s {params.dataset} ]] then echo "Previous dataset detected, appending dataset information for {wildcards.name}..." sed -i 'a {wildcards.name}_antismash_{wildcards.version}\t{wildcards.name}_antismash_{wildcards.version}\ttaxonomy/taxonomy_{wildcards.name}_antismash_{wildcards.version}.tsv\t{wildcards.name}' {params.dataset} 2>> {log.general} else echo "No previous dataset detected, generating dataset information for {wildcards.name}..." echo -e '# Dataset name\tPath to folder\tPath to taxonomy\tDescription' > {params.dataset} 2>> {log.general} sed -i 'a {wildcards.name}_antismash_{wildcards.version}\t{wildcards.name}_antismash_{wildcards.version}\ttaxonomy/taxonomy_{wildcards.name}_antismash_{wildcards.version}.tsv\t{wildcards.name}' {params.dataset} 2>> {log.general} fi # generate mapping for visualization python workflow/bgcflow/bgcflow/data/get_bigscape_mapping.py {output.outdir} {output.bgc_mapping} 2>> {log.general} """ |
10 11 12 13 14 15 16 | shell: """ (cd resources && wget https://github.com/medema-group/BiG-SCAPE/archive/refs/tags/v{params.release}.zip) &>> {log} (cd resources && unzip -o v{params.release}.zip && mv BiG-SCAPE-{params.release}/ BiG-SCAPE/ && rm v{params.release}.zip) &>> {log} (cd resources/BiG-SCAPE && wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.hmm.gz && gunzip Pfam-A.hmm.gz) &>> {log} (cd resources/BiG-SCAPE && hmmpress Pfam-A.hmm) &>> {log} """ |
31 32 33 34 35 36 | shell: """ (cd resources && wget https://dl.secondarymetabolites.org/mibig/mibig_json_{params.mibig_version}.tar.gz) &>> {log} (cd resources && tar -xvf mibig_json_{params.mibig_version}.tar.gz && mkdir -p mibig && mv mibig_json_{params.mibig_version}/ mibig/json && rm mibig_json_{params.mibig_version}.tar.gz) &>> {log} python workflow/bgcflow/bgcflow/data/get_mibig_data.py {output.mibig_json_folder} {output.mibig_bgc_table} 2>> {log} """ |
54 55 56 57 | shell: """ python {input.bigscape}/bigscape.py -i {input.antismash_dir} -o {params.bigscape_dir} -c {threads} --cutoff 0.3 0.4 0.5 --include_singletons --label {params.label} --hybrids-off --mibig --verbose &>> {log} """ |
76 77 78 79 | shell: """ python {input.bigscape}/bigscape.py -i {input.antismash_dir} -o {params.bigscape_dir} -c {threads} --cutoff 0.3 0.4 0.5 --include_singletons --label {params.label} --hybrids-off --verbose --no_classify --mix &>> {log} """ |
91 92 93 94 95 96 | shell: """ topdir=$PWD (cd data/interim/bigscape && zip -r $topdir/{output.zip} no_mibig_{wildcards.name}_antismash_{wildcards.version} \ -x {wildcards.name}_antismash_{wildcards.version}/cache/**\* &>> $topdir/{log}) """ |
115 116 117 118 119 | shell: """ python workflow/bgcflow/bgcflow/data/make_bigscape_to_cytoscape.py data/interim/bigscape/{wildcards.name}_antismash_{wildcards.version}/ {input.as_dir} {input.df_genomes_path} {input.mibig_bgc_table} {output.output_dir} 2>> {log} cp {input.bgc_mapping} {output.bgc_mapping} """ |
132 133 134 135 | shell: """ python workflow/bgcflow/bgcflow/data/bigscape_copy.py {input.index} data/processed/{wildcards.name}/bigscape/result_as{wildcards.version} 2>> {log} """ |
13 14 15 16 | shell: """ python workflow/bgcflow/bgcflow/data/bigslice_prep_single_project.py {input.dir} {input.taxonomy} {params.project_name} {output.folder} 2>> {log} """ |
33 34 35 36 | shell: """ bigslice -i {input.dir} {output.folder} --threshold {params.threshold} -t {threads} &>> {log} """ |
46 47 48 49 50 | shell: """ (cd resources/bigslice && wget -c -nc http://bioinformatics.nl/~kauts001/ltr/bigslice/paper_data/data/full_run_result.zip && unzip full_run_result.zip) &>> {log} rm resources/bigslice/full_run_result.zip &>> {log} """ |
70 71 72 73 74 75 | shell: """ TIMESTAMP=$(date --iso-8601=hours) bigslice --query {input.tmp_dir} --n_ranks {params.n_ranks} {input.bigslice_dir} -t {threads} --query_name {params.query_name}_$TIMESTAMP --run_id {params.run_id} &>> {log} python workflow/bgcflow/bgcflow/data/get_bigslice_query_result.py {params.query_name} {output.folder} {input.bigslice_dir} &>> {log} """ |
```
shell:
    """
    python workflow/bgcflow/bgcflow/data/summarize_bigslice_query.py {input.query_dir} {output.folder} {params.bigfam_db_path} {params.cutoff} 2>> {log}
    """
```
19 20 21 22 23 24 25 26 27 | shell: """ cblaster --version >> {output.version} diamond --version >> {output.version} cat {output.version} >> {log} cblaster config --email dummy@cblaster.com 2>> {log} cblaster makedb --cpus {threads} -b {params.batch_size} -n {params.db_prefix} {input.gbk} 2>> {log} cp -r {output.folder_interim} {output.folder_processed} 2>> {log} """ |
48 49 50 51 52 53 | shell: """ cblaster config --email dummy@cblaster.com cblaster makedb --cpus {threads} -b {params.batch_size} -n {params.db_prefix} {params.antismash_dir} 2>> {log} cp -r {output.folder_interim} {output.folder_processed} 2>> {log} """ |
8 9 10 11 12 13 | shell: """ (cd resources && wget -O checkm_data_2015_01_16.tar.gz https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz) &>> {log} (cd resources && mkdir -p checkm && tar -xvf checkm_data_2015_01_16.tar.gz -C checkm && rm checkm_data_2015_01_16.tar.gz) &>> {log} checkm data setRoot resources/checkm &>> {log} """ |
31 32 33 34 35 36 | shell: """ mkdir -p {output.fna} for f in {input.fna}; do cp $f {output.fna}/.; done checkm lineage_wf -t {threads} --reduced_tree -x fna {output.fna} {output.checkm_dir} &>> {log} """ |
54 55 56 57 58 | shell: """ mkdir -p {params.checkm_json} python workflow/bgcflow/bgcflow/data/get_checkm_data.py {input.stat} {params.checkm_json} {output.stat_processed} 2>> {log} """ |
10 11 12 13 14 | shell: """ python workflow/bgcflow/bgcflow/database/csv_to_parquet.py data/processed/{wildcards.name} 2>> {log} touch {output.parquet} """ |
9 10 11 12 | shell: """ git clone https://bitbucket.org/kaistsystemsbiology/deeptfactor.git {output.folder} 2>> {log} """ |
28 29 30 31 32 33 34 35 | shell: """ workdir=$PWD mkdir -p data/interim/deeptfactor/{wildcards.strains} 2>> {log} (cd {input.resource} && python tf_running.py \ -i $workdir/{input.faa} -o $workdir/{params.outdir} \ -g cpu -cpu {threads}) 2>> {log} """ |
47 48 49 50 | shell: """ python workflow/bgcflow/bgcflow/data/deeptfactor_scatter.py {input.deeptfactor} {output.deeptfactor_json} 2>> {log} """ |
65 66 67 68 69 70 71 72 73 | shell: """ TMPDIR="data/interim/tmp/{wildcards.name}" mkdir -p $TMPDIR INPUT_JSON="$TMPDIR/df_deeptfactor.txt" echo '{input.deeptfactor_json}' > $INPUT_JSON python workflow/bgcflow/bgcflow/data/deeptfactor_gather.py $INPUT_JSON {output.df_deeptfactor} 2>> {log} rm $INPUT_JSON """ |
15 16 17 18 19 20 | shell: """ cat {input.faa} > {output.faa_compiled} 2>> {log} diamond makedb --in {output.faa_compiled} -d {params.diamond} -p {threads} 2>> {log} cp {output.diamond_interim} {output.diamond_processed} 2>> {log} """ |
9 10 11 12 13 | shell: """ download_eggnog_data.py --data_dir {output.eggnog_db} -y &>> {log} create_dbs.py -m diamond --dbname bacteria --taxa Bacteria --data_dir {output.eggnog_db} -y &>> {log} """ |
28 29 30 31 32 | shell: """ mkdir -p {output.eggnog_dir} emapper.py -i {input.faa} --decorate_gff "yes" --excel --cpu {threads} -o {wildcards.strains} --output_dir {output.eggnog_dir} --data_dir {input.eggnog_db} &>> {log} """ |
13 14 15 16 17 18 19 20 | shell: """ for fna in {input.fna} do echo $fna >> {output.fastani_infile} done fastANI --ql {output.fastani_infile} --rl {output.fastani_infile} -t {threads} --matrix -o {output.fastani_out} 2>> {log} """ |
32 33 34 35 | shell: """ python workflow/bgcflow/bgcflow/data/convert_triangular_matrix.py {input.fastani_matrix} {output.df_fastani} 2>> {log} """ |
24 25 26 27 | shell: """ python workflow/bgcflow/bgcflow/data/gtdb_prep.py {wildcards.strains} {output.gtdb_json} '{params.samples_path}' '{params.gtdb_paths}' {params.version} 2> {log} """ |
47 48 49 50 51 52 53 54 55 | shell: """ TMPDIR="data/interim/tmp/{wildcards.name}" mkdir -p $TMPDIR INPUT_JSON="$TMPDIR/df_gtdb.txt" echo '{input.json_list}' > $INPUT_JSON python workflow/bgcflow/bgcflow/data/fix_gtdb_taxonomy.py $INPUT_JSON {output.meta} 2> {log} rm $INPUT_JSON """ |
37 38 39 40 41 | shell: """ (cd resources && wget https://data.gtdb.ecogenomic.org/releases/release{params.release_major}/{params.release_major}.{params.release_minor}/auxillary_files/gtdbtk_{params.release_version}_data.tar.gz -nc) 2>> {log} (cd resources && mkdir -p gtdbtk && tar -xvzf gtdbtk_{params.release_version}_data.tar.gz -C "gtdbtk" --strip 1 && rm gtdbtk_{params.release_version}_data.tar.gz) &>> {log} """ |
53 54 55 56 57 58 59 60 61 62 63 64 | shell: """ TMPDIR="data/interim/tmp/{wildcards.name}" mkdir -p $TMPDIR INPUT_FNA="$TMPDIR/df_fna_gtdbtk.txt" INPUT_JSON="$TMPDIR/df_json_gtdbtk.txt" echo '{input.fna}' > $INPUT_FNA echo '{input.json_list}' > $INPUT_JSON python workflow/bgcflow/bgcflow/data/gtdbtk_prep.py $INPUT_FNA $INPUT_JSON {output.fnadir} 2>> {log} rm $INPUT_FNA rm $INPUT_JSON """ |
83 84 85 86 87 88 | shell: """ mkdir -p {output.tmpdir} gtdbtk classify_wf --genome_dir {input.fnadir} --out_dir {output.gtdbtk_dir} --cpus {threads} --pplacer_cpus 1 --tmpdir {output.tmpdir} {params.ani_screen} &>> {log} cp {output.summary_interim} {output.summary_processed} """ |
12 13 14 15 16 17 18 19 | shell: """ for fna in {input.fna} do echo $fna >> {output.mash_infile} done (mash triangle -p {threads} -l {output.mash_infile} >> {output.triangle_dist}) 2>> {log} """ |
31 32 33 34 | shell: """ python workflow/bgcflow/bgcflow/data/convert_triangular_matrix.py {input.mash_matrix} {output.df_mash} 2>> {log} """ |
```
shell:
    """
    mlst --csv {input.fna} > {output.csv} 2>> {log}
    """
```
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | shell: """ if [[ {wildcards.ncbi} == GCF* ]] then source="refseq" elif [[ {wildcards.ncbi} == GCA* ]] then source="genbank" else echo "accession must start with GCA or GCF" >> {log} fi ncbi-genome-download -s $source -F fasta,assembly-report -A {wildcards.ncbi} -o data/raw/ncbi/download -P -N --verbose bacteria 2>> {log} gunzip -c data/raw/ncbi/download/$source/bacteria/{wildcards.ncbi}/*.fna.gz > {output.fna} cp data/raw/ncbi/download/$source/bacteria/{wildcards.ncbi}/*report.txt {output.assembly_report} rm -rf data/raw/ncbi/download/$source/bacteria/{wildcards.ncbi} python workflow/bgcflow/bgcflow/data/get_assembly_information.py {output.assembly_report} {output.json_report} {wildcards.ncbi} 2>> {log} """ |
48 49 50 51 52 53 54 55 56 57 | shell: """ TMPDIR="data/interim/tmp/{wildcards.name}" mkdir -p $TMPDIR INPUT_JSON="$TMPDIR/df_ncbi_meta_input.txt" echo '{input.all_json}' > $INPUT_JSON python workflow/bgcflow/bgcflow/data/extract_ncbi_information.py \ $INPUT_JSON {output.ncbi_meta_path} 2>> {log} rm $INPUT_JSON """ |
67 68 69 70 71 | shell: """ wget ftp://ftp.patricbrc.org/RELEASE_NOTES/genome_summary -O {output.patric_genome_summary} 2>> {log} wget ftp://ftp.patricbrc.org/RELEASE_NOTES/genome_metadata -O {output.patric_genome_metadata} 2>> {log} """ |
89 90 91 92 93 | shell: """ python workflow/bgcflow/bgcflow/data/extract_patric_meta.py \ {input.ncbi_meta_path} {input.patric_genome_summary} {input.patric_genome_metadata} {output.patric_meta_path} 2>> {log} """ |
12 13 14 15 | shell: """ (cd data/interim/fasta && wget -qN "ftp://ftp.patricbrc.org/genomes/{wildcards.patric}/{wildcards.patric}.fna" 2> ../../../{log}) """ |
11 12 13 14 15 | shell: """ ln -s $PWD/resources/RNAmmer/rnammer $CONDA_PREFIX/bin/rnammer 2>> {log} rnammer -S bac -m lsu,ssu,tsu -gff - resources/RNAmmer/example/ecoli.fsa >> {output} """ |
31 32 33 34 35 36 37 38 39 40 | shell: """ if [[ {input} == *.fna || {input} == *.fasta || {input} == *.fa ]] then cp {input} {output} 2>> {log} else echo "ERROR: Wrong Extension:" {input} >> {log} exit 1 fi """ |
```
shell:
    """
    python workflow/bgcflow/bgcflow/data/make_prokka_db.py {wildcards.prokka_db} \
        {input.table} {output.refgbff} 2>> {log}
    """
```
67 68 69 70 71 | shell: """ python workflow/bgcflow/bgcflow/data/get_organism_info.py {wildcards.strains} \ "{params.samples_path}" data/interim/assembly_report/ data/interim/prokka/ 2>> {log} """ |
105 106 107 108 109 110 111 112 | shell: """ prokka --outdir data/interim/prokka/{wildcards.strains} --force \ {params.refgbff} --prefix {wildcards.strains} --genus "`cut -d "," -f 1 {input.org_info}`" \ --species "`cut -d "," -f 2 {input.org_info}`" --strain "`cut -d "," -f 3 {input.org_info}`" \ --cdsrnaolap --cpus {threads} {params.rna_detection} --increment {params.increment} \ {params.use_pfam} --evalue {params.evalue} {input.fna} &> {log} """ |
125 126 127 128 129 | shell: """ python workflow/bgcflow/bgcflow/data/format_genbank_meta.py {input.gbk_prokka} \ {params.version} {input.gtdb_json} {wildcards.strains} {output.gbk_processed} 2> {log} """ |
144 145 146 147 148 149 | shell: """ cp {input.gbk} {output.gbk} 2>> {log} cp {input.summary} {output.summary} 2>> {log} cp {input.tsv} {output.tsv} 2>> {log} """ |
13 14 15 16 17 | shell: """ refseq_masher matches --output-type {params.output_type} --top-n-results \ {params.top_n_results} --output {output.masher_csv} {input.fasta} 2>> {log} """ |
```
shell:
    """
    roary -p {threads} -f {output.roary_dir} -i {params.i} -g {params.g} -e -n -r -v {input.gff} &>> {log}
    """
```
34 35 36 37 38 | shell: """ mkdir -p {output.eggnog_dir} emapper.py -i {params.faa} --translate --itype "CDS" --excel --cpu {threads} -o {wildcards.name} --output_dir {output.eggnog_dir} --data_dir {input.eggnog_db} &>> {log} """ |
55 56 57 58 59 60 | shell: """ (cd {input.resource} && python tf_running.py \ -i {params.faa} -o {params.outdir} \ -g cpu -cpu {threads}) 2>> {log} """ |
77 78 79 80 81 | shell: """ diamond makedb --in {params.faa} -d {output.diamond_interim} -p {threads} &>> {log} cp {output.diamond_interim} {output.diamond_processed} &>> {log} """ |
```
shell:
    """
    python workflow/bgcflow/bgcflow/data/make_pangenome_dataset.py {input.roary_interim_dir} {output.roary_processed_dir} {input.automlst_processed_dir} 2>> {log}
    """
```
```
shell:
    """
    seqfu stats {input.fna} --json -b --gc --precision 3 > {output.json} 2>> {log}
    """
```
```
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_seqfu.txt"
    echo '{input.json}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/make_seqfu_table.py $INPUT_JSON {output.all_csv} &>> {log}
    rm $INPUT_JSON
    """
```
Support
- Future updates
Related Workflows