Large comparative genomic analysis tool for the prediction of lifestyle-associated genes
microLife is a streamlined computational workflow that annotates bacterial genomes and performs large-scale comparative genomics to predict bacterial lifestyles, to pinpoint candidate genes associated with each detected lifestyle, termed lifestyle-associated genes (LAGs), and to identify the biosynthetic gene clusters linked to each lifestyle. The whole process is divided into three modules:
- Clustering module: predicts, clusters and annotates the genes of every input genome.
- Lifestyle prediction: employs a machine learning model to predict bacterial lifestyle or other user-specified metadata.
- Analytical module (Shiny app): embeds the results from the previous modules in a user-friendly interface for comprehensive and interactive comparative genomics. An example of the app with a demo dataset (the genomes present in data/) is available at http://178.128.251.24:3838/microLife_linux
How to start using microLife
Download microLife and install dependencies
git clone https://github.com/Carrion-lab/microLife.git microLife
cd microLife/
Install and activate the conda environment needed to download files from NCBI, using the file ENVS/MicroLife_download.yml:
mamba env create -f ENVS/MicroLife_download.yml
conda activate MicroLife_download
Download genomes and reduce redundancy
Download the accession names
Retrieve a list of accession IDs you are interested in processing with microLife (e.g., all Mycobacterium RefSeq genomes) by extracting the search query from the NCBI website and integrating it into the following command:
esearch -db assembly -query '("Mycobacterium"[Organism] OR Mycobacterium[All Fields]) AND ("latest refseq"[filter] AND all[filter] NOT anomalous[filter])' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession > NCBI_accs.txt
Download and uncompress the FASTA files
mkdir genomes/
cd genomes/
bit-dl-ncbi-assemblies -w ../NCBI_accs.txt -f fasta -j 10
gzip -d ./*.gz
Download metadata and remove redundancy at >99% Average Nucleotide Identity (ANI)
cd ../
python main_script_download.py
Making sure everything is ready
Before starting microLife, you should have the following in your main directory (microLife/download/); a quick sanity check is sketched after the list:
- "genomes_renamed/": directory where the non-redundant genomes, with filenames ready for microLife, are stored.
- "METADATA_MERGED.txt": file with the following columns: 'scientific_name', 'NCBIid', 'cluster_membership', 'representative' and 'MicroLife_name'.
Remember to deactivate the MicroLife_download conda environment if you want to jump straight to the next section ("Executing clustering module") right after this one!
Executing clustering module
Install dependencies necessary for MicroLife
mamba env create -f ENVS/MicroLife_environment.yml
mamba env create -f ENVS/antismash.yml
mamba env create -f ENVS/bigscape.yml
conda activate MicroLife_environment
Prepare input files
The input must be FASTA assemblies stored in the data/ directory, all with filenames following the format "Genus_species_strain_O.fna".* If the genomes were downloaded as described in the previous section, 'Download genomes and reduce redundancy', the download/genomes_renamed/ directory already contains the genome files in the correct name format, and they only have to be moved to the data/ folder:
mv download/genomes_renamed/* data/
To run microLife with your own genomes, they should follow the format "Genus_species_strain_O.fna"* and be stored in data/. You must then use the script src/rename_genomes.R to convert the input genome strain names into a MicroLife-friendly format:

Rscript src/rename_genomes.R data/ names_equivalence.txt

This script renames the input genome files by replacing the strain name with a barcode to avoid microLife crashes, e.g. Pseudomonas_fluorescens_FDAARGOS-1088.fna --> Pseudomonas_fluorescens_X0001.fna. The file names_equivalence.txt contains the corresponding name matching and needs to be present in the main directory before running microLife (an illustrative entry is shown after the footnote below). If you downloaded your genomes as described above, names_equivalence.txt is already present in the main folder.
*In the genome file name, the user must avoid any character that is not a letter, a number, a dash or a dot. This applies especially to the strain name, which should also be fewer than 10 characters long.
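For reference, names_equivalence.txt is a simple two-column table mapping the original file names to the barcoded ones (the column names Full_name and MicroLife_name below are those read by src/rename_MEGAMATRIX.R; the entry itself is illustrative):

Full_name	MicroLife_name
Pseudomonas_fluorescens_FDAARGOS-1088_O.fna	Pseudomonas_fluorescens_X0001_O.fna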
Executing Snakemake
microLife is written with the Snakemake workflow manager and can be executed from the main directory with the following command:
snakemake -j 24
-j specifies the number of CPU cores to use for the whole process. If this flag is omitted, Snakemake will use all available CPU cores on the machine.
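If you want to preview what will run before launching the full workflow, Snakemake's standard dry-run flag works here as anywhere (generic Snakemake usage, not a microLife-specific option):

snakemake -n -j 24    # -n prints the planned jobs without executing anything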
Generate metadata
To be able to run the Lifestyle prediction module and the Shiny app (the next modules of the microLife pipeline), you have to edit the mapping_file.txt file, which is generated at the end of the clustering module. It is a two-column tab-separated file whose first column gives the genome name (as established in the data/ directory) and whose second column gives the lifestyle of that genome. Every genome lifestyle is annotated as 'Unknown' when the file is generated; users have to manually annotate the genomes with their information of interest, and genomes with no lifestyle information must be left as "Unknown". Note that additional columns with other metadata can be added to the file. An illustrative example follows.
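An edited mapping_file.txt might look like this (tab-separated; the Sample and Lifestyle headers match those written by src/rename_MEGAMATRIX.R, while the genome names and lifestyle labels here are hypothetical):

Sample	Lifestyle
Pseudomonas_fluorescens_FDAARGOS-1088	Plant_associated
Mycobacterium_tuberculosis_H37Rv	Pathogen
Mycobacterium_smegmatis_MC2155	Unknown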
microLife outputs
Two primary outputs, MEGAMATRIX_renamed.txt and big_scape_binary_table_renamed.txt, are generated in the main folder. MEGAMATRIX_renamed.txt is a matrix where each row represents a gene cluster and each column represents the presence of that cluster in the different genomes, together with annotations from several databases. Similarly, big_scape_binary_table_renamed.txt is a matrix where each row represents a Gene Cluster Family (GCF), i.e. a cluster of Biosynthetic Gene Clusters (BGCs), and each column represents the presence of these families in the different genomes. A schematic sketch of MEGAMATRIX_renamed.txt is shown below.
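Schematically, MEGAMATRIX_renamed.txt is laid out roughly as follows (a hypothetical sketch: the cluster IDs and gene IDs follow the formats produced by src/MCL_merge.R and src/rename_proteins.py, but every value shown is invented and the real file carries further annotation columns):

clusters         Genome_A   Genome_B   ...   gene              cog_description   ...
cluster_000001   1          0          ...   Genome_A|000123   ABC transporter   ...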
Intermediate outputs are stored in the intermediate_files/ directory. Within this folder, the annot/ subdirectory contains the Prokka outputs for each genome, including annotation files such as .gbk, .faa, .gff and .ffn, among others. Similarly, the antismash/ folder contains the antiSMASH results for each genome.
Lifestyle prediction module
The user can test the predictability of their metadata with a random forest machine learning model using the script src/classifier.R, specifying the locations of mapping_file.txt and MEGAMATRIX_renamed.txt as well as the column name of the metadata variable to be predicted:
Rscript classifier_src/classifier.R mapping_file.txt MEGAMATRIX_renamed.txt Lifestyle
ROC plots showing the model evaluation for the different classes present in the metadata are generated and stored in classifier/. If good accuracy is achieved, the user can use the model to predict the genomes labeled as 'Unknown' in the mapping file. The resulting augmented metadata is stored in the new file mapping_file_augmented.txt.
microLife App (Analytical module)
To initiate the Shiny app, the following Snakemake output files need to be dropped into the Shiny_app/input/ directory (a copy-command sketch follows the list):
- MEGAMATRIX_renamed.txt
- mapping_file.txt or mapping_file_augmented.txt
- intermediate_files/BiG-SCAPE/big_scape_binary_table_renamed.txt
- intermediate_files/BiG-SCAPE/annotation.txt
- intermediate_files/combined_proteins/combined_proteins.fasta
- names_equivalence.txt
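A minimal copy-command sketch, assuming the default output locations listed above and the non-augmented mapping file:

mkdir -p Shiny_app/input/
cp MEGAMATRIX_renamed.txt mapping_file.txt names_equivalence.txt Shiny_app/input/
cp intermediate_files/BiG-SCAPE/big_scape_binary_table_renamed.txt intermediate_files/BiG-SCAPE/annotation.txt Shiny_app/input/
cp intermediate_files/combined_proteins/combined_proteins.fasta Shiny_app/input/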
After preparing these input files, users should move the working directory to Shiny_app/, open the script app.R in RStudio, and click the 'Run App' button in the upper-right corner to initiate the microLife app and enjoy visualizing all their results! Note that users may move the Shiny_app/ folder and all its content to a local computer and launch RStudio locally.
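If RStudio is not available, the app can in principle be launched from a terminal instead (a sketch, assuming R and the shiny package are installed; this command is not part of the original instructions):

R -e "shiny::runApp('Shiny_app')"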
Code Snippets
The following blocks are excerpts from the microLife Snakemake rules and the helper scripts in src/, reproduced for reference.
run:
    shell("mkdir -p {params}")
    shell("touch .mkdir.chkpnt")
run:
    shell('prokka --force --outdir {params.outdir} --prefix {params.prefix} --locustag {params.str} --addgenes --increment 5 --centre NIOO-KNAW --genus {params.genus} --species {params.species} --str {params.str} --gcode 11 --cpus 20 --evalue 1e-03 --rfam {input.file}')
shell:
    'python ./src/genbank_to_protein_fasta.py --genbank {input} --fasta {output.faa}'
run:
    shell('cat {input.faa} > {output.orig}')
    shell('cat {input.tags} > {output.tags}')
run:
    shell('python ./src/rename_proteins.py -f {input.orig} -t {input.tags} -o {output.combined_proteins} -n {output.newtagfile}')
run:
    # Extract gene lengths
    shell("bioawk -c fastx '{{ print $name, length($seq) }}' < intermediate_files/combined_proteins/combined_proteins.fasta > intermediate_files/combined_proteins/length_genes.txt")
    # Run mmseqs2 clustering at 0.95 sequence identity
    shell('mmseqs createdb {input} {params.mmseq_db}')
    shell('mmseqs cluster {params.mmseq_db} {params.mmseq_db_clu} {params.mmseq_temp} --min-seq-id 0.95 --cov-mode 0 -c 0.8')
    shell('mmseqs createtsv {params.mmseq_db} {params.mmseq_db} {params.mmseq_db_clu} {params.mmseq_tsv}')
    # Extract mmseqs2 representatives
    shell('mmseqs createsubdb {params.mmseq_db_clu} {params.mmseq_db} {params.mmseq_rep}')
    shell('mmseqs convert2fasta {params.mmseq_rep} {params.mmseq_fasta}')
    # Run DIAMOND all-vs-all on the representatives
    shell('diamond makedb --in {params.mmseq_fasta} -d {params.diamond_db}')
    shell('diamond blastp -b 10 -p {THREADS} -c 1 -e 0.00001 --un {params.diamond_unaligned} -k 5000 -d {params.diamond_db} -q {params.mmseq_fasta} -o {params.diamond_results} --outfmt 6 qseqid sseqid pident bitscore')
    shell("awk -F'\t' '$3>20' {params.diamond_results} > {params.diamond_results_filtered}")
    shell("cut -f1,2,4 {params.diamond_results_filtered} > {params.diamond_results_clean}")
    # Get unaligned FASTA headers
    shell('grep "^>" {params.diamond_unaligned} > {params.diamond_unaligned_headers}')
    # Run MCL
    shell('mcxload -abc {params.diamond_results_clean} --stream-mirror -o {params.mcl_data} -write-tab {params.mcl_tab}')
    shell('mcl {params.mcl_data} -I 3.0 -use-tab {params.mcl_tab}')
    shell('mv out.mcl_data.mci.I30 intermediate_files/clustering/')
    # Generate matrix
    shell('Rscript src/MCL_merge.R {params.mmseq_tsv} {params.mcl_clusters} {params.diamond_unaligned_headers} {output.binary_table} {input} {output.fasta}')
run:
    shell("grep '^>' {input.faa} > {params.headers}")
    shell("Rscript ./src/extract_prokka.R {input.id2tag} {params.headers} {output}")
run:
    shell('wget -P ./intermediate_files/PFAM/ ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/Pfam-A.hmm.gz')
    shell('gunzip intermediate_files/PFAM/Pfam-A.hmm.gz')
    shell('hmmpress intermediate_files/PFAM/Pfam-A.hmm')
    shell('hmmsearch --tblout {output.pfam} --cpu 30 -E 1e-5 ./intermediate_files/PFAM/Pfam-A.hmm {input}')
run:
    shell('download_eggnog_data.py -y --data_dir {params.data}')
    shell('emapper.py -i {input} --cpu 25 -o intermediate_files/eggnog_annotation/eggnog_annotation --data_dir {params.data} --pident 30 --query_cover 50 --subject_cover 50 --report_orthologs')
run:
    shell("sed '/^#/d' {input} > {output.clean_annotation}")
    shell('Rscript src/COG_KEGG_annotations.R {output.clean_annotation} {params.cog_descriptions} {params.kegg_descriptions} {output.cog} {output.kegg}')
    shell("sed '/^ko/d' {output.kegg} > {output.kegg_clean}")
run:
    shell('wget -P ./intermediate_files/DBCAN/ http://bcb.unl.edu/dbCAN2/download/dbCAN-HMMdb-V9.txt')
    shell('hmmsearch --tblout {output} -E 1e-5 ./intermediate_files/DBCAN/dbCAN-HMMdb-V9.txt {input}')
run:
    shell("Rscript ./src/Process_hmm_annotation.R {input.dbcan} {input.pfam} {params.dbcan_family} {params.pfam_family} {output}")
run:
    shell("set -euo pipefail; Rscript ./src/Process_annotations.R {input.matrix} {input.cog} {input.kegg} {input.hmm_annotation} {input.prokka} {output}")
shell:
    'set +u; source /opt/miniconda3/etc/profile.d/conda.sh; conda activate antismash_microlife; set -u; antismash --cb-general --cb-knownclusters --cb-subclusters --output-dir {params.out_dir} -c {params.threads} --asf --pfam2go --genefinding-tool prodigal --smcog-trees {input}'
run:
    shell('git clone https://github.com/medema-group/BiG-SCAPE.git')
    shell('mv BiG-SCAPE intermediate_files/')
    shell('rm intermediate_files/BiG-SCAPE/bigscape.py')
    shell('cp src/bigscape.py intermediate_files/BiG-SCAPE/')
shell:
    "set +u; source /opt/miniconda3/etc/profile.d/conda.sh; conda activate bigscape_microlife; set -u; python ./intermediate_files/BiG-SCAPE/bigscape.py -i {params.indir} -o {params.outdir} --pfam_dir intermediate_files/PFAM/ --mode glocal --mibig --cutoffs 0.3 0.7 --include_singletons --cores {params.threads} --mix"
run:
    shell("Rscript src/I-BIGSCAPE_revision.R {input.clustering} {input.network} {input.annotations} intermediate_files/BiG-SCAPE/mix_filtered.network intermediate_files/BiG-SCAPE/GCF_annotation.txt {output.annotations}")
    shell("python src/II_Absence_Presence.py intermediate_files/BiG-SCAPE/GCF_annotation.txt intermediate_files/BiG-SCAPE/abs_pres_table.csv")
    shell("Rscript src/III_Absence_Presence_GCF.R intermediate_files/BiG-SCAPE/abs_pres_table.csv {output.binary_matrix}")
run:
    shell('Rscript src/rename_MEGAMATRIX.R {input.genes} {input.BGCs} {params} {output.genes} {output.BGCs} {output.mapping_file}')
args = commandArgs(trailingOnly=TRUE)
## args[1] is the annotation file from emapper
## args[2] is the cog descriptions
## args[3] is the kegg descriptions
## args[4] is the cog output
## args[5] is the kegg output

annotation <- read.delim2(args[1], header = F)
annotation <- annotation[,c(1,5,12)]
colnames(annotation) <- c('id', 'cogid', 'keggid')

# Extract COGs
cog_annotation <- annotation[,c(1,2)]
keep <- grep("COG", cog_annotation$cogid)
cog_annotation <- cog_annotation[keep,]
cog_annotation$cogid <- sapply(strsplit(cog_annotation$cogid,"@"), `[`, 1)
keep <- grep("ar", cog_annotation$cogid, invert = T)
cog_annotation <- cog_annotation[keep,]

## Load COG descriptions
cog_extended_annotations <- read.csv(args[2], header = F)
cog_extended_annotations <- cog_extended_annotations[,c(1,3)]
colnames(cog_extended_annotations)[1] <- 'id'
colnames(cog_extended_annotations)[2] <- 'cog_description'
cog_annotation <- merge(cog_annotation, cog_extended_annotations, by.x = 'cogid', by.y = 'id', all.x = T)
cog_annotation <- cog_annotation[,c(2,1,3)]
colnames(cog_annotation)[1] <- 'gene'
write.table(cog_annotation, args[4], row.names = F)

# Extract KEGGs
kegg_annotation <- annotation[,c(1,3)]
keep <- grep("ko", kegg_annotation$keggid)
kegg_annotation <- kegg_annotation[keep,]
kegg_annotation$keggid <- sapply(strsplit(kegg_annotation$keggid,","), `[`, 1)
kegg_annotation$keggid <- sapply(strsplit(kegg_annotation$keggid,":"), `[`, 2)

## KO descriptions
ko_descriptions <- read.delim2(args[3], header = F)
ko_descriptions$V1 <- sapply(strsplit(ko_descriptions$V1,":"), `[`, 2)
colnames(ko_descriptions) <- c('keggid', 'kegg_description')
kegg_annotation <- merge(kegg_annotation, ko_descriptions, by = 'keggid', all.x = T)
kegg_annotation <- kegg_annotation[,c(2,1,3)]
colnames(kegg_annotation)[1] <- 'gene'
write.table(kegg_annotation, args[5], row.names = F)
library(stringr)
library(dplyr)

args = commandArgs(trailingOnly=TRUE)
# args[1] is the id2tags input
# args[2] is the prokka_headers.txt input
# args[3] is the output

id2tags <- read.delim2(args[1], header = F)
id2tags <- id2tags[,1:2]
colnames(id2tags) <- c('geneid', 'gene')

prokka_header <- read.delim(args[2], header = F)
prokka_header$prokkadescription <- prokka_header$V1
prokka_header$remove <- sapply(strsplit(prokka_header$V1," "), `[`, 1)
prokka_header$prokkadescription <- str_remove(prokka_header$prokkadescription, prokka_header$remove)
prokka_header$prokkadescription <- sub('.', '', prokka_header$prokkadescription)
prokka_header$remove <- sub('.', '', prokka_header$remove)
prokka_header <- prokka_header[,c(3,2)]
colnames(prokka_header)[1] <- 'gene'

## Merge id2tags with prokka headers
prokka_annotation <- merge(id2tags, prokka_header, by = 'gene', all.x = T)
prokka_annotation <- prokka_annotation[,2:3]
prokka_annotation <- prokka_annotation %>% distinct(geneid, .keep_all = TRUE)

write.table(prokka_annotation, args[3], row.names = F)
import argparse, os
import Bio
from Bio import SeqIO


def genbank2faa(infile, outfile):
    seq = SeqIO.parse(infile, 'genbank')
    fastaseq = []
    if not os.path.exists(outfile):
        print("Converting :%s" % (infile))
        # partial CDS counter
        partial = 0
        id_2_tags = "%s.tags" % (outfile)
        tagfile = open(id_2_tags, 'w')
        for s in seq:
            features = s.features
            for f in features:
                if f.type == "source":
                    sourcefeature = f
            if 'organism' in sourcefeature.qualifiers:
                organism = sourcefeature.qualifiers['organism'][0]
                strain = ''
                if 'strain' in sourcefeature.qualifiers:
                    strain = sourcefeature.qualifiers['strain'][0]
                    strain = strain.replace(' ','').replace(':','').replace('|','')
                organism = organism + '_' + strain
            #else:
            #    organism = s.annotations['organism'].replace(' ','_').replace(':','_').replace('|','_')
            organism = organism.replace(' ','_').replace(':','_').replace('|','_')
            ltprefix = estimate_locustag_prefix(features)
            ltprefixnew = "%s_" % (ltprefix)
            moltype = "C"
            for f in features:
                if f.type == "source":
                    if "plasmid" in f.qualifiers:
                        moltype = "P"
            cdscount = 0
            for f in features:
                if f.type == "CDS":
                    cdscount = cdscount + 1
                    if 'locus_tag' in f.qualifiers:
                        locus_tags = f.qualifiers['locus_tag']
                        locus_tag = locus_tags[-1]
                        if not '_' in locus_tag:
                            locus_tag = locus_tag.replace(ltprefix, ltprefixnew)
                        if not isinstance(f.location.start, Bio.SeqFeature.ExactPosition):
                            # location is not exact
                            partial = partial + 1
                            locus_tag = "%s_%02d" % (locus_tag, partial)
                        prot = f.extract(s)
                        prot.seq = prot.seq.translate(table=11, to_stop=True)
                        prot.id = locus_tag
                        tagfile.write(locus_tag)
                        tagfile.write("\t")
                        tagfile.write(moltype)
                        tagfile.write("\t")
                        tagfile.write(organism)
                        tagfile.write("\t")
                        tagfile.write(prot.description)
                        tagfile.write("\n")
                        if 'product' in f.qualifiers:
                            prot.description = f.qualifiers['product'][0]
                        else:
                            prot.description = 'unknown'
                        fastaseq.append(prot)
            print("Written :%s sequences from %s" % (len(fastaseq), s.description))
        if partial > 0:
            print("Warning sequence %s contains %d partial CDS sequences. This might inflate the number of clusters." % (infile, partial))
        SeqIO.write(fastaseq, outfile, 'fasta')
    else:
        print("Not Converting :%s" % (infile))
        print("%s exists" % (outfile))


def genbank2ffn(infile, outfile):
    seq = SeqIO.parse(infile, 'genbank')
    fastaseq = []
    print("Converting :%s" % (infile))
    # partial CDS counter
    partial = 0
    for s in seq:
        features = s.features
        for f in features:
            if f.type == "gene":
                if 'locus_tag' in f.qualifiers:
                    locus_tags = f.qualifiers['locus_tag']
                    locus_tag = locus_tags[-1]
                    if not isinstance(f.location.start, Bio.SeqFeature.ExactPosition):
                        # location is not exact
                        partial = partial + 1
                        locus_tag = "%s_%02d" % (locus_tag, partial)
                    gene = f.extract(s)
                    gene.id = locus_tag
                    if 'gene' in f.qualifiers:
                        gene.description = f.qualifiers['gene'][0]
                    else:
                        gene.description = gene.id
                    fastaseq.append(gene)
        print("Written :%s sequences from %s" % (len(fastaseq), s.description))
    if partial > 0:
        print("Warning sequence %s contains %d partial CDS sequences. This might inflate the number of clusters." % (infile, partial))
    SeqIO.write(fastaseq, outfile, 'fasta')


def estimate_locustag_prefix(features):
    strings = [f.qualifiers['locus_tag'][-1] for f in features if 'locus_tag' in f.qualifiers]
    """ Find the longest string that is a prefix of all the strings. """
    if not strings:
        return ''
    prefix = strings[0]
    for s in strings:
        if len(s) < len(prefix):
            prefix = prefix[:len(s)]
        if not prefix:
            return ''
        for i in range(len(prefix)):
            if prefix[i] != s[i]:
                prefix = prefix[:i]
                break
    return prefix


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-g', '--genbank', dest = 'genbank', help = 'genbank input file', required = True)
    parser.add_argument('-f', '--fasta', dest = 'fasta', help = 'fasta CDS output file', required = True)
    args = parser.parse_args()
    genbank2faa(args.genbank, args.fasta)
'R Script for Data Mining and Manipulation of Output Tables from BiG-SCAPE'

'Libraries to be Imported'
library(plyr)
library(dplyr)
library(stats)
library(tidyverse)
library(ggplot2)

args = commandArgs(trailingOnly=TRUE)

"Import Data Tables from BiG-SCAPE"
# Clustering tables
clustering_df_mix <- read.delim(args[1], header = TRUE)  # clustering dataframe
# Network tables
network_df <- read.delim(args[2], header = TRUE)  # general network dataframe
network_df <- network_df[-c(4:11)]  # removal of unwanted columns
# Annotation tables
annotations_df_full <- read.delim(args[3], header = TRUE)  # annotations dataframe

'Merge Files for Absence Presence Table'
# Merge clustering and annotations
annotations_merged_df_mix <- merge(annotations_df_full, clustering_df_mix, by.x = "BGC", by.y = "X.BGC.Name")
annotations_merged_df_mix <- annotations_merged_df_mix[, c(1,8,6,5)]  # removal of unwanted columns
colnames(annotations_merged_df_mix) <- c('BGC Name', 'GCF No', 'Organism', 'BGC Class')

'Extract Rows based on Logical Criteria (User Based Criteria) (Optional by User)'
network_df__distance_filtered <- network_df %>% filter(Raw.distance > 0.10)
network_df_bgc_filtered <- network_df[!grepl("BGC", network_df$Clustername.1),]

# Export tables as tab-separated values
write.table(network_df__distance_filtered, file = args[4], sep = "\t", row.names = FALSE, quote = FALSE)
write.table(annotations_merged_df_mix, file = args[5], sep = '\t', row.names = FALSE, quote = FALSE)

### Obtain GCF annotations list
merged_annotations <- annotations_merged_df_mix
merged_annotations$GCF.No <- paste0('GCF', merged_annotations$GCF.No)
merged_annotations <- merged_annotations[,c(2,4)]
merged_annotations <- distinct(merged_annotations)
write.csv(merged_annotations, args[6], row.names = F)
'Python Script used to create an absence presence table of GCFs per Strain from BiG-SCAPE file'

# Import libraries
import pandas as pd
import sys
import os
from string import digits

# Import dataframe
df1 = pd.read_csv(sys.argv[1], sep="\t")
df1.columns = ['BGC Name', 'GCF No', 'Organism', 'BGC Class']

'Method for Creating Absence Presence Table'
# Dataframe copies
df2 = df1.copy()  # copy of the original dataframe
df2.columns = ['BGC Name', 'GCF No', 'Organism', 'BGC Class']

# Dataframe manipulation
df2['GCF No'] = df1['BGC Class'].astype(str) + 'GCF' + df2['GCF No'].astype(str)
df1[['BGC','BGC2']] = df1['BGC Name'].str.split('_', n=1, expand=True)
df2['Genome'] = df1['Organism'].astype(str) + df1['BGC Name'].astype(str)
df2.set_index('GCF No')
df2 = df2[['GCF No','Genome']]
df2['Genome'] = df2['Genome'].str.replace(' ', '_').str.replace('Unclassified','_').str.replace('__','_').str.replace('.','')

path = os.path.join(os.path.abspath(os.getcwd()), 'intermediate_files/BiG-SCAPE')
df2.to_csv(os.path.join(path, r'abs_pres_table.csv'))
args = commandArgs(trailingOnly=TRUE)

'R Script for Creation of BGC absence presence per genome from Absence_Presence Table from BiG-SCAPE'

# -----------------------------------------------------------------------------
# Absence Presence Matrix Creation
# Abs_Pres table created with the Python script above
# -----------------------------------------------------------------------------

'Clustering Matrix With BGC from MIBIG'
abs_pres_matrix <- read.delim(args[1], header = TRUE, sep = ",")
abs_pres_matrix <- abs_pres_matrix[, c(2,3)]  # removal of first column

# Modify GCF.No and Genome columns
abs_pres_matrix <- abs_pres_matrix[!grepl("BGC", abs_pres_matrix$Genome),]
abs_pres_matrix$Genome <- gsub("BGC", "_BGC", abs_pres_matrix$Genome)
abs_pres_matrix$GCF.No <- gsub("[a-zA-Z ]", "", abs_pres_matrix$GCF.No)  # removes BiG-SCAPE class from GCF column
abs_pres_matrix$GCF.No <- gsub("-_", "", abs_pres_matrix$GCF.No)
abs_pres_matrix$GCF.No <- sub("^", "GCF", abs_pres_matrix$GCF.No)  # appends GCF to the front of the number

# Separating columns for renaming
abs_pres_matrix$genome2 <- sapply(strsplit(as.character(abs_pres_matrix$Genome),"_"), `[`, 1)
abs_pres_matrix$genome3 <- sapply(strsplit(as.character(abs_pres_matrix$Genome),"_"), `[`, 2)
abs_pres_matrix$genome4 <- sapply(strsplit(as.character(abs_pres_matrix$Genome),"_"), `[`, 3)
abs_pres_matrix$BGC.Region <- sapply(strsplit(as.character(abs_pres_matrix$Genome),"_"), `[`, 4)

# Merging previous columns
abs_pres_matrix$Genome <- paste0(abs_pres_matrix$genome2, "_", abs_pres_matrix$genome3, "_", abs_pres_matrix$genome4)

# Removal of final columns
abs_pres_matrix <- abs_pres_matrix[, -c(3,4,5,6)]

binary_table <- table(abs_pres_matrix)
binary_table[binary_table > 0] <- 1
binary_table <- as.data.frame.matrix(binary_table)

columns <- colnames(binary_table)
new_columns <- paste0(sapply(strsplit(columns,"_"), `[`, 1), '_', sapply(strsplit(columns,"_"), `[`, 2), '_', sapply(strsplit(columns,"_"), `[`, 3))
colnames(binary_table) <- new_columns

write.table(binary_table, args[2])
library(reshape2)
library(tidyverse)
library(stringr)
library(dplyr)
library(readr)
library(Biostrings)
library(data.table)
library(parallel)

cl <- makeCluster(30)

args = commandArgs(trailingOnly=TRUE)

tsv = data.table(read_table(args[1], col_names = F))
colnames(tsv) <- c('V1', 'V2')
n_clusters_mmseq = length(unique(tsv$V1))
tsv2 <- tsv[, .(V2 = str_c(V2, collapse = ',')), by = V1]

'Open MCL output, collapse it and clean it\n'
mcl <- data.table(read_csv(args[2], col_names = F))

# Load unaligned headers
unalin <- read.delim2(args[3], header = F)
unalin <- data.frame(X1 = str_remove_all(unalin$V1, '>'))
mcl <- dplyr::bind_rows(mcl, unalin)

# Name clusters
mcl$clusters <- seq(nrow(mcl))
mcl$clusters <- sprintf("cluster_%06d", mcl$clusters)
mcl_melt <- mcl[, c(X1 = strsplit(X1, "\t")), by = clusters]

'Join mmseq2 clusters with MCL clusters\n'
M <- merge(mcl_melt, tsv2, by.x = 'X1', by.y = 'V1')
M <- M[,c(2,3)]
colnames(M)[2] <- 'descriptions'
M <- data.table(M)
M <- M[, .(descriptions = str_c(descriptions, collapse = ',')), by = clusters]
M <- M[order(clusters)]
M$gene <- sapply(strsplit(M$descriptions,","), `[`, 1)

'Create 1/0 matrix\n'
bacteria_name <- function(string){
  b <- stringr::str_split(string, ',')[[1]]
  c <- sapply(stringr::str_split(b,"[|]"), `[`, 1)
  d <- stringr::str_c(c, collapse = ",")
  return(d)
}

MEGAMATRIX <- M
MEGAMATRIX$descriptions <- parSapply(cl, MEGAMATRIX$descriptions, FUN = bacteria_name)
MEGAMATRIX$gene <- NULL

tbl <- table(splitstackshape:::cSplit(MEGAMATRIX, "descriptions", sep = ",", direction = "long"))
tbl <- data.table(as.data.frame(tbl))
tbl <- dcast.data.table(tbl, clusters ~ descriptions, value.var = 'Freq')

binary_matrix <- merge(tbl, M, by = 'clusters')

#### Extract sequence of representative proteins
gene_lengths <- read.table('intermediate_files/combined_proteins/length_genes.txt', header = F)

## Function to extract the longest gene
extract_long_genes <- function(character, gene_length){
  a <- as.character(stringr::str_split(character, ',')[[1]])
  sub <- data.table::data.table(gene_length[gene_length$V1 %in% a,])
  longest_gene <- sub[which.max(sub$V2)]$V1
  return(longest_gene)
}

'Obtain representative genes\n'
genes_description <- binary_matrix$descriptions
new_rep <- parSapply(cl, genes_description, extract_long_genes, gene_length = gene_lengths)
binary_matrix$gene <- new_rep
write.table(binary_matrix, args[4], row.names = F)

'Obtain representative sequences\n'
representative_genes <- binary_matrix$gene
fasta <- Biostrings::readAAStringSet(args[5])
subset <- fasta[representative_genes]
writeXStringSet(subset, args[6])
args = commandArgs(trailingOnly=TRUE)
## args[1] is the binary cluster matrix
## args[2] is the cog annotation
## args[3] is the kegg annotation
## args[4] is the hmm annotation
## args[5] is the prokka annotations
## args[6] is the output name of the matrix with all annotations

'Reading input files'
matrix <- read.table(args[1], header = T)
clustervsgene <- matrix[,c('clusters', 'gene')]

### COG annotations
cog <- read.table(args[2], header = T)

### KEGG annotations
kegg <- read.table(args[3], header = T)

### Prokka annotations
prokka <- read.table(args[5], header = T)

'Merge annotations COG/KEGG'
all_annotations <- merge(kegg, cog, by = 'gene', all = T)

'Merge with HMM annotations'
hmm_annotations <- read.table(args[4], header = T)
all_annotations <- merge(all_annotations, hmm_annotations, by = 'gene', all = T)

'Merge with prokka annotation'
all_annotations <- merge(all_annotations, prokka, by.x = 'gene', by.y = 'geneid', all = T)

'Merge with absence_presence_matrix'
all_annotations <- merge(clustervsgene, all_annotations, by = 'gene', all = T)
all_annotations$gene <- NULL

column_number = ncol(matrix)
matrix_info = matrix[,c(column_number - 1, column_number)]
matrix <- matrix[1:(column_number - 2)]
matrix$completeness <- 'True'
matrix <- cbind(matrix, matrix_info)

MEGAMATRIX <- merge(matrix, all_annotations, by = 'clusters', all.x = T)

'Write MEGAMATRIX'
write.table(MEGAMATRIX, args[6], row.names = F)
library(dplyr)
library(stringr)
source('src/rhmmer_functions.R')

# args[1] is the dbcan annotation
# args[2] is the pfam annotation
# args[3] is the dbcan family info
# args[4] is the pfam family info
# args[5] is the output table with both annotations
args = commandArgs(trailingOnly=TRUE)

### Extract top hits function
top_hits <- function(table){
  genes <- unique(table$domain_name)
  top_hits <- table[0,]
  for (i in 1:length(genes)){
    subtable <- table[table$domain_name == genes[i],]
    min_evalue <- subtable[which.min(subtable$sequence_evalue),]
    top_hits <- rbind(top_hits, min_evalue)
  }
  return(top_hits)
}

## DBCAN annotations
dbcan <- read_tblout(args[1])
dbcan <- top_hits(dbcan)
dbcan$query_name <- sapply(strsplit(dbcan$query_name,"_"), `[`, 1)
dbcan$query_name <- str_remove(dbcan$query_name, '.hmm')
dbcan_description <- read.delim(args[3], header = F)
colnames(dbcan_description)[1] <- 'dbcan_family'
colnames(dbcan_description)[2] <- 'dbcan_description'
dbcan_final <- merge(dbcan[,c(1,3)], dbcan_description, by.x = 'query_name', by.y = 'dbcan_family', all.x = T)
colnames(dbcan_final)[1] <- 'dbcanid'
colnames(dbcan_final)[2] <- 'gene'
## Group different annotations of the same gene
dbcan_final <- dbcan_final %>% group_by_at(vars(gene)) %>% summarize_all(paste, collapse = ",")
### Check problems of family names with a _

## PFAM annotations
pfam <- read_tblout(args[2])
pfam <- top_hits(pfam)
pfam <- pfam[,c(1,4)]
pfam$query_accession <- sapply(strsplit(pfam$query_accession,"[.]"), `[`, 1)
pfam_description <- read.delim2(args[4], header = F)
pfam_description <- pfam_description[,c(1,5)]
colnames(pfam_description)[1] <- 'query_accession'
colnames(pfam_description)[2] <- 'pfam_description'
pfam_final <- merge(pfam, pfam_description, by = 'query_accession', all.x = T)
colnames(pfam_final)[1] <- 'pfamid'
colnames(pfam_final)[2] <- 'gene'
## Group different annotations of the same gene
pfam_final <- pfam_final %>% group_by_at(vars(gene)) %>% summarize_all(paste, collapse = ",")

### Merge both annotations in one table
hmm_annotations <- merge(pfam_final, dbcan_final, by = 'gene', all = T)
write.table(hmm_annotations, args[5], row.names = FALSE)
library(stringr)
library(readr)

args = commandArgs(trailingOnly=TRUE)

matrix <- read_delim(args[1], col_names = T, quote = "\"")
matrix <- as.data.frame(matrix)

names_equivalence <- read.table(args[3], header = T)
names_equivalence$Full_name <- str_remove(names_equivalence$Full_name, '_O.fna')
names_equivalence$MicroLife_name <- str_remove(names_equivalence$MicroLife_name, '_O.fna')

n_samples <- grep("completeness", colnames(matrix)) - 1
old_colnames <- data.frame(MicroLife_name = colnames(matrix)[2:n_samples])
M <- merge(old_colnames, names_equivalence, by = 'MicroLife_name')
M <- M[match(M$MicroLife_name, old_colnames$MicroLife_name),]
colnames(matrix)[2:n_samples] <- M$Full_name
write.table(matrix, args[4], row.names = F)

### Rename bigscape table
big_scape_matrix <- read.table(args[2], header = T)
old_colnames <- data.frame(MicroLife_name = colnames(big_scape_matrix))
M <- merge(old_colnames, names_equivalence, by = 'MicroLife_name')
M <- M[match(M$MicroLife_name, old_colnames$MicroLife_name),]
colnames(big_scape_matrix) <- M$Full_name
write.table(big_scape_matrix, args[5], row.names = T)

### Create mapping_file
mapping_file = data.frame(Sample = names_equivalence$Full_name, Lifestyle = 'Unknown')
write.table(mapping_file, args[6], row.names = F)
import fnmatch
import os, glob
from Bio import SeqIO
import sys, traceback
import argparse


def rename_fasta(infile, tagfile, outfile, newtagfile):
    counters = {}
    tags = {}
    newseqs = []
    t = open(tagfile, 'r')
    tn = open(newtagfile, 'w')
    for line in t:
        data = line.strip().split('\t')
        tags[data[0]] = data
    seqs = SeqIO.parse(infile, 'fasta')
    for s in seqs:
        data = tags[s.id]
        counter = counters.get(data[2], 0)
        counter = counter + 1
        idtag = "%s|%06d" % (data[2], counter)
        counters[data[2]] = counter
        s.id = idtag
        s.description = ""
        tn.write(idtag)
        tn.write('\t')
        tn.write('\t'.join(data))
        tn.write('\n')
        newseqs.append(s)
    tn.close()
    try:
        SeqIO.write((newseqs), outfile, 'fasta')
    except:
        print("Exception in user code:")
        print("-"*60)
        traceback.print_exc(file=sys.stdout)
        print("-"*60)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', '--fasta', dest = 'fasta', help = 'fasta input file', required = True)
    parser.add_argument('-t', '--tags', dest = 'tags', help = 'tags', required = True)
    parser.add_argument('-o', '--outfile', dest = 'outfile', help = 'outfile', required = True)
    parser.add_argument('-n', '--newtags', dest = 'newtags', help = 'tags', required = True)
    args = parser.parse_args()
    rename_fasta(args.fasta, args.tags, args.outfile, args.newtags)
Support
- Future updates