This is a phylogeny workflow designed for batch submission to an HPC cluster managed by the Slurm queue manager.
PHACT
PHACT (PHylogeny-Aware Computation of Tolerance, for amino acid substitutions) is a computational framework, built as a Snakemake workflow, to predict the functional effects of missense mutations.
The proposed approach exploits phylogenetic tree information to measure the deleteriousness of a given variant. The workflow considers not only amino acid frequencies but also the evolutionary relationships between species, the physicochemical properties of amino acids, and the dependence or independence of substitutions.
The PHACT workflow applied to example queries to predict the effect of missense mutations.
First time setup
The following steps are required to run the PHACT workflow:
- Install conda & snakemake without superuser permissions (see the sketch after this list)
- Configure the workflow (see below)
- Set model parameters (see below)
- Test the PHACT workflow (see below)
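A minimal sketch of a user-space installation (the installer URL and the environment name "snakemake" are illustrative assumptions, not taken from the PHACT documentation; the environment name matches the activation step used later):
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3   # installs under $HOME, no root needed
$ source $HOME/miniconda3/bin/activate
$ conda create -n snakemake -c conda-forge -c bioconda snakemake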
Genetic database
Configuration
The configuration file (config/config.yml), a dictionary of configuration parameters and their values, makes the workflow more flexible. The following global parameters should be set:
- workdir indicates the working directory. After cloning the repository, the full path (for example, /home/emrah/PHACT) should be set properly.
- query_files is a list of query files in FASTA format, for example ["O95153", "O60312"], that the workflow will analyze.
- archived_path, an optional parameter, stores all output of completed analyses in the specified folder when the corresponding bash script available in the repository is used.
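Put together, these entries might look as follows in config/config.yml (a minimal sketch; the archive path is a placeholder):
workdir: "/home/emrah/PHACT"
query_files: ["O95153", "O60312"]
archived_path: "/path/to/archive"   # optional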
Meanwhile, the tools and versions used while performing the tasks are defined in a set of environment YAML files available under the workflow/envs folder. Each file is prepared for the conda package manager to install and set up an environment based on its definition.
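For illustration, such an environment file might look like the following (the file name, channels, and version pin are assumptions rather than the repository's actual contents):
# workflow/envs/mafft.yaml (illustrative)
channels:
  - conda-forge
  - bioconda
dependencies:
  - mafft=7.490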
Model parameters
The PHACT framework is based on a Snakemake workflow, which is defined by specifying rules in the Snakefile (workflow/Snakefile). Rules decompose the workflow into small steps such as finding homologs of each query sequence (PSI-BLAST), performing multiple sequence alignment (MAFFT), or generating a maximum-likelihood phylogenetic tree (RAxML-NG, FastTree).
Each rule has its own model parameters, which can be set via a single configuration file (config/config.yml). For example, the method for multiple sequence alignment is selected as follows.
#alignment
mafft_method: "-fftns" #if left empty it will be FFT-NS-2
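Note that this value is interpolated directly into the rule's shell command, mafft{config[mafft_method]} (see the msa snippet under Code Snippets), so mafft_method: "-fftns" yields an invocation along these lines (the file names and thread count are illustrative):
$ mafft-fftns --anysymbol --thread 4 query.fasta > query_msa.fasta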
Test workflow
To check that the workflow is defined properly and to estimate the amount of computation remaining, use the --dry-run parameter as follows.
$ cd workflow
$ snakemake --dry-run
It summarizes the total number of jobs (rules) to be performed and the sets of input and output files to be used and created, respectively. For the two query files given in this example, 29 jobs will be executed, as listed below.
Job counts:
    count   jobs
    1       all
    2       fast_tree
    2       get_blasthits
    2       iqtree_anc_score
    2       iqtree_ancestral
    2       ml_tree
    2       msa
    2       psiblastp
    2       query_fasta
    2       raxml_anc_score
    2       raxmlng_ancestral
    2       remove_gaps
    2       trim_msa
    2       unroot_fasttree
    2       unroot_tree
    29
Running PHACT workflow
The simplest way to run the PHACT workflow on a local computer is as follows.
- Clone the PHACT repository and cd into it:
$ git clone https://github.com/CompGenomeLab/PHACT.git
$ cd PHACT
- Modify the workdir and query_files parameters in config/config.yml to be the path of the working directory ($ pwd gives the working directory) and the query IDs.
- Prepare the environment for snakemake:
$ conda activate snakemake
- Run the workflow and follow the executed rules during the analyses:
$ snakemake -j 1 --use-conda
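The -j (--cores) parameter caps the number of CPU cores Snakemake may use; on a multi-core machine, independent rules can run in parallel by raising it, for example:
$ snakemake -j 8 --use-conda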
PHACT output
The outputs are written to subfolders of results, prefixed with the query ID. Each query ID folder holds its own BLAST, multiple sequence alignment, maximum-likelihood phylogenetic tree, and ancestral sequence reconstruction outputs, together with the computed scores. The results/query_id directory will have the following structure:
.
├── 1_psiblast
├── 2_msa
├── 3_mltree
├── 4_raxmlng_ancestral
├── 5_raxmlng_ancestral_scores
├── 6_fasttree
├── 7_iqtree_ancestral
└── 8_iqtree_ancestral_scores
8 directories
Scaling the pipeline across an HPC cluster (farm)
In addition to running workloads on a local computer with a limited number of query IDs, the PHACT framework is designed to analyze a bulk of query IDs in parallel using a High Performance Computing (HPC) center.
Most HPC clusters have a scheduler that decides which workloads run on which compute nodes. To interact with a scheduler, users typically need to prepare a bash script and then submit it to the cluster. Snakemake can perform all of these steps automatically.
The PHACT framework, being based on Snakemake, provides a number of profile files prepared for the SLURM scheduler, the most widely used one today. The Slurm profile files (config/slurm_truba or config/slurm_sabanci) contain the fundamental definitions for executing the workflow on an HPC system with Slurm.
Running Snakemake with
snakemake --profile config/slurm_truba
is enough to make it search the given directory and load the profile, which can be reused at any time.
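For orientation, a Slurm profile's config.yaml typically contains entries along these lines (a hedged sketch; the actual keys and values in config/slurm_truba may differ):
# config/slurm_truba/config.yaml (illustrative)
cluster: "sbatch --cpus-per-task={resources.cpus} --time={resources.time}"
jobs: 100          # maximum number of jobs submitted concurrently
use-conda: true    # build rule environments from workflow/envs
restart-times: 1   # resubmit a failed job once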
Tips and tricks for running the workflow
- Caching between workflows is a very useful feature that Snakemake provides. To avoid redundant computation across workflows, it is recommended to use the cache as follows, although the implementation should be considered experimental.
$ export SNAKEMAKE_OUTPUT_CACHE=<THE_PATH_OUTPUT_WILL_BE_CACHED>
$ snakemake --use-conda --cache --profile ../config/slurm_truba
- Monitoring the execution via a web URL is possible if a panoptes server is installed and configured properly. Running snakemake with the --wms-monitor http://ephesus.sabanciuniv.edu:5000 parameter is necessary.
- Using the --keep-going parameter is recommended so that independent jobs keep running if any single task fails.
- The screen command, or executing snakemake in the background, is useful for long sets of runs in order to avoid trouble if the connection to the user interface or terminal is lost (see the sketch below).
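A minimal sketch of the screen-based pattern (the session name is arbitrary):
$ screen -S phact                                    # open a detachable session
$ snakemake --profile config/slurm_truba --keep-going
# detach with Ctrl-a d; the run keeps going on the server
$ screen -r phact                                    # reattach later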
Distribution and reproducibility
We provide a template for a reproducible and scalable research project using Snakemake together with various programming languages (Python, R) and tools managed by the conda package manager. A set of rules implements the entire research pipeline, starting from genetic databases: querying for homologs, aligning the sequences, trimming the MSA, generating the phylogenetic tree, reconstructing ancestral states, and finally computing the scores.
To obtain the same results after running the workflow regardless of the platform (local computer, HPC, cloud computing), the framework requires the same set of input files and the same software/tool versions. Since all requirements are defined explicitly, identical results can be reproduced.
Citing this work
Kuru, N., Dereli, O., Akkoyun, E., Bircan, A., Tastan, O., & Adebali, O. (2022). PHACT: Phylogeny-aware computing of tolerance for missense mutations. Molecular Biology and Evolution. https://doi.org/10.1093/molbev/msac114
Data availability
All data obtained during this study are shared at the following URL: https://phact.sabanciuniv.edu/pubs/kuru_mbe_2022/
- Please see the Publications/Kuru_etal_2022 folder to reproduce the figures in the PHACT manuscript.
Acknowledgement
This work was supported by the Health Institutes of Turkey (TUSEB) (Project no: 4587 to O.A.) and an EMBO Installation Grant (Project no: 4163 to O.A.) funded by the Scientific and Technological Research Council of Turkey (TÜBİTAK). Most of the numerical calculations reported in this paper were performed at the High Performance and Grid Computing Center (TRUBA resources) of TÜBİTAK. The TOSUN cluster at Sabanci University was also used for computational analyses. We also thank Nehircan Özdemir for his artistic illustration of the PHACT approach.
Code Snippets
shell: "FastTreeMP {config[fasttree_model]} {input.trimmed_msa} > {output.bestTree} 2>{log}"

shell: "python3 scripts/parse_blastp.py {input.blastp_out} {config[blast_hit_number]} {config[max_e_value]} {config[min_identity]} {params.blastdb} {input.query_fasta} 2> {log}"

shell: "iqtree2 -redo -s {input.no_gap_msa} -te {input.unrooted_tree} -m {config[iqtree_ancestral_model]} -asr -safe -nt {resources.cpus} --prefix {params.iqtree_ancestral_out_name}"

shell: "query=`python scripts/get_query.py {params.query_fasta}` && Rscript scripts/ComputeScore_Nonnormal_withDiversity_0411.R {input.tree_file} {input.probabilities} {params.fasta} {params.out} $query {config[weights]} 2>{log}"

shell: "mafft{config[mafft_method]} --anysymbol --thread {resources.cpus} {input.fasta} > {output.msa_file} 2> {log}"

shell: "psiblast -query {input.query_fasta} -db {params.blastdb} -outfmt {config[outfmt]} -out {output.outfile} -max_target_seqs {config[max_target_seqs]} -num_iterations {config[num_iterations]} 2> {log}"

shell: "python scripts/make_query_fasta.py {params.query_id} {params.blastdb} {output.fasta_file} 2> {log}"

shell: "raxml-ng --ancestral --msa {input.no_gap_msa} --tree {input.unrooted_tree} --model {config[raxmlng_ancestral_model]} --prefix {params.raxml_ancestral_out_name} --threads {resources.cpus}"

shell: "python3 scripts/remove_gaps.py {input.query_file} {input.msa_file} {input.tree_file} {output.no_gap_file} 2> {log}"
15 16 | shell: "python3 scripts/remove_gaps.py {input.query_file} {input.msa_file} {input.tree_file} {output.no_gap_file} 2> {log}" |
13 14 | shell: "trimal -in {input.msa_file} -out {output.trimmed_msa} {config[trimal_method]} 2> {log}" |
13 14 | shell: "python scripts/unroot_tree.py {input.input_tree} {input.no_gap_msa_file} 2> {log}" |
library(ape)
library(tidytree)
library(stringr)
library(dplyr)
library(bio3d)
#source("./position_score.R")

args = commandArgs(trailingOnly=TRUE)

aa_to_num <- function(aa) {
  amino_acids <- c("G", "A", "L", "M", "F", "W", "K", "Q", "E", "S", "P", "V", "I", "C", "Y", "H", "R", "N", "D", "T")
  num <- sapply(aa, function(a){ifelse(sum(amino_acids %in% a) == 1, as.numeric(which(amino_acids %in% a)), 21)})
  # num <- ifelse(sum(amino_acids %in% aa) == 1, as.numeric(which(amino_acids %in% aa)), 21)
  return(num)
}

num_to_aa <- function(num) {
  amino_acids <- c("G", "A", "L", "M", "F", "W", "K", "Q", "E", "S", "P", "V", "I", "C", "Y", "H", "R", "N", "D", "T")
  aa <- ifelse(num == 21, 21, amino_acids[num])
  return(aa)
}

compute_score <- function(file_nwk, file_rst, file_fasta, output_name, human_id, pos_chosen, parameters) {
  # Read tree file
  tr_org <- read.tree(file_nwk)
  x <- read.table(file = file_rst, sep = '\t', header = TRUE, fill = TRUE)
  colnames(x)[4:ncol(x)] <- gsub("p_", replacement = "", x = colnames(x)[4:ncol(x)], fixed = TRUE)
  # Tree_info: node-node, node-leaf connections
  tree_info <- as.data.frame(as_tibble(tr_org))
  # Read fasta file, MSA
  fasta <- read.fasta(file = file_fasta)
  msa <- fasta$ali
  # connections_1: Parent node, connections_2: connected node/leaf
  connections_1 <- tree_info$parent
  connections_2 <- tree_info$node
  # Names of leaves
  names_all <- tr_org[["tip.label"]]
  msa <- msa[names_all, ]
  # Number of total leaves&nodes
  num_leaves <- length(tr_org[["tip.label"]])
  num_nodes <- tr_org[["Nnode"]]
  # Unnecessary
  num_nodes_codeml <- max(connections_1, connections_2)
  if (num_nodes_codeml - num_leaves != num_nodes){
    print("Number of nodes is less than expected: CODEML")
  }
  # Distance between leaves & nodes
  dd_node <- dist.nodes(tr_org)
  dist_leaf <- dd_node[1:num_leaves, 1:num_leaves]
  dist_node <- dd_node[(num_leaves+1):(num_leaves + num_nodes), (num_leaves+1):(num_leaves + num_nodes)]
  # Human position (leaf & node)
  h_name <- human_id
  human_codeml <- names_all[grep(pattern = h_name, x = names_all, fixed = TRUE)]
  leaf_human <- tree_info[which(tree_info$label == human_codeml), "node"]
  human_plc <- leaf_human
  node_human <- tree_info[which(tree_info$label == human_codeml), "parent"]
  nodes_raxml <- as.numeric(gsub(pattern = "Node", replacement = "", x = tree_info[num_leaves+1:num_nodes, "label"])) # Node or Branch
  names(nodes_raxml) <- tree_info[num_leaves+1:num_nodes, "node"]
  # Total number of positions from ancestralProbs file
  total_pos <- max(x$Site)
  # Chosen positions (all or some)
  if (pos_chosen[1] == "all"){
    positions <- 1:total_pos
    score_all <- matrix(0, total_pos, 21)
  } else {
    positions <- pos_chosen
    score_all <- matrix(0, length(positions), 21)
  }
  ####################################################
  # Connections between leaves & nodes
  chosen_leaves <- tree_info[1:num_leaves, c("parent", "node")]
  # Connections between nodes & nodes
  chosen_nodes <- tree_info[(num_leaves+2):(num_leaves + num_nodes), c("parent", "node")]
  leaf_names <- tree_info$label
  human_leaf_len <- as.double(tree_info[human_plc, "branch.length"])
  d_n <- dist_node[as.character(node_human), ] + human_leaf_len
  # d_n <- d_n[nn2[,1]]
  d_l <- dist_leaf[leaf_human, ]
  # chosen_nodes2: ordered connections (for probability differences)
  chosen_nodes2 <- matrix(0, num_nodes-1, 2)
  n1 <- as.numeric(chosen_nodes$parent)
  n2 <- as.numeric(chosen_nodes$node)
  dist_f <- d_n[as.character(n1)]
  dist_s <- d_n[as.character(n2)]
  chosen_nodes2[which(dist_f < dist_s), 1] <- n2[which(dist_f < dist_s)]
  chosen_nodes2[which(dist_f < dist_s), 2] <- n1[which(dist_f < dist_s)]
  chosen_nodes2[which(dist_f >= dist_s), 1] <- n1[which(dist_f >= dist_s)]
  chosen_nodes2[which(dist_f >= dist_s), 2] <- n2[which(dist_f >= dist_s)]
  ###########################################################################
  ############### NEW PART - 15NOV ################
  ###########################################################################
  # Number of nodes between nodes & leaf of human
  nodes_conn <- numeric(num_nodes)
  nodes_conn[node_human-num_leaves] <- 1
  names(nodes_conn) <- names(d_n)
  chs <- c()
  chs2 <- c()
  inds <- chosen_nodes2[chosen_nodes2[,2]==(node_human), 1]
  nodes_conn[as.character(inds)] <- 2
  chs <- inds
  s0 <- sapply(3:num_leaves, function(i){
    for (j in chs){
      inds <- chosen_nodes2[chosen_nodes2[,2]==j, 1]
      if (length(inds)!=0){
        nodes_conn[as.character(inds)] <<- i
        chs2 <- c(chs2, inds)
      }
    }
    chs <<- chs2
    chs2 <- c()
  })
  # Number of nodes between leaves & leaf of human
  leaves_conn <- nodes_conn[as.character(chosen_leaves[,1])]
  #########################################
  parameters <- unlist(str_split(parameters, pattern = ","))
  score_norm <- t(mapply(function(ps, parameter){position_score(ps, x, msa, num_nodes, num_leaves, total_pos, human_plc, node_human, nodes_raxml, human_leaf_len, dist_node, dist_leaf, parameter, leaves_conn, nodes_conn, chosen_leaves, chosen_nodes2, d_n, d_l)}, rep(positions, length(parameters)), rep(parameters, each = length(positions))))
  score_norm_with_leaf <- matrix(unlist(score_norm[ ,1]), nrow = length(positions) * length(parameters), ncol = 20, byrow = TRUE)
  score_norm_without_leaf <- matrix(unlist(score_norm[ ,2]), nrow = length(positions) * length(parameters), ncol = 20, byrow = TRUE)
  score_norm_with_leaf <- cbind(rep(positions, length(parameters)), score_norm_with_leaf)
  score_norm_without_leaf <- cbind(rep(positions, length(parameters)), score_norm_without_leaf)
  colnames(score_norm_with_leaf) <- c("Pos/AA", num_to_aa(1:20))
  colnames(score_norm_without_leaf) <- c("Pos/AA", num_to_aa(1:20))
  print_wl <- lapply(1:length(parameters), function(p){
    score_to_print_wl <- score_norm_with_leaf[(positions + length(positions)*(p - 1)), ]
    score_to_print_wol <- score_norm_without_leaf[(positions + length(positions)*(p - 1)), ]
    filename <- ifelse(parameters[p] == "0", "max05", parameters[p])
    filename <- ifelse(parameters[p] == "X", "max05_Gauss", parameters[p]) ### New Line (11.11)
    write.csv(score_to_print_wl, sprintf("%s.csv", paste(output_name, "_wl_param_", filename, sep = "")), row.names = FALSE, quote = FALSE)
    write.csv(score_to_print_wol, sprintf("%s.csv", paste(output_name, "_wol_param_", filename, sep = "")), row.names = FALSE, quote = FALSE)
  })
}

position_score <- function(ps, x, msa, num_nodes, num_leaves, total_pos, human_plc, node_human, nodes_raxml, human_leaf_len, dist_node, dist_leaf, parameter, leaves_conn, nodes_conn, chosen_leaves, chosen_nodes2, d_n, d_l) {
  position <- ps
  b1 <- position + total_pos*(0:(num_nodes-1))
  TT <- x[b1,]
  matrix_prob <- matrix(0, num_nodes, 20)
  probs <- data.matrix((TT[, (4:ncol(TT))]))
  rownames(probs) <- NULL
  rr <- aa_to_num(colnames(x)[4:ncol(TT)])
  matrix_prob[,rr] <- probs
  matrix_prob <- matrix_prob[nodes_raxml,]
  position_vec <- msa[, ps]
  position_num <- aa_to_num(position_vec)
  prob_leaves <- matrix(0, num_leaves, 20)
  prob_leaves[cbind(which(position_num <= 20), position_num[which(position_num <= 20)])] <- 1
  gaps <- which(position_num == 21)
  tot_sc <- 1
  mxx <- 1
  diff_leaves <- matrix(0, num_leaves, 20)
  diff_leaves <- prob_leaves - matrix_prob[(chosen_leaves$parent - num_leaves), ]
  diff_leaves[human_plc,] <- -diff_leaves[human_plc,]
  diff_nodes <- matrix(0, num_nodes-1, 20)
  diff_nodes <- matrix_prob[chosen_nodes2[,1] - num_leaves, ] - matrix_prob[chosen_nodes2[,2] - num_leaves, ]
  ################## weights
  weights <- weight_fnc(d_n, d_l, human_plc, parameter, leaves_conn, nodes_conn, mxx) # Updated with related parameters
  weight_leaf <- weights[1:num_leaves]
  weight_node <- tail(weights, num_nodes)
  ####################
  score <- matrix(0,1,20)
  s1 <- sapply(1:20, function(ii){
    dif_pr <- diff_nodes[1:(num_nodes-1), ii]
    dif_pr[dif_pr<0] <- 0 # CHANGED 29.03
    sel_node <- chosen_nodes2[1:length(dif_pr), 1] - num_leaves
    score[ii] <<- score[ii] + sum(weight_node[sel_node] * dif_pr)
  })
  ### NOVEL 29.03
  aa_f <- position_num[human_plc]
  vect_human <- matrix_prob[node_human - num_leaves,]
  vect_human[aa_f] <- 0
  score_without_leaf <- score
  score_without_leaf <- score_without_leaf + weight_node[(node_human-num_leaves)]*vect_human
  score <- score + weight_node[(node_human-num_leaves)]*vect_human
  s2 <- sapply(1:20, function(ii){
    diff_lf <- diff_leaves[1:num_leaves, ii]
    diff_lf[gaps] <- 0
    diff_lf[diff_lf<0] <- 0
    s1 <- sum(weight_leaf[((1:length(diff_lf)) != human_plc)] * diff_lf[((1:length(diff_lf)) != human_plc)])
    score[ii] <<- score[ii] + s1
  })
  aa_f <- position_num[human_plc]
  if (aa_f != 21){
    score[aa_f] <- score[aa_f] + weight_leaf[human_plc]*1
    score_without_leaf[aa_f] <- score_without_leaf[aa_f] + weight_leaf[human_plc]*1
  }
  sum_exc_max <- sum(score_without_leaf) - max(score_without_leaf)
  diversity <- (-(length(which(score_without_leaf<0.0001))*0.1)/20+0.1)*(sum_exc_max)
  scores <- list()
  scores$score_with_leaf <- 1 - log((score*0.9 + diversity)/(num_nodes+num_leaves)+10^(-15))/log(10^(-15))
  scores$score_without_leaf <- 1 - log((score_without_leaf*0.9 + diversity)/num_nodes + 10^(-15))/log(10^(-15))
  return(scores)
}

weight_fnc <- function(d_n, d_l, human_plc, parameter, leaves_conn, nodes_conn, mxx) {
  # print(parameter)
  if (parameter=="0"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_l_n)
    d_l2 <- d_l/min_l
    d_n2 <- d_n/min_l
    d_l2 <- d_l2 + 1
    d_n2 <- d_n2 + 1
    weight_node <- 1/d_n2
    weight_leaf <- 1/d_l2
  } else if (parameter=="0_MinNode"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_n)
    d_l2 <- d_l/min_l
    d_n2 <- d_n/min_l
    d_l2 <- d_l2 + 1
    d_n2 <- d_n2 + 1
    weight_node <- 1/d_n2
    weight_leaf <- 1/d_l2
  } else if (parameter=="0_MinNode_Mix"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_n)
    d_l2 <- d_l/min_l
    d_n2 <- d_n/min_l
    d_l2 <- d_l2 + 1
    d_n2 <- d_n2 + 1
    weight_node1 <- 1/d_n2
    weight_leaf1 <- 1/d_l2
    param <- mean(c(d_n, d_l))
    weight_leaf2 <- exp(-d_l^2/param^2)
    weight_node2 <- exp(-d_n^2/param^2)
    weight_node <- sqrt(weight_node1*weight_node2)
    weight_leaf <- sqrt(weight_leaf1*weight_leaf2)
  } else if (parameter=="0_MinNode_Mix2"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_n)
    d_l2 <- d_l/min_l
    d_n2 <- d_n/min_l
    d_l2 <- d_l2 + 1
    d_n2 <- d_n2 + 1
    weight_node1 <- 1/d_n2
    weight_leaf1 <- 1/d_l2
    param <- mean(c(d_n, d_l))
    weight_leaf2 <- exp(-d_l^2/param^2)/2
    weight_node2 <- exp(-d_n^2/param^2)/2
    weight_node <- sqrt(weight_node1*weight_node2)
    weight_leaf <- sqrt(weight_leaf1*weight_leaf2)
  } else if (parameter == "mean"){
    param <- mean(c(d_n, d_l))
    weight_leaf <- exp(-d_l^2/param^2)
    weight_node <- exp(-d_n^2/param^2)
  } else if (parameter == "median"){
    param <- median(c(d_n, d_l))
    weight_leaf <- exp(-d_l^2/param^2)
    weight_node <- exp(-d_n^2/param^2)
  } else if (parameter == "X"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_l_n)
    min_n <- min(d_n)
    min_ch <- min(min_l, min_n)
    d_l2 <- d_l - min_ch
    d_n2 <- d_n - min_ch
    weight_leaf <- exp(-d_l^2)/2
    weight_node <- exp(-d_n^2)/2
    weight_leaf[human_plc] <- 1
  } else if (parameter == "CountNodes_1"){ ### NewFunction 1 (15Nov)
    weight_node = (exp(-d_n^2) + 1/nodes_conn)/2
    weight_leaf = (exp(-d_l^2) + 1/leaves_conn)/2
  } else if (parameter == "CountNodes_2"){ ### NewFunction 2 (15Nov)
    weight_node = (exp(-d_n^2) + exp(-nodes_conn^2))/2
    weight_leaf = (exp(-d_l^2) + exp(-leaves_conn^2))/2
  } else if (parameter == "CountNodes_3"){ ### NewFunction 3 (15Nov)
    weight_node = sqrt(exp(-d_n^2)*1/nodes_conn)
    weight_leaf = sqrt(exp(-d_l^2)*1/leaves_conn)
  } else if (parameter == "CountNodes_4"){ ### NewFunction 4 (15Nov)
    weight_node = exp(-(sqrt(d_n*nodes_conn))^2)
    weight_leaf = exp(-(sqrt(d_l*leaves_conn))^2)
  } else if (parameter == "Equal"){
    weight_leaf <- matrix(1,1,length(d_l))
    weight_node <- matrix(1,1,length(d_n))
  } else if (parameter == "MinThreshold"){
    max_dis <- max(c(d_l, d_n))
    param_min <- as.double(param_min)
    weight_node <- (-1+param_min)/max_dis*d_n + 1
    weight_leaf <- (-1+param_min)/max_dis*d_l + 1
  } else if (parameter == "MinThreshold_Gauss"){
    max_dis <- max(c(d_l, d_n))
    param_min <- as.double(param_min)
    param <- sqrt(-max_dis^2/log(param_min))
    weight_leaf <- exp(-d_l^2/param^2)
    weight_node <- exp(-d_n^2/param^2)
  } else {
    param <- as.double(parameter)
    weight_leaf <- exp(-d_l^2/param^2)
    weight_node <- exp(-d_n^2/param^2)
  }
  weights = c(weight_leaf, weight_node)
  return(weights)
}

csv_file <- compute_score(file_nwk=args[1], file_rst=args[2], file_fasta=args[3], output_name=args[4], human_id=args[5], 'all', parameters = args[6])
import re
import sys

# Print and return the query identifier (accession + taxon ID) parsed from the
# first fasta header.
def get_human_id(fasta_file):
    with open(str(fasta_file), 'r') as f:
        for line in f:
            if line.startswith('>'):
                # Keep only the leading '>' as the header marker
                line = line[0] + line[1:].replace('>', '-')
                header = line.split(">")[1].strip()
                human_id = header.split(" ")[0] + "_" + header.split("OX=")[1].split(" ")[0]
                human_id = re.sub(r'[^\w>]', '_', human_id)
                print(human_id)
                return human_id

get_human_id(sys.argv[1])
import sys

# Extract the sequence whose accession matches acc_no from the database fasta
# and write it to output_path.
def make_query_fasta(acc_no, all_eu_file, output_path):
    protein_dict = {}
    with open(all_eu_file, 'r') as f:
        for line in f:
            if line.startswith('>'):
                line = line[0] + line[1:].replace('>', '-')
                header = line.split(">")[1].strip()
                if header.split("|")[1].split("|")[0] == acc_no:
                    line = next(f).strip()
                    protein_dict[header] = ''
                    while not line.startswith('>'):
                        protein_dict[header] += line
                        # Default sentinel avoids StopIteration when the record is last in the file
                        line = next(f, '>').strip()
    with open(output_path, 'w') as f:
        for header, sequence in protein_dict.items():
            f.write(">" + header + "\n" + sequence + "\n")

if __name__ == "__main__":
    acc_no = sys.argv[1]
    all_eu_file = sys.argv[2]
    output_path = sys.argv[3]
    make_query_fasta(acc_no, all_eu_file, output_path)
import sys
import re

# Build a {header: sequence} dictionary from a fasta file.
def get_fasta_dict(fasta_file):
    fasta_dict = {}
    with open(fasta_file, 'r') as f:
        for line in f:
            if line.startswith('>'):
                line = line[0] + line[1:].replace('>', '-')
                header = line.split(">")[1].strip()
                fasta_dict[header] = ''
            else:
                fasta_dict[header] += line.strip()
    return fasta_dict

# Collect up to blast_hit_number hits passing the e-value and identity filters
# from a PSI-BLAST text report.
def parse_blastp_out(blastp_out_file, blast_hit_number, max_e_value, min_identity):
    blast_list = []
    with open(blastp_out_file, 'r') as filein_:
        for line in filein_:
            if len(blast_list) <= int(blast_hit_number):
                if line.startswith(">"):
                    best_hit = line.split(">")[1].strip()
                    best_hit = best_hit.split(" ")[0]
                    while not line.startswith(" Score"):
                        line = next(filein_)
                    if line.startswith(" Score"):
                        score = float(line.split("Score = ")[1].split(" bits")[0])
                        e_value = float(line.split("Expect = ")[1].split(",")[0])
                        identity_line = next(filein_)
                        identity = float(identity_line.split("Identities = ")[1].split("(")[1].split("%")[0])
                        if e_value < float(max_e_value) and identity > float(min_identity):
                            blast_list.append(best_hit)
    return blast_list

# Write the hit sequences (plus the query itself, if not among the hits) to
# <blast_out>.fasta, renaming headers to accession_taxid with non-word
# characters replaced by '_'.
def write_to_file(blast_list, all_eu_file, blastp_out_file, query_fasta):
    query_fasta_dict = get_fasta_dict(query_fasta)
    all_eu_dict = get_fasta_dict(all_eu_file)
    all_eu_keys_list = []
    for k, v in all_eu_dict.items():
        all_eu_keys_list.append(k.split(" ")[0].split("|")[1])
    if set(all_eu_keys_list).issuperset(set(blast_list)):
        with open(blastp_out_file.split(".")[0] + ".fasta", 'w') as f:
            for k, v in all_eu_dict.items():
                if k.split(" ")[0].split("|")[1] in blast_list:
                    new_k = k.split(" ")[0] + "_" + k.split("OX=")[1].split(" ")[0]
                    new_k = re.sub(r'[^\w>]', '_', new_k)
                    f.write(">" + new_k + "\n" + v + "\n")
            for h, s in query_fasta_dict.items():
                if h.split(" ")[0].split("|")[1] not in blast_list:
                    new_h = h.split(" ")[0] + "_" + h.split("OX=")[1].split(" ")[0]
                    new_h = re.sub(r'[^\w>]', '_', new_h)
                    f.write(">" + new_h + "\n" + s + "\n")

if __name__ == "__main__":
    blastp_out_file = sys.argv[1]
    blast_hit_number = sys.argv[2]
    max_e_value = sys.argv[3]
    min_identity = sys.argv[4]
    blastdb_file = sys.argv[5]
    query_fasta = sys.argv[6]
    blast_list = parse_blastp_out(blastp_out_file, blast_hit_number, max_e_value, min_identity)
    write_to_file(blast_list, blastdb_file, blastp_out_file, query_fasta)
import sys
import re
from fasta_dict import get_fasta_dict  # project helper providing get_fasta_dict
from ete3 import Tree

# Identifier of the query (human) sequence, derived from its fasta header.
def get_human_id(query_file):
    query_dict = get_fasta_dict(query_file)
    for k, v in query_dict.items():
        human_id = k.split(" ")[0] + "_" + k.split("OX=")[1].split(" ")[0]
        human_id = re.sub(r'[^\w>]', '_', human_id)
        return human_id

# Positions that are gaps in the query sequence of the MSA.
def get_gap_positions(seqDict, human_id):
    for key, value in seqDict.items():
        if human_id in key:
            indices = [i for i, x in enumerate(value) if x == "-"]
            return indices

# Drop the query-gap columns from every sequence.
def get_sequences_without_gap(seqDict, indices):
    new_seqDict = {}
    for k, v in seqDict.items():
        new_seqDict[k] = ''.join([v[i] for i in range(len(v)) if i not in indices])
    return new_seqDict

# Keep only sequences that are leaves of the tree, mask ambiguous residues,
# and skip sequences that became all-gap.
def write_new_fasta(new_seqDict, fasta_file, tree_file, output_file):
    t = Tree(tree_file, format=1)
    leaves = [leaf.name for leaf in t]
    with open(output_file, 'w') as new_file:
        for newheader, value in new_seqDict.items():
            if newheader in leaves:
                new_value = re.sub(r'[BXJZUO#$]', '-', value)
                if not all([c == "-" for c in new_value]):
                    new_file.write(">" + newheader + "\n" + new_value + "\n")

if __name__ == "__main__":
    query_file = sys.argv[1]
    fasta_file = sys.argv[2]
    tree_file = sys.argv[3]
    output_file = sys.argv[4]
    human_id = get_human_id(query_file)
    seqDict = get_fasta_dict(fasta_file)
    gap_indices = get_gap_positions(seqDict, human_id)
    new_seqDict = get_sequences_without_gap(seqDict, gap_indices)
    write_new_fasta(new_seqDict, fasta_file, tree_file, output_file)
Support
- Future updates