Python Protein conformational ensembles generation, building on PDBe-KB
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Protein Conformational ensembles generation
Building on PDBe-KB to chart and characterize the conformation landscape of native proteins
This tutorial aims to illustrate the process of generating protein conformational ensembles from** 3D structures **and analysing its molecular flexibility , step by step, using the BioExcel Building Blocks library (biobb) .
Conformational landscape of native proteins
Proteins are dynamic systems that adopt multiple conformational states , a property essential for many biological processes (e.g. binding other proteins, nucleic acids, small molecule ligands, or switching between functionaly active and inactive states). Characterizing the different conformational states of proteins and the transitions between them is therefore critical for gaining insight into their biological function and can help explain the effects of genetic variants in health and disease and the action of drugs.
Structural biology has become increasingly efficient in sampling the different conformational states of proteins. The PDB has currently archived more than 170,000 individual structures , but over two thirds of these structures represent multiple conformations of the same or related protein, observed in different crystal forms, when interacting with other proteins or other macromolecules, or upon binding small molecule ligands. Charting this conformational diversity across the PDB can therefore be employed to build a useful approximation of the conformational landscape of native proteins.
A number of resources and tools describing and characterizing various often complementary aspects of protein conformational diversity in known structures have been developed, notably by groups in Europe. These tools include algorithms with varying degree of sophistication, for aligning the 3D structures of individual protein chains or domains, of protein assemblies, and evaluating their degree of structural similarity . Using such tools one can align structures pairwise , compute the corresponding similarity matrix , and identify ensembles of structures/conformations with a defined similarity level that tend to recur in different PDB entries, an operation typically performed using clustering methods. Such workflows are at the basis of resources such as CATH, Contemplate, or PDBflex that offer access to conformational ensembles comprised of similar conformations clustered according to various criteria. Other types of tools focus on differences between protein conformations , identifying regions of proteins that undergo large collective displacements in different PDB entries, those that act as hinges or linkers , or regions that are inherently flexible .
To build a meaningful approximation of the conformational landscape of native proteins, the conformational ensembles (and the differences between them), identified on the basis of structural similarity/dissimilarity measures alone, need to be biophysically characterized . This may be approached at two different levels .
-
At the biological level , it is important to link observed conformational ensembles , to their functional roles by evaluating the correspondence with protein family classifications based on sequence information and functional annotations in public databases e.g. Uniprot, PDKe-Knowledge Base (KB). These links should provide valuable mechanistic insights into how the conformational and dynamic properties of proteins are exploited by evolution to regulate their biological function .
-
At the physical level one needs to introduce energetic consideration to evaluate the likelihood that the identified conformational ensembles represent conformational states that the protein (or domain under study) samples in isolation. Such evaluation is notoriously challenging and can only be roughly approximated by using computational methods to evaluate the extent to which the observed conformational ensembles can be reproduced by algorithms that simulate the dynamic behavior of protein systems. These algorithms include the computationally expensive classical molecular dynamics (MD) simulations to sample local thermal fluctuations but also faster more approximate methods such as Elastic Network Models and Normal Node Analysis (NMA) to model low energy collective motions . Alternatively, enhanced sampling molecular dynamics can be used to model complex types of conformational changes but at a very high computational cost.
The ELIXIR 3D-Bioinfo Implementation Study Building on PDBe-KB to chart and characterize the conformation landscape of native proteins focuses on:
- Mapping the conformational diversity of proteins and their homologs across the PDB.
- Characterize the different flexibility properties of protein regions, and link this information to sequence and functional annotation.
- Benchmark computational methods that can predict a biophysical description of protein motions.
This notebook is part of the third objective, where a list of computational resources that are able to predict protein flexibility and conformational ensembles have been collected, evaluated, and integrated in reproducible and interoperable workflows using the BioExcel Building Blocks library . Note that the list is not meant to be exhaustive, it is built following the expertise of the implementation study partners.
Code Snippets
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 | import time import argparse import zipfile from pathlib import Path, PurePath from biobb_common.configuration import settings from biobb_common.tools import file_utils as fu from biobb_structure_utils.utils.extract_model import extract_model from biobb_structure_utils.utils.extract_chain import extract_chain from biobb_analysis.ambertools.cpptraj_mask import cpptraj_mask from biobb_flexdyn.flexdyn.concoord_dist import concoord_dist from biobb_flexdyn.flexdyn.concoord_disco import concoord_disco from biobb_analysis.ambertools.cpptraj_rms import cpptraj_rms from biobb_analysis.ambertools.cpptraj_convert import cpptraj_convert from biobb_flexdyn.flexdyn.prody_anm import prody_anm from biobb_flexserv.flexserv.bd_run import bd_run from biobb_flexserv.flexserv.dmd_run import dmd_run from biobb_flexserv.flexserv.nma_run import nma_run from biobb_flexdyn.flexdyn.nolb_nma import nolb_nma from biobb_flexdyn.flexdyn.imod_imode import imod_imode from biobb_flexdyn.flexdyn.imod_imc import imod_imc from biobb_gromacs.gromacs.make_ndx import make_ndx from biobb_gromacs.gromacs.trjcat import trjcat from biobb_analysis.gromacs.gmx_cluster import gmx_cluster from biobb_flexserv.pcasuite.pcz_zip import pcz_zip from biobb_flexserv.pcasuite.pcz_info import pcz_info from biobb_flexserv.pcasuite.pcz_evecs import pcz_evecs from biobb_flexserv.pcasuite.pcz_animate import pcz_animate from biobb_flexserv.pcasuite.pcz_bfactor import pcz_bfactor from biobb_flexserv.pcasuite.pcz_hinges import pcz_hinges from biobb_flexserv.pcasuite.pcz_stiffness import pcz_stiffness from biobb_flexserv.pcasuite.pcz_collectivity import pcz_collectivity def main(config, system=None): start_time = time.time() conf = settings.ConfReader(config, system) global_log, _ = fu.get_logs(path=conf.get_working_dir_path(), light_format=True) global_prop = conf.get_prop_dic(global_log=global_log) global_paths = conf.get_paths_dic() global_log.info("step0_extract_model: Selecting a model") extract_model(**global_paths["step0_extract_model"], properties=global_prop["step0_extract_model"]) global_log.info("step1_extract_chain: Selecting a monomer") extract_chain(**global_paths["step1_extract_chain"], properties=global_prop["step1_extract_chain"]) global_log.info("step2_cpptraj_mask: Generating the reduced (coarse-grained) structure (backbone)") cpptraj_mask(**global_paths["step2_cpptraj_mask"], properties=global_prop["step2_cpptraj_mask"]) global_log.info("step3_cpptraj_mask: Generating the reduced (coarse-grained) structure (alpha carbons)") cpptraj_mask(**global_paths["step3_cpptraj_mask"], properties=global_prop["step3_cpptraj_mask"]) global_log.info("step4_concoord_dist: CONCOORD dist") concoord_dist(**global_paths["step4_concoord_dist"], properties=global_prop["step4_concoord_dist"]) global_log.info("step5_concoord_disco: CONCOORD disco") concoord_disco(**global_paths["step5_concoord_disco"], properties=global_prop["step5_concoord_disco"]) global_log.info("step6_cpptraj_rms: RMSd distribution of the generated ensemble (CONCOORD)") cpptraj_rms(**global_paths["step6_cpptraj_rms"], properties=global_prop["step6_cpptraj_rms"]) global_log.info("step7_cpptraj_convert: Converting the generated ensemble (CONCOORD)") cpptraj_convert(**global_paths["step7_cpptraj_convert"], properties=global_prop["step7_cpptraj_convert"]) global_log.info("step8_prody_anm: PRODY Anisotropic network model") prody_anm(**global_paths["step8_prody_anm"], properties=global_prop["step8_prody_anm"]) global_log.info("step9_cpptraj_rms: RMSd distribution of the generated ensemble (PRODY)") cpptraj_rms(**global_paths["step9_cpptraj_rms"], properties=global_prop["step9_cpptraj_rms"]) global_log.info("step10_cpptraj_convert: Converting the generated ensemble (PRODY)") cpptraj_convert(**global_paths["step10_cpptraj_convert"], properties=global_prop["step10_cpptraj_convert"]) global_log.info("step11_bd_run: Brownian Dynamics (BD)") bd_run(**global_paths["step11_bd_run"], properties=global_prop["step11_bd_run"]) global_log.info("step12_cpptraj_rms: Fitting and converting the generated ensemble (BD)") cpptraj_rms(**global_paths["step12_cpptraj_rms"], properties=global_prop["step12_cpptraj_rms"]) global_log.info("step13_dmd_run: Discrete Molecular Dynamics (DMD)") dmd_run(**global_paths["step13_dmd_run"], properties=global_prop["step13_dmd_run"]) global_log.info("step14_cpptraj_rms: RMSd distribution of the generated ensemble (DMD)") cpptraj_rms(**global_paths["step14_cpptraj_rms"], properties=global_prop["step14_cpptraj_rms"]) global_log.info("step15_nma_run: Normal Mode Analysis (NMA)") nma_run(**global_paths["step15_nma_run"], properties=global_prop["step15_nma_run"]) global_log.info("step16_cpptraj_rms: RMSd distribution of the generated ensemble (NMA)") cpptraj_rms(**global_paths["step16_cpptraj_rms"], properties=global_prop["step16_cpptraj_rms"]) global_log.info("step17_cpptraj_convert: Converting the generated ensemble (NMA)") cpptraj_convert(**global_paths["step17_cpptraj_convert"], properties=global_prop["step17_cpptraj_convert"]) global_log.info("step18_nolb_nma: NOn-Linear rigid Block NMA approach (NOLB)") nolb_nma(**global_paths["step18_nolb_nma"], properties=global_prop["step18_nolb_nma"]) global_log.info("step19_cpptraj_rms: RMSd distribution of the generated ensemble (NOLB)") cpptraj_rms(**global_paths["step19_cpptraj_rms"], properties=global_prop["step19_cpptraj_rms"]) global_log.info("step20_cpptraj_convert: Converting the generated ensemble (NOLB)") cpptraj_convert(**global_paths["step20_cpptraj_convert"], properties=global_prop["step20_cpptraj_convert"]) global_log.info("step21_imod_imode: iMOD imode") imod_imode(**global_paths["step21_imod_imode"], properties=global_prop["step21_imod_imode"]) global_log.info("step22_imod_imc: iMOD imc") imod_imc(**global_paths["step22_imod_imc"], properties=global_prop["step22_imod_imc"]) global_log.info("step23_cpptraj_rms: RMSd distribution of the generated ensemble (iMOD)") cpptraj_rms(**global_paths["step23_cpptraj_rms"], properties=global_prop["step23_cpptraj_rms"]) global_log.info("step24_cpptraj_convert: Converting the generated ensemble (iMOD)") cpptraj_convert(**global_paths["step24_cpptraj_convert"], properties=global_prop["step24_cpptraj_convert"]) global_log.info("Compressing trajectories") traj_zip = "structure_concat_traj.zip" traj_list = [global_paths["step7_cpptraj_convert"]["output_cpptraj_path"], global_paths["step10_cpptraj_convert"]["output_cpptraj_path"], global_paths["step24_cpptraj_convert"]["output_cpptraj_path"], global_paths["step12_cpptraj_rms"]["output_traj_path"], global_paths["step17_cpptraj_convert"]["output_cpptraj_path"]] with zipfile.ZipFile(traj_zip, 'w') as myzip: for file in traj_list: myzip.write(file, PurePath(file).name, compress_type=zipfile.ZIP_DEFLATED) paths = global_paths["step25_trjcat"] paths["input_trj_zip_path"] = PurePath(Path().absolute()).joinpath(traj_zip) global_log.info("step25_trjcat: Building the meta-trajectory") trjcat(**paths, properties=global_prop["step25_trjcat"]) global_log.info("step26_make_ndx: Clustering the meta-trajectory: make index") make_ndx(**global_paths["step26_make_ndx"], properties=global_prop["step26_make_ndx"]) global_log.info("step27_gmx_cluster: Clustering the meta-trajectory: clustering") gmx_cluster(**global_paths["step27_gmx_cluster"], properties=global_prop["step27_gmx_cluster"]) global_log.info("step28_cpptraj_rms: Fitting and converting the generated ensemble (meta-trajectory)") cpptraj_rms(**global_paths["step28_cpptraj_rms"], properties=global_prop["step28_cpptraj_rms"]) global_log.info("step29_pcz_zip: Computing the Principal Component Analysis (PCA) - ensemble") pcz_zip(**global_paths["step29_pcz_zip"], properties=global_prop["step29_pcz_zip"]) global_log.info("step30_pcz_zip: Computing the Principal Component Analysis (PCA) - ensemble gaussian") pcz_zip(**global_paths["step30_pcz_zip"], properties=global_prop["step30_pcz_zip"]) global_log.info("step31_pcz_info: Analysing the PCA report") pcz_info(**global_paths["step31_pcz_info"], properties=global_prop["step31_pcz_info"]) global_log.info("step32_pcz_evecs: PCA Eigenvectors & Eigenvalues") pcz_evecs(**global_paths["step32_pcz_evecs"], properties=global_prop["step32_pcz_evecs"]) global_log.info("step33_pcz_animate: Animate Principal Components") pcz_animate(**global_paths["step33_pcz_animate"], properties=global_prop["step33_pcz_animate"]) global_log.info("step34_cpptraj_convert: Converting the generated ensemble (PCA)") cpptraj_convert(**global_paths["step34_cpptraj_convert"], properties=global_prop["step34_cpptraj_convert"]) global_log.info("step35_pcz_bfactor: B-Factor x Principal Components") pcz_bfactor(**global_paths["step35_pcz_bfactor"], properties=global_prop["step35_pcz_bfactor"]) global_log.info("step36_pcz_hinges: Hinge Points Prediction - Bfactor_slope") pcz_hinges(**global_paths["step36_pcz_hinges"], properties=global_prop["step36_pcz_hinges"]) global_log.info("step37_pcz_hinges: Hinge Points Prediction - Dynamic_domain") pcz_hinges(**global_paths["step37_pcz_hinges"], properties=global_prop["step37_pcz_hinges"]) global_log.info("step38_pcz_hinges: Hinge Points Prediction - Force_constant") pcz_hinges(**global_paths["step38_pcz_hinges"], properties=global_prop["step38_pcz_hinges"]) global_log.info("step39_pcz_stiffness: Apparent Stiffness") pcz_stiffness(**global_paths["step39_pcz_stiffness"], properties=global_prop["step39_pcz_stiffness"]) global_log.info("step40_pcz_collectivity: Collectivity Index") pcz_collectivity(**global_paths["step40_pcz_collectivity"], properties=global_prop["step40_pcz_collectivity"]) elapsed_time = time.time() - start_time global_log.info('') global_log.info('') global_log.info('Execution successful: ') global_log.info(' Workflow_path: %s' % conf.get_working_dir_path()) global_log.info(' Config File: %s' % config) if system: global_log.info(' System: %s' % system) global_log.info('') global_log.info('Elapsed time: %.1f minutes' % (elapsed_time/60)) global_log.info('') if __name__ == '__main__': parser = argparse.ArgumentParser(description="Based on the official Gromacs tutorial") parser.add_argument('--config', required=True) parser.add_argument('--system', required=False) args = parser.parse_args() main(args.config, args.system) |
Support
- Future updates
Related Workflows





