As structural biology and drug discovery depend on high-quality protein structures, assessment tools are essential. We describe a new method for validating amino-acid conformations: "PhiSiCal (ϕψχal) Checkup". Twenty new joint probability distributions, in the form of statistical mixture models, explain the empirical distributions of the dihedral angles (ϕ, ψ, χ) of the canonical amino acids in experimental protein structures.
The tendency of an amino acid to adopt certain configurations in folded proteins is treated here as a statistical estimation problem. We model the joint distribution of the observed mainchain and sidechain dihedral angles ⟨ϕ, ψ, χ1, χ2, …⟩ of any amino acid by a mixture of products of von Mises probability distributions. Each vector of dihedral angles corresponds to a point on a multi-dimensional torus, over which the mixture defines a probability density.
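As a minimal sketch of this kind of model (with made-up weights and parameters, not the published ones), the following evaluates a two-component mixture of products of von Mises densities over a pair of dihedral angles (ϕ, ψ):

```python
# A minimal sketch (assumed parameters, not the published model): a mixture of
# products of von Mises densities over the dihedral angles (phi, psi).
import numpy as np
from scipy.stats import vonmises

# Hypothetical 2-component mixture: each component is a product of independent
# von Mises densities, one per angle; (mu, kappa) per angle per component.
weights = np.array([0.6, 0.4])
mus     = np.array([[-1.1,  2.4],    # component 1: (mu_phi, mu_psi) in radians
                    [-2.0, -0.7]])   # component 2
kappas  = np.array([[ 8.0,  6.0],
                    [10.0,  4.0]])

def mixture_density(phi, psi):
    """Density of (phi, psi) on the 2-torus under the mixture."""
    angles = np.array([phi, psi])
    dens = 0.0
    for w, mu, kappa in zip(weights, mus, kappas):
        # product of independent von Mises densities for this component
        dens += w * np.prod(vonmises.pdf(angles, kappa, loc=mu))
    return dens

print(mixture_density(-1.0, 2.5))
```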
Summary: Sequences of proteins evolve by accumulating substitutions together with insertions and deletions (indels) of amino acids. However, it remains common practice to treat substitutions and indels separately, inferring approximate models for each in isolation when quantifying sequence relationships. Although this approach brings computational convenience (which remains its primary motivation), there have been few attempts to model the two together in a unified and systematic way.
Motivation: Alignments are correspondences between sequences. How reliable are alignments of the amino acid sequences of proteins, and what inferences about protein relationships can be drawn from them? Using techniques not previously applied to these questions, we weight every possible sequence alignment by its posterior probability to derive a formal mathematical expectation, and we develop an efficient algorithm for computing the distance between alternative alignments, allowing quantitative comparison of sequence-based alignments with corresponding reference structural alignments.
Results: By analyzing the sequences and structures of 1 million protein domain pairs, we report the variation of the expected distance between sequence-based and structure-based alignments, as a function of (Markov time of) sequence divergence.
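A toy illustration of the underlying idea, a posterior-weighted expectation of a distance between alignments, is sketched below; the set-of-correspondences representation and the symmetric-difference distance are simplifications for illustration, not the paper's exact metric or algorithm:

```python
# Toy sketch (not the paper's algorithm): expected distance between a reference
# alignment and alternative alignments, weighting each alternative by its
# posterior probability. Alignments are sets of (i, j) residue correspondences.

def alignment_distance(a, b):
    """Size of the symmetric difference of two sets of correspondences."""
    return len(a ^ b)

def expected_distance(reference, candidates):
    """candidates: list of (alignment, posterior_probability) pairs."""
    total_prob = sum(p for _, p in candidates)
    return sum(p * alignment_distance(reference, aln)
               for aln, p in candidates) / total_prob

# Hypothetical example: three candidate alignments with posterior weights.
reference  = {(0, 0), (1, 1), (2, 2), (3, 3)}
candidates = [({(0, 0), (1, 1), (2, 2), (3, 3)}, 0.70),
              ({(0, 0), (1, 2), (2, 3)},         0.20),
              ({(1, 1), (2, 2)},                 0.10)]
print(expected_distance(reference, candidates))
```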
What is the architectural "basis set" of the observed universe of protein structures? Using information-theoretic inference, we answer this question with a dictionary of 1,493 substructures, called concepts, typically at a subdomain level, based on an unbiased subset of known protein structures. Each concept represents a topologically conserved assembly of helices and strands that make contact. Any protein structure can be dissected into instances of concepts from this dictionary.
We have modeled modifications of a known ligand to the SARS-CoV-2 (COVID-19) protease that can form a covalent adduct as well as additional ligand-protein hydrogen bonds.
The information criterion of minimum message length (MML) provides a powerful statistical framework for inductive reasoning from observed data. We apply MML to the problem of protein sequence comparison using finite-state models with Dirichlet distributions. The resulting framework allows us to supersede the ad hoc cost functions commonly used in the field by systematically addressing the arbitrariness of alignment parameters and the disconnect between substitution scores and gap costs.
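The following is a generic illustration of the MML flavour of this approach (not the authors' finite-state alignment model): competing hypotheses are scored by the number of bits needed to encode the data, here using the marginal likelihood of a symbol sequence under an assumed Dirichlet prior as a simplified stand-in for the two-part message:

```python
# Generic illustration of the MML idea (not the authors' alignment model):
# score a hypothesis by the message length, in bits, needed to encode the data
# under it. Here the hypothesis is a Dirichlet prior over symbol probabilities,
# and the cost is the negative log of the marginal (Dirichlet-multinomial)
# probability of a particular symbol sequence with the given counts.
from math import lgamma, log

def dirichlet_multinomial_bits(counts, alpha):
    """-log2 probability of a specific symbol sequence with these counts."""
    n, a0 = sum(counts), sum(alpha)
    log_ml = lgamma(a0) - lgamma(n + a0)
    for c, a in zip(counts, alpha):
        log_ml += lgamma(c + a) - lgamma(a)
    return -log_ml / log(2.0)

# Hypothetical example: two competing priors for the same observed counts;
# the one giving the shorter message is preferred.
counts = [30, 5, 5, 10]
print(dirichlet_multinomial_bits(counts, [1.0, 1.0, 1.0, 1.0]))  # uniform prior
print(dirichlet_multinomial_bits(counts, [4.0, 1.0, 1.0, 2.0]))  # skewed prior
```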
We recently developed an unsupervised Bayesian inference methodology to automatically infer a dictionary of protein supersecondary structures (Subramanian et al., IEEE Data Compression Conference Proceedings (DCC), 340–349, 2017). Specifically, this methodology uses the information-theoretic framework of the minimum message length (MML) criterion for hypothesis selection (Wallace, Statistical and Inductive Inference by Minimum Message Length, Springer Science & Business Media, New York, 2005).
Motivation: Structural molecular biology depends crucially on computational techniques that compare protein three-dimensional structures and generate structural alignments (the assignment of one-to-one correspondences between subsets of amino acids based on atomic coordinates). Despite its importance, the structural alignment problem has not been formulated, much less solved, in a consistent and reliable way. To overcome these difficulties, we present here a statistical framework for the precise inference of structural alignments, built on the Bayesian and information-theoretic principle of Minimum Message Length (MML).
Methods for measuring genetic distances in phylogenetics are known to be sensitive to the evolutionary model assumed. However, there is a lack of established methodology to accommodate the trade-off between incorporating sufficient biological reality and avoiding model overfitting. In addition, as traditional methods measure distances based on the observed number of substitutions, they tend to underestimate distances between diverged sequences due to backward and parallel substitutions.
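For context, a textbook example of such a model-based correction (not the method proposed in this work) is the Jukes-Cantor formula, which converts an observed fraction of differing sites into an estimate of the true number of substitutions per site:

```python
# Textbook illustration (not this paper's method): under the Jukes-Cantor model,
# the observed fraction of differing sites p underestimates the true number of
# substitutions per site d; the correction is d = -(3/4) * ln(1 - 4p/3).
from math import log

def jukes_cantor_distance(p):
    if p >= 0.75:
        raise ValueError("p must be < 0.75 for the JC correction to be defined")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)

for p in (0.05, 0.20, 0.40, 0.60):
    print(f"observed p = {p:.2f}  ->  corrected d = {jukes_cantor_distance(p):.3f}")
```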
The problem of superposition of two corresponding vector sets by minimizing their sum-of-squares error under orthogonal transformation is a fundamental task in many areas of science, notably structural molecular biology. This problem can be solved exactly using an algorithm whose time complexity grows linearly with the number of correspondences. This efficient solution has facilitated the widespread use of the superposition task, particularly in studies involving macromolecular structures.
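The closed-form solution referred to here is commonly computed with a singular value decomposition; the sketch below is a generic implementation of that least-squares superposition (the Kabsch-style construction) with toy coordinates:

```python
# Generic least-squares superposition of two corresponding point sets via SVD
# (the standard Kabsch-style construction); toy coordinates for illustration.
import numpy as np

def superpose(P, Q):
    """Rotate/translate P (N x 3) onto Q (N x 3); return the RMSD after superposition."""
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)   # remove translations
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)               # covariance matrix and its SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T           # optimal rotation
    diff = Pc @ R.T - Qc
    return np.sqrt((diff ** 2).sum() / len(P))

P = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0], [0.0, 1.5, 1.0]])
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
Q = P @ Rz.T + np.array([2.0, -1.0, 0.5])             # rotated + translated copy of P
print(superpose(P, Q))                                # ~0: the sets superpose exactly
```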
Motivation: Progress in protein biology depends on the reliability of results from a handful of computational techniques, structural alignment being one of them. Recent reviews have highlighted substantial inconsistencies and differences between alignment results generated by the ever-growing stock of structural alignment programs. The lack of consensus on how the quality of structural alignments should be assessed has been identified as the main cause of the observed differences.
Atomic coordinates in the Worldwide Protein Data Bank (wwPDB) are generally reported to greater precision than the experimental structure determinations have actually achieved. By using information theory and data compression to study the compressibility of protein atomic coordinates, it is possible to quantify the amount of randomness in the coordinate data and thereby to determine the realistic precision of the reported coordinates. On average, the value of each Cα coordinate in a set of selected protein structures solved at a variety of resolutions is good to about 0.
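A back-of-the-envelope version of the link between bits per coordinate and precision, under a crude uniform-coding assumption rather than the paper's actual analysis, might look like this:

```python
# Back-of-the-envelope sketch (uniform-coding assumption, not the paper's analysis):
# storing a coordinate that ranges over W angstroms to a precision of eps angstroms
# costs about log2(W / eps) bits; conversely, b bits per coordinate implies a
# realistic precision of roughly W / 2**b.
from math import log2

def bits_for_precision(coord_range, precision):
    return log2(coord_range / precision)

def implied_precision(coord_range, bits):
    return coord_range / 2 ** bits

print(bits_for_precision(100.0, 0.001))   # bits to store 0.001 A over a 100 A cell
print(implied_precision(100.0, 10.0))     # precision implied by 10 bits per coordinate
```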
Motivation: Secondary structure underpins the folding pattern and architecture of most proteins. Accurate assignment of the secondary structure elements is therefore an important problem. Although many approximate solutions of the secondary structure assignment problem exist, the statement of the problem has resisted a consistent and mathematically rigorous definition.
Simple and concise representations of protein-folding patterns provide powerful abstractions for visualizing, comparing, classifying, searching and aligning structural data. Structures are often abstracted by replacing standard secondary structural features (that is, helices and strands of sheet) with vectors or linear segments. Relying solely on standard secondary structure may result in a significant loss of structural information.
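One simple way to compute such a linear-segment abstraction (a generic sketch, not the method of this work) is to project a run of Cα coordinates onto their principal axis:

```python
# Toy sketch (not this work's method): abstract a run of consecutive C-alpha
# coordinates as a single linear segment by projecting the points onto their
# principal axis and taking the extreme projections as the segment's endpoints.
import numpy as np

def fit_segment(coords):
    """coords: (N, 3) array of C-alpha positions; returns the two segment endpoints."""
    coords = np.asarray(coords, dtype=float)
    centroid = coords.mean(axis=0)
    # principal axis = top right-singular vector of the centred coordinates
    _, _, Vt = np.linalg.svd(coords - centroid)
    axis = Vt[0]
    t = (coords - centroid) @ axis              # scalar projections along the axis
    return centroid + t.min() * axis, centroid + t.max() * axis

# Hypothetical helix-like trace of C-alpha atoms
cas = [[1.5 * np.cos(i * 1.75), 1.5 * np.sin(i * 1.75), 1.5 * i] for i in range(8)]
start, end = fit_segment(cas)
print(start, end)
```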
A biological compression model, the expert model, is presented that is superior to existing compression algorithms in both compression performance and speed. The model is able to compress whole eukaryotic genomes. Most importantly, it provides a framework for knowledge discovery from biological data.
Background: Traditional genome alignment methods consider sequence alignment as a variation of the string edit distance problem, and perform alignment by matching characters of the two sequences. They are often computationally expensive and unable to deal with low information regions. Furthermore, they lack a well-principled objective function to measure the performance of sets of parameters.
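The string edit distance formulation mentioned here, in its simplest unit-cost form, is the classic dynamic programme sketched below:

```python
# Minimal dynamic-programming sketch of the string edit distance (Levenshtein)
# formulation mentioned above, with unit costs for mismatches and indels.
def edit_distance(a, b):
    m, n = len(a), len(b)
    prev = list(range(n + 1))                 # distances against an empty prefix of a
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # delete a[i-1]
                          curr[j - 1] + 1,    # insert b[j-1]
                          prev[j - 1] + cost) # match / substitute
        prev = curr
    return prev[n]

print(edit_distance("GATTACA", "GACTATA"))
```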
View Article and Find Full Text PDFPsychopharmacology (Berl)
February 2011
Rationale: Much evidence indicates that individuals use tobacco primarily to experience the psychopharmacological properties of nicotine. Varenicline, a partial α4β2 nicotinic acetylcholine receptor (nAChR) agonist, is effective in reducing nicotine craving and relapse in smokers, suggesting that α4β2 nAChRs may play a key role in nicotine dependence. In rats, the effect of varenicline on nicotine intake has only been studied with limited access to the drug, a model of the positive reinforcing effect of nicotine.
Background: Features of a DNA sequence can be found by compressing the sequence under a suitable model; good compression implies low information content. Good DNA compression models consider repetition, differences between repeats, and base distributions. From a linear DNA sequence, a compression model can produce a linear information sequence.
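A toy version of such an information sequence can be produced with an order-0 adaptive model (a deliberately simple stand-in for the compression models discussed above): each base's information content is the negative log probability the model assigned to it just before seeing it:

```python
# Toy sketch (order-0 adaptive model, not the paper's compression model): the
# information content of each base is -log2 of the probability the model assigns
# to it just before seeing it, yielding a linear "information sequence".
from math import log2

def information_sequence(seq, alphabet="ACGT"):
    counts = {c: 1 for c in alphabet}          # Laplace-smoothed adaptive counts
    info = []
    for base in seq:
        total = sum(counts.values())
        info.append(-log2(counts[base] / total))
        counts[base] += 1                      # update the model after each base
    return info

seq = "AAAAAAGGGGCAAAAAA"
for base, bits in zip(seq, information_sequence(seq)):
    print(f"{base}: {bits:.2f} bits")
```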
We describe an algorithm to find the longest interval having at least a specified minimum bias in a sequence of characters (bases, amino acids), e.g. 'at least 0.
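One standard way to attack this kind of problem (not necessarily the published algorithm) is to recode the characters as weights so that the bias constraint becomes a non-negative interval sum, and then find the widest such interval over the prefix sums with a monotone stack:

```python
# Sketch of one standard approach (not necessarily the published algorithm):
# recode characters inside/outside the target set as weights (1 - p) / (-p), so an
# interval has bias >= p exactly when its weight sum is >= 0; then find the widest
# interval with a non-negative sum using prefix sums and a monotone stack.
def longest_biased_interval(seq, targets, p):
    """Return (start, end) of the longest interval with bias >= p, or None."""
    prefix = [0.0]
    for ch in seq:
        prefix.append(prefix[-1] + (1.0 - p if ch in targets else -p))

    stack = []                                  # indices with strictly decreasing prefix
    for i, s in enumerate(prefix):
        if not stack or s < prefix[stack[-1]]:
            stack.append(i)

    best = None
    for j in range(len(prefix) - 1, -1, -1):    # widest i < j with prefix[i] <= prefix[j]
        while stack and prefix[stack[-1]] <= prefix[j]:
            i = stack.pop()
            if j > i and (best is None or j - i > best[1] - best[0] + 1):
                best = (i, j - 1)               # interval covers seq[i .. j-1]
    return best

print(longest_biased_interval("ATATCGCGCGCGATAT", set("CG"), 0.75))
```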