Knowledge Discovery in Databases (KDD) refers to the use of methodologies from machine learning, pattern recognition, statistics, and other fields to extract knowledge from large collections of data, where the knowledge is not explicitly available as part of the database structure. In this paper, we describe four modern data mining techniques, Rough Set Theory (RST), Association Rule Mining (ARM), Emerging Pattern Mining (EP), and Formal Concept Analysis (FCA), and we have attempted to give an exhaustive list of their chemoinformatics applications. One of the main strengths of these methods is their descriptive ability.
Spectral clustering involves placing objects into clusters based on the eigenvectors and eigenvalues of an associated matrix. The technique was first applied to molecular data by Brewer [J. Chem.
A liquid is composed of an ensemble of molecules that populate a large number of different states, so calculation of the solvation energy of a molecule in solution requires a method for summing the interactions with the environment over all of these states. The surface site interaction model for the properties of liquids at equilibrium (SSIMPLE) simplifies the surface of a molecule to a discrete number of specific interaction sites (SSIPs). The thermodynamic properties of these interaction sites can be characterised experimentally, for example, through measurement of association constants for the formation of simple complexes that feature a single H-bonding interaction.
Similarities in the 3D patterns of amino acid side chains can provide insights into their function despite the absence of any detectable sequence or fold similarities. Search for protein sites (SPRITE) and amino acid pattern search for substructures and motifs (ASSAM) are graph theoretical programs that can search for 3D amino acid side chain matches in protein structures, by representing the amino acid side chains as pseudo-atoms. The geometric relationship of the pseudo-atoms to each other as a pattern can be represented as a labeled graph where the pseudo-atoms are the graph's nodes while the edges are the inter-pseudo-atomic distances.
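The labeled-graph representation described above can be sketched as follows. This is a minimal illustration, not the SPRITE or ASSAM implementation: it assumes one pseudo-atom per residue, and the function name `build_pattern_graph`, the residue labels, and the coordinates are all hypothetical.

```python
from itertools import combinations
from math import dist

def build_pattern_graph(pseudo_atoms):
    """Build a fully connected labeled graph from pseudo-atom positions.

    pseudo_atoms maps a node label to an (x, y, z) position.
    Returns the node labels and a dict mapping each label pair to the
    distance between the two pseudo-atoms (the edge label).
    """
    nodes = list(pseudo_atoms)
    edges = {
        (a, b): dist(pos_a, pos_b)
        for (a, pos_a), (b, pos_b) in combinations(pseudo_atoms.items(), 2)
    }
    return nodes, edges

# Hypothetical three-residue pattern with made-up coordinates.
pattern = {
    "SER": (0.0, 0.0, 0.0),
    "HIS": (3.0, 4.0, 0.0),
    "ASP": (0.0, 0.0, 5.0),
}
nodes, edges = build_pattern_graph(pattern)
for pair, distance in edges.items():
    print(pair, round(distance, 2))
```

Searching a protein structure for such a pattern then becomes a subgraph matching problem in which node labels must agree and edge distances must agree within a tolerance.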
A program for overlaying multiple flexible molecules has been developed. Candidate overlays are generated by a novel fingerprint algorithm, scored on three objective functions (union volume, hydrogen-bond match, and hydrophobic match), and ranked by constrained Pareto ranking. A diverse subset of the best ranked solutions is chosen using an overlay-dissimilarity metric.
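To illustrate ranking on several objectives at once, here is a minimal sketch of plain Pareto ranking, in which a solution's rank is the number of other solutions that dominate it. It leaves out the constraints that the abstract mentions, and it assumes that every objective has been oriented so that larger is better (union volume is negated on the assumption that a smaller volume is preferred). The function names and scores are hypothetical.

```python
from typing import Sequence

Scores = Sequence[float]

def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every objective and strictly
    better on at least one (larger values are assumed to be better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_ranks(population: Sequence[Scores]) -> list[int]:
    """Rank each solution by the number of solutions that dominate it.
    Rank 0 marks the non-dominated (Pareto-optimal) set."""
    return [
        sum(dominates(other, candidate) for other in population)
        for candidate in population
    ]

# Hypothetical overlays scored as
# (-union volume, hydrogen-bond match, hydrophobic match).
overlays = [
    (-310.0, 4.0, 2.0),
    (-295.0, 3.0, 2.5),
    (-320.0, 3.0, 1.5),
    (-295.0, 4.0, 2.5),
]
ranks = pareto_ranks(overlays)
print(ranks)
```

Because no weights are needed to combine the objectives, solutions that trade one objective against another share the same rank instead of being forced into a single ordering.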
Molecular interaction fields provide a useful description of ligand binding propensity and have found widespread use in computer-aided drug design, for example, to characterize protein binding sites and in small-molecule applications, such as three-dimensional quantitative structure-activity relationships, physicochemical property prediction, and virtual screening. However, the grids on which the field data are stored are typically very large, consisting of thousands of data points, which makes them cumbersome to store and manipulate. The wavelet transform is a commonly used data compression technique, for example, in signal processing and image compression.
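The idea of wavelet compression can be sketched on a one-dimensional signal. The abstract does not say which wavelet was applied to the field grids, so the Haar wavelet is used here purely because it is the simplest; the function names, the sample values, and the threshold are assumptions made for illustration.

```python
def haar_forward(signal):
    """Full Haar decomposition of a list whose length is a power of two.
    Returns [overall average, coarsest detail, ..., finest details]."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:n] = averages + details
        n = half
    return coeffs

def haar_inverse(coeffs):
    """Invert haar_forward, rebuilding the signal level by level."""
    values = list(coeffs)
    n = 1
    while n < len(values):
        rebuilt = []
        for average, detail in zip(values[:n], values[n:2 * n]):
            rebuilt.extend([average + detail, average - detail])
        values[:2 * n] = rebuilt
        n *= 2
    return values

def compress(coeffs, threshold):
    """Zero every coefficient whose magnitude is below the threshold."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]

# Hypothetical field values sampled along one grid line.
signal = [4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 2.0, 2.0]
coeffs = haar_forward(signal)
kept = compress(coeffs, threshold=1.0)
approx = haar_inverse(kept)
print(sum(c != 0.0 for c in kept), "of", len(kept), "coefficients kept")
print(approx)
```

Smooth regions produce detail coefficients near zero, so only the few nonzero coefficients need to be stored, at the cost of a bounded reconstruction error.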
Background: It has been suggested that similarity searching using 2D fingerprints may not be suitable for scaffold hopping.
Methods: This article reports a detailed evaluation of the effectiveness of six common types of 2D fingerprints when they are used for scaffold-hopping similarity searches of the Molecular Design Limited Drug Data Report database, World of Molecular Bioactivity database and Maximum Unbiased Validation database.
Results: The results demonstrate that 2D fingerprints can be used for scaffold hopping, with novel scaffolds being identified in nearly every search that was carried out.
Molecular interaction fields such as those computed by the GRID program are widely used in applications such as virtual screening, molecular docking and 3D-QSAR modelling. They characterise molecules according to their favourable interaction sites and therefore enable predictions to be made on how molecules might interact. The fields, however, comprise a very large number of data points, which presents difficulties for many applications.
Experimental X-ray crystal structures and a database of calculated structural parameters of DNA octamers were used in combination to analyse the mechanics of DNA bending in the nucleosome core complex. The 1kx5 X-ray crystal structure of the nucleosome core complex was used to determine the relationship between local structure at the base-step level and the global superhelical conformation observed for nucleosome-bound DNA. The superhelix is characterised by a large curvature (597 degrees) in one plane and very little curvature (10 degrees) in the orthogonal plane.
Two methods are described for biasing conformational search during pharmacophore elucidation using a multiobjective genetic algorithm (MOGA). The MOGA explores conformation on-the-fly while simultaneously aligning a set of molecules such that their pharmacophoric features are maximally overlaid. By using a clique detection method to generate overlays of precomputed conformations to initialize the population (rather than starting from random), the speed of the algorithm has been increased by 2 orders of magnitude.
Chemical databases are routinely clustered, with the aim of grouping molecules which share similar structural features. Ideally, medicinal chemists are then able to browse a few representatives of the cluster in order to interpret the shared activity of the cluster members. However, when molecules are clustered using fingerprints, it may be difficult to decipher the structural commonalities which are present.
Recent comparative studies of the human and mouse genomes have revealed sets of conserved nongenic sequences (CNGs) and sets of ultraconserved elements (UCEs). Both sets of sequences, which exhibit extremely high levels of conservation, extend over hundreds of bases and have no known function. Since there is no detectable sequence homology between paralogous CNGs or UCEs in either of the species, an alignment-free technique is needed for their analysis.
Structural DNA profiles use the structural properties of the constituent octamers either to observe any characteristics of a single sequence that are unusual (a single sequence query) or to visualize a pattern common to a set of sequences (a multiple sequence query). They are an aid in understanding structural reasons for functional DNA activity. Profiles that answer single sequence queries are introduced and Profile Manager (a software application developed to automate profile generation) is presented.
Similarity-based methods for virtual screening are widely used. However, conventional searching using 2D chemical fingerprints or 2D graphs may retrieve only compounds which are structurally very similar to the original target molecule. Of particular current interest then is scaffold hopping, that is, the ability to identify molecules that belong to different chemical series but which could form the same interactions with a receptor.
J Chem Inf Model
October 2005
A crucial enabling technology for structural genomics is the development of algorithms that can predict the putative function of novel protein structures: the proposed functions can subsequently be experimentally tested by functional studies. Testable assignments of function can be made if it is possible to attribute a putative, or indeed probable, function on the basis of the shapes of the binding sites on the surface of a protein structure. However, the comparison of the surfaces of 3D protein structures is a computationally demanding task.
A database of the structural properties of all 32,896 unique DNA octamer sequences has been calculated, including information on stability, the minimum energy conformation and flexibility. The contents of the database have been analysed using a variety of Euclidean distance similarity measures. A global comparison of sequence similarity with structural similarity shows that the structural properties of DNA are much less diverse than the sequences, and that DNA sequence space is larger and more diverse than DNA structure space.
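The figure of 32,896 is what one obtains by counting an octamer and its reverse complement as the same double-helical sequence. The abstract does not state that this is the definition of "unique" used, so the sketch below should be read under that assumption; the function names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def count_unique_duplexes(length: int) -> int:
    """Count sequences of the given length, treating a sequence and its
    reverse complement as the same double-stranded molecule."""
    canonical = set()
    for bases in product("ACGT", repeat=length):
        seq = "".join(bases)
        # Keep the lexicographically smaller of the two strand readings.
        canonical.add(min(seq, reverse_complement(seq)))
    return len(canonical)

n_octamers = count_unique_duplexes(8)
print(n_octamers)
```

The same count follows in closed form: of the 4^8 = 65,536 octamers, 4^4 = 256 are their own reverse complement, giving (65,536 + 256) / 2 = 32,896.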
We have constructed the potential energy surfaces for all unique tetramers, hexamers and octamers in double helical DNA, as a function of the two principal degrees of freedom, slide and shift at the central step. From these potential energy maps, we have calculated a database of structural and flexibility properties for each of these sequences. These properties include: the values of each of the six step parameters (twist, roll, tilt, rise, slide and shift), for each step of the sequence; flexibility measures for both decrease and increase in each property value from the minimum energy conformation for the central step; and the deviation from the path of a hypothetical straight octamer.
As part of the first Critical Assessment of PRotein Interactions, round 1, we predict the structure of two protein-protein complexes, by using a genetic algorithm, GAPDOCK, in combination with surface complementarity, buried surface area, biochemical information, and human intervention. Among the five models submitted for target 1, HPr phosphocarrier protein (B. subtilis) and the hexameric HPr kinase (L.
Reduced graphs provide summary representations of chemical structures. Here, a variety of different types of reduced graphs are compared in similarity searches. The reduced graphs are found to give comparable performance to Daylight fingerprints in terms of the number of active compounds retrieved.
Recently a method (RASCAL) for determining graph similarity using a maximum common edge subgraph algorithm has been proposed which has proven to be very efficient when used to calculate the relative similarity of chemical structures represented as graphs. This paper describes heuristics which simplify a RASCAL similarity calculation by taking advantage of certain properties specific to chemical graph representations of molecular structure. These heuristics are shown experimentally to increase the efficiency of the algorithm, especially at more distant values of chemical graph similarity.