Molecular flexibility is a commonly used but not easily quantified term. It is at the core of understanding the composition and size of a conformational ensemble and contributes to many molecular properties. For many computational workflows it is necessary to reduce a conformational ensemble to meaningful representatives; however, defining these representatives and guaranteeing the ensemble's completeness is difficult.
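One common way to reduce an ensemble to representatives is to cluster conformers by pairwise RMSD and keep one member per cluster. The sketch below uses RDKit's ETKDG embedding and Butina clustering with an arbitrary molecule and cutoff, purely as an illustration of the idea rather than the workflow of the paper above.

    from rdkit import Chem
    from rdkit.Chem import AllChem
    from rdkit.ML.Cluster import Butina

    # Illustrative molecule and ensemble size; not taken from the paper above.
    mol = Chem.AddHs(Chem.MolFromSmiles("CCOC(=O)c1ccc(N)cc1"))
    cids = AllChem.EmbedMultipleConfs(mol, numConfs=100, params=AllChem.ETKDGv3())

    # Pairwise heavy-atom RMS values (lower triangle, as expected by Butina).
    rms = AllChem.GetConformerRMSMatrix(Chem.RemoveHs(mol), prealigned=False)

    # Cluster the conformers; the 1.0 angstrom cutoff is an arbitrary choice here.
    clusters = Butina.ClusterData(rms, len(cids), 1.0, isDistData=True)

    # Take the first (centroid) conformer of each cluster as a representative.
    representatives = [cluster[0] for cluster in clusters]
    print(f"{len(cids)} conformers reduced to {len(representatives)} representatives")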
Recently, we presented a method to assign atomic partial charges based on the DASH (dynamic attention-based substructure hierarchy) tree with high efficiency and quantum mechanical (QM)-like accuracy. In addition, the approach can be considered "rule based", with the rules derived from the attention values of a graph neural network, so each assignment is fully explainable by visualizing the underlying molecular substructures. In this work, we demonstrate that these hierarchically sorted substructures capture the key features of an atom's local environment and allow us to predict different atomic properties with high accuracy without building a new DASH tree for each property.
Here, we present lwreg, a lightweight yet flexible chemical registration system supporting the capture of both two-dimensional molecular structures (topologies) and three-dimensional conformers. lwreg is open source, has a simple Python API, and is designed to be easily integrated into computational workflows. In addition to lwreg itself, we also introduce a straightforward schema for storing experimental data and metadata in the registration database.
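A minimal sketch of how a registration step might look inside such a workflow, assuming the register/query/retrieve helpers and configuration dictionary described in the lwreg README; the exact function signatures and config keys may differ between versions.

    from lwreg import utils  # assumes lwreg is installed

    # Config keys and function signatures follow the project README and may
    # differ in other versions; treat this as a sketch only.
    config = utils.defaultConfig()
    config["dbname"] = "demo_registration.sqlt"

    utils.initdb(config=config, confirm=True)             # create the registration tables
    regno = utils.register(config=config, smiles="CCO")   # register a 2D structure
    hits = utils.query(config=config, smiles="CCO")       # look the structure up again
    entries = utils.retrieve(config=config, ids=hits)     # pull back the stored records
    print(regno, hits, entries)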
As part of the ongoing quest to find or construct large data sets for use in validating new machine learning (ML) approaches for bioactivity prediction, it has become distressingly common for researchers to combine literature IC50 data generated using different assays into a single data set. It is well known that there are many situations where this is a scientifically risky thing to do, even when the assays are against exactly the same target, but the risks of assays being incompatible are even higher when pulling data from large collections of literature data like ChEMBL. Here, we estimate the amount of noise present in combined data sets using cases where measurements for the same compound are reported in multiple assays against the same target.
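A hedged sketch of that kind of noise estimate: for every compound reported in more than one assay against the same target, collect the pairwise differences of the reported values. The column names and pIC50 values below are invented for illustration.

    import itertools
    import pandas as pd

    # Hypothetical ChEMBL-style extract: one row per (compound, assay) measurement.
    df = pd.DataFrame({
        "compound_id": ["C1", "C1", "C2", "C2", "C2"],
        "assay_id":    ["A1", "A2", "A1", "A2", "A3"],
        "pIC50":       [6.5, 6.9, 7.2, 7.0, 7.6],
    })

    # For every compound measured in multiple assays, collect pairwise differences.
    deltas = []
    for _, grp in df.groupby("compound_id"):
        for a, b in itertools.combinations(grp["pIC50"], 2):
            deltas.append(abs(a - b))

    deltas = pd.Series(deltas)
    print(f"mean |delta pIC50| = {deltas.mean():.2f}, std = {deltas.std():.2f}")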
Time-split cross-validation is broadly recognized as the gold standard for validating predictive models intended for use in medicinal chemistry projects. Unfortunately, this type of data is not broadly available outside of large pharmaceutical research organizations. Here we introduce the SIMPD (simulated medicinal chemistry project data) algorithm to split public data sets into training and test sets that mimic the differences observed in real-world medicinal chemistry project data sets.
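For context, a plain time split is simple to express when registration or measurement dates are available; the column names below are hypothetical.

    import pandas as pd

    # Hypothetical project data: one row per compound with a date and an activity label.
    df = pd.DataFrame({
        "smiles": ["CCO", "CCN", "CCC", "c1ccccc1"],
        "date": pd.to_datetime(["2021-01-10", "2021-03-05", "2021-06-20", "2021-09-01"]),
        "active": [0, 1, 0, 1],
    })

    # Train on the earliest 75% of compounds, test on the most recent 25%.
    df = df.sort_values("date")
    cutoff = int(0.75 * len(df))
    train, test = df.iloc[:cutoff], df.iloc[cutoff:]
    print(len(train), "training compounds,", len(test), "test compounds")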
We present a robust and computationally efficient approach for assigning partial charges of atoms in molecules. The method is based on a hierarchical tree constructed from attention values extracted from a graph neural network (GNN), which was trained to predict atomic partial charges from accurate quantum-mechanical (QM) calculations. The resulting dynamic attention-based substructure hierarchy (DASH) approach provides fast assignment of partial charges with the same accuracy as the GNN itself, is software-independent, and can easily be integrated into existing parametrization pipelines, as shown for the Open Force Field (OpenFF).
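Purely to illustrate the lookup idea (this is not the published DASH implementation), one can imagine a hierarchy of substructure patterns ordered from specific to generic, where the most specific match supplies the stored charge; all SMARTS patterns and charge values below are invented.

    from rdkit import Chem

    # Toy hierarchy: (SMARTS describing an atom environment, stored partial charge),
    # ordered from most specific to least specific. Values are made up for illustration.
    TOY_HIERARCHY = [
        ("[OX2H1]",          -0.60),  # hydroxyl oxygen
        ("[OX2]",            -0.45),  # other divalent oxygen
        ("[CX4;$(C-[OX2])]",  0.15),  # sp3 carbon bonded to oxygen
        ("[CX4]",            -0.05),  # generic sp3 carbon
        ("[#1]",              0.05),  # hydrogen
    ]

    def assign_toy_charge(mol, atom_idx):
        """Return the charge of the first (most specific) pattern matching this atom."""
        for smarts, charge in TOY_HIERARCHY:
            patt = Chem.MolFromSmarts(smarts)
            for match in mol.GetSubstructMatches(patt):
                if atom_idx == match[0]:
                    return charge
        return 0.0

    mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
    charges = [assign_toy_charge(mol, a.GetIdx()) for a in mol.GetAtoms()]
    print(charges)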
Nuclear magnetic resonance (NMR) data from NOESY (nuclear Overhauser enhancement spectroscopy) and ROESY (rotating-frame Overhauser enhancement spectroscopy) experiments can easily be combined with distance geometry (DG)-based conformer generators by modifying the molecular distance bounds matrix. In this work, we extend the modern DG-based conformer generator ETKDG, which has been shown to reproduce experimental crystal structures well for molecules ranging from small compounds to large macrocycles, to include NOE-derived interproton distances. In noeETKDG, the experimentally derived interproton distances are incorporated into the distance bounds matrix as loose upper (or lower) bounds to generate large conformer sets.
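A minimal sketch of the bounds-matrix mechanism as exposed by RDKit, with invented atom indices and distances standing in for NOE-derived restraints.

    from rdkit import Chem
    from rdkit.Chem import rdDistGeom

    mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)NC1CCCCC1"))  # illustrative molecule

    # Start from the default distance bounds matrix.
    bounds = rdDistGeom.GetMoleculeBoundsMatrix(mol)

    # Example restraint between two protons (the indices and the 4.0 angstrom upper
    # bound are invented here): upper bounds sit above the diagonal, lower bounds below.
    i, j = 10, 18
    bounds[min(i, j)][max(i, j)] = 4.0   # loose upper bound from the NOE
    bounds[max(i, j)][min(i, j)] = 1.8   # keep a generic lower bound

    params = rdDistGeom.ETKDGv3()
    params.SetBoundsMat(bounds)
    cids = rdDistGeom.EmbedMultipleConfs(mol, numConfs=50, params=params)
    print(f"generated {len(cids)} conformers under the restraint")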
Machine learning classifiers trained on class-imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5.
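A generic sketch of threshold shifting (not the exact published procedure): scan candidate decision thresholds on predicted probabilities and keep the one that maximizes a balance-aware metric such as Cohen's kappa.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import cohen_kappa_score
    from sklearn.model_selection import train_test_split

    # Synthetic imbalanced data stands in for a real bioactivity set.
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    probs = clf.predict_proba(X_tr)[:, 1]  # threshold is tuned on the training data here

    # Pick the threshold that maximizes Cohen's kappa instead of using 0.5.
    thresholds = np.arange(0.05, 0.55, 0.05)
    best_t = max(thresholds,
                 key=lambda t: cohen_kappa_score(y_tr, (probs >= t).astype(int)))

    test_pred = (clf.predict_proba(X_te)[:, 1] >= best_t).astype(int)
    print(f"chosen threshold: {best_t:.2f}, test kappa: {cohen_kappa_score(y_te, test_pred):.2f}")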
We present an implementation of the scaffold network in the open-source cheminformatics toolkit RDKit. Scaffold networks have been introduced in the literature as a powerful method to navigate and analyze large screening data sets in medicinal chemistry. Such a network is created by iteratively applying predefined fragmentation rules to the investigated set of small molecules and by linking the produced fragments according to their descent.
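A minimal example of the RDKit implementation, using an arbitrary input molecule.

    from rdkit import Chem
    from rdkit.Chem.Scaffolds import rdScaffoldNetwork

    mols = [Chem.MolFromSmiles("Cc1ccccc1-c1ncccc1OC(=O)C")]  # arbitrary input molecule

    params = rdScaffoldNetwork.ScaffoldNetworkParams()
    net = rdScaffoldNetwork.CreateScaffoldNetwork(mols, params)

    # Nodes are fragment SMILES; edges record how fragments descend from one another.
    print(f"{len(net.nodes)} nodes, {len(net.edges)} edges")
    for edge in net.edges:
        print(net.nodes[edge.beginIdx], "->", net.nodes[edge.endIdx], edge.type)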
The conformer generator ETKDG is a stochastic search method that combines distance geometry with knowledge derived from experimental crystal structures. It has been shown to generate good conformers for acyclic, flexible molecules. This work builds on ETKDG to improve conformer generation for molecules containing small or large aliphatic (i.e., non-aromatic) rings.
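Illustrative usage of the ring-aware ETKDG parameters in RDKit, with an arbitrary ring-containing molecule and conformer count.

    from rdkit import Chem
    from rdkit.Chem import AllChem

    # A macrocycle-like example molecule; any ring-containing SMILES would do.
    mol = Chem.AddHs(Chem.MolFromSmiles("C1CCCCCCCCCCC1"))

    params = AllChem.ETKDGv3()            # includes the macrocycle torsion terms
    params.useSmallRingTorsions = True    # also enable the small-ring torsion terms
    params.randomSeed = 0xf00d            # reproducibility

    cids = AllChem.EmbedMultipleConfs(mol, numConfs=20, params=params)
    print(f"embedded {len(cids)} conformers")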
Open-source workflows have become an increasingly integral part of computer-aided drug design (CADD) projects, since they allow reproducible and shareable research that can be easily transferred to other projects. Setting up, understanding, and applying such workflows involves either coding or using workflow managers that offer a graphical user interface. We previously reported the TeachOpenCADD teaching platform, which provides interactive Jupyter Notebooks (talktorials) on central CADD topics using open-source data and Python packages.
The first challenge in the 2014 competition launched by the Teach-Discover-Treat (TDT) initiative asked for the development of a tutorial for ligand-based virtual screening, based on data from a primary phenotypic high-throughput screen (HTS) against malaria. The resulting workflows were applied to select compounds from a commercial database, and a subset of those were purchased and tested experimentally for anti-malaria activity. Here, we present the two most successful workflows, both using machine-learning approaches, and report the results for the 114 compounds tested in the follow-up screen.
Experiments in the life sciences often involve tools from a variety of domains such as mass spectrometry, next-generation sequencing, or image processing. Passing data between those tools often involves complex scripts for controlling data flow, data transformation, and statistical analysis. Such scripts are not only prone to be platform-dependent, they also tend to grow as the experiment progresses and are seldom well documented, a fact that hinders the reproducibility of the experiment.
Big data is one of the key transformative factors that increasingly influence all aspects of modern life. Although this transformation brings vast opportunities, it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is no different: more and more data are being generated, for instance, by technologies such as DNA-encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods.
When analyzing chemical reactions, it is essential to know which molecules are actively involved in the reaction and which starting materials will form the product molecules. Assigning reaction roles, such as reactant, reagent, or product, to the molecules of a chemical reaction may be a trivial problem for hand-curated reaction schemes, but automating it, an essential step when handling large amounts of reaction data, is more difficult. Here, we describe a new fingerprint-based and data-driven approach to assign reaction roles that is also applicable to rather unbalanced and noisy reaction schemes.
Multiple recent studies have focused on unraveling the content of the medicinal chemist's toolbox. Here, we present an investigation of chemical reactions and molecules retrieved from U.S. patents.
Small organic molecules are often flexible, i.e., they can adopt a variety of low-energy conformations in solution that exist in equilibrium with each other.
Finding a canonical ordering of the atoms in a molecule is a prerequisite for generating a unique representation of the molecule. The canonicalization of a molecule is usually accomplished by applying some sort of graph relaxation algorithm, the most common of which is the Morgan algorithm. There are known issues with that algorithm that lead to noncanonical atom orderings as well as problems when it is applied to large molecules like proteins.
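For reference, RDKit exposes canonical ranks directly, and they can be used, for example, to renumber a molecule's atoms into canonical order.

    from rdkit import Chem

    mol = Chem.MolFromSmiles("OCC(N)C=O")  # arbitrary example

    # Canonical rank for each atom (breakTies gives a full ordering).
    ranks = list(Chem.CanonicalRankAtoms(mol, breakTies=True))

    # Build the permutation RenumberAtoms expects: new position -> old atom index.
    order = [idx for idx, _ in sorted(enumerate(ranks), key=lambda pair: pair[1])]
    canon_mol = Chem.RenumberAtoms(mol, order)

    print(ranks)
    print(Chem.MolToSmiles(canon_mol))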
Fingerprint methods applied to molecules have proven to be useful for similarity determination and as inputs to machine-learning models. Here, we present the development of a new fingerprint for chemical reactions and validate its usefulness in building machine-learning models and in similarity assessment. Our final fingerprint is constructed as the difference of the atom-pair fingerprints of products and reactants and includes agents via calculated physicochemical properties.
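A short example of computing such a difference fingerprint with RDKit, using an arbitrary reaction SMILES.

    from rdkit.Chem import rdChemReactions

    # An arbitrary esterification reaction written as reaction SMILES.
    rxn = rdChemReactions.ReactionFromSmarts(
        "CC(=O)O.OCC>>CC(=O)OCC.O", useSmiles=True
    )

    # Difference fingerprint: fingerprint(products) - fingerprint(reactants).
    fp = rdChemReactions.CreateDifferenceFingerprintForReaction(rxn)
    print(fp.GetLength(), "bits;", len(fp.GetNonzeroElements()), "nonzero counts")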
Modern high-throughput screening (HTS) is a well-established approach for hit finding in drug discovery that is routinely employed in the pharmaceutical industry to screen more than a million compounds within a few weeks. However, as the industry shifts to more disease-relevant but more complex phenotypic screens, the focus has moved to piloting smaller but smarter chemically/biologically diverse subsets followed by an expansion around hit compounds. One standard method for doing this is to train a machine-learning (ML) model with the chemical fingerprints of the tested subset of molecules and then select the next compounds based on the predictions of this model.
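A compressed sketch of that loop with hypothetical data; Morgan fingerprints and a random forest stand in for whatever descriptors and model a real campaign would use.

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator
    from sklearn.ensemble import RandomForestClassifier

    fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

    def fps(smiles_list):
        return np.array([fpgen.GetFingerprintAsNumPy(Chem.MolFromSmiles(s))
                         for s in smiles_list])

    # Hypothetical pilot-screen results and an untested pool of candidates.
    tested = ["CCO", "CCN", "c1ccccc1O", "CC(=O)O"]
    labels = [0, 0, 1, 0]
    pool = ["c1ccccc1N", "CCCl", "c1ccccc1C(=O)O"]

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(fps(tested), labels)

    # Rank the untested pool by predicted hit probability and pick the top compounds.
    scores = model.predict_proba(fps(pool))[:, 1]
    picks = [pool[i] for i in np.argsort(scores)[::-1][:2]]
    print(picks)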
The concept of data fusion, the combination of information from different sources describing the same object with the expectation of generating a more accurate representation, has found application in a very broad range of disciplines. In the context of ligand-based virtual screening (VS), data fusion has been applied to combine knowledge from either different active molecules or different fingerprints to improve similarity-search performance. Machine-learning (ML) methods based on the fusion of multiple homogeneous classifiers, in particular random forests, have also been widely applied in the ML literature.
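A small sketch of similarity-based group fusion: score each database molecule by its maximum Tanimoto similarity to any known active (all molecules below are arbitrary).

    from rdkit import Chem, DataStructs
    from rdkit.Chem import rdFingerprintGenerator

    fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

    actives = [Chem.MolFromSmiles(s) for s in ("c1ccccc1O", "c1ccccc1N")]
    database = [Chem.MolFromSmiles(s) for s in ("c1ccccc1C", "CCO", "c1ccc(O)cc1C")]

    active_fps = [fpgen.GetFingerprint(m) for m in actives]

    # MAX group fusion: each candidate keeps its best similarity to any active.
    for mol in database:
        fp = fpgen.GetFingerprint(mol)
        score = max(DataStructs.BulkTanimotoSimilarity(fp, active_fps))
        print(Chem.MolToSmiles(mol), round(score, 2))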
Fingerprint similarity is a common method for comparing chemical structures. Similarity is an appealing approach because, with many fingerprint types, it provides intuitive results: a chemist looking at two molecules can understand why they have been determined to be similar. This transparency is partially lost with the fuzzier similarity methods that are often used for scaffold hopping and tends to vanish completely when molecular fingerprints are used as inputs to machine-learning (ML) models.
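Illustrative use of RDKit's similarity-map code for a fingerprint comparison; the molecules are arbitrary and the drawing API has changed across RDKit versions, so details may need adjusting.

    from rdkit import Chem
    from rdkit.Chem.Draw import SimilarityMaps

    ref = Chem.MolFromSmiles("c1ccccc1OC")      # reference molecule
    probe = Chem.MolFromSmiles("c1ccccc1OCC")   # molecule to explain

    # Atomic weights: contribution of each probe atom to the similarity to ref.
    weights = SimilarityMaps.GetAtomicWeightsForFingerprint(
        ref, probe, SimilarityMaps.GetMorganFingerprint
    )

    # Render the weights onto the probe molecule (a matplotlib figure in older
    # RDKit versions; newer versions may expect a Draw2D object instead).
    fig = SimilarityMaps.GetSimilarityMapFromWeights(probe, weights)
    fig.savefig("similarity_map.png", bbox_inches="tight")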
Similarity-search methods using molecular fingerprints are an important tool for ligand-based virtual screening. A huge variety of fingerprints exist, and their performance, usually assessed in retrospective benchmarking studies using data sets with known actives and known or assumed inactives, depends largely on the validation data sets and the similarity measure used. Comparing new methods to existing ones in any systematic way is rather difficult due to the lack of standard data sets and evaluation procedures.