Publications by authors named "Connor Coley"

Data-driven reaction discovery and development is a growing field that relies on the use of molecular descriptors to capture key information about substrates, ligands, and targets. Broad adaptation of this strategy is hindered by the associated computational cost of descriptor calculation, especially when considering conformational flexibility. Descriptor libraries can be precomputed agnostic of application to reduce the computational burden of data-driven reaction development.

View Article and Find Full Text PDF

Recently, a new class of synthetic methyl methacrylate-based random heteropolymers (MMA-based RHPs) has displayed protein-like properties. Their function appears to be insensitive to the precise sequence. Here, through atomistic molecular dynamics simulation, we show that there are universal protein-like features of MMA-based RHPs that are insensitive to the sequence, and mostly depend on the overall composition.

View Article and Find Full Text PDF

The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), manual conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation.

View Article and Find Full Text PDF

Artificial intelligence (AI) is accelerating how we conduct science, from folding proteins with AlphaFold and summarizing literature findings with large language models, to annotating genomes and prioritizing newly generated molecules for screening using specialized software. However, the application of AI to emulate human cognition in natural product research and its subsequent impact has so far been limited. One reason for this limited impact is that available natural product data is multimodal, unbalanced, unstandardized, and scattered across many data repositories.

View Article and Find Full Text PDF

Mechanistic understanding of organic reactions can facilitate reaction development, impurity prediction, and in principle, reaction discovery. While several machine learning models have sought to address the task of predicting reaction products, their extension to predicting reaction mechanisms has been impeded by the lack of a corresponding mechanistic dataset. In this study, we construct such a dataset by imputing intermediates between experimentally reported reactants and products using expert reaction templates and train several machine learning models on the resulting dataset of 5,184,184 elementary steps.

View Article and Find Full Text PDF

Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level.

View Article and Find Full Text PDF

Small molecules exhibiting desirable property profiles are often discovered through an iterative process of designing, synthesizing and testing sets of molecules. The selection of molecules to synthesize from all possible candidates is a complex decision-making process that typically relies on expert chemist intuition. Here we propose a quantitative decision-making framework, SPARROW, that prioritizes molecules for evaluation by balancing expected information gain and synthetic cost.

View Article and Find Full Text PDF

The application of machine learning models to the prediction of reaction outcomes currently needs large and/or highly featurized data sets. We show that a chemistry-aware model, NERF, which mimics the bonding changes that occur during reactions, allows for highly accurate predictions of the outcomes of Diels-Alder reactions using a relatively small training set, with no pretraining and no additional features. We establish a diverse data set of 9537 intramolecular, hetero-, aromatic, and inverse electron demand Diels-Alder reactions.

View Article and Find Full Text PDF

Despite the increased use of computational tools to supplement medicinal chemists' expertise and intuition in drug design, predicting synthetic yields in medicinal chemistry endeavors remains an unsolved challenge. Existing design workflows could profoundly benefit from reaction yield prediction, as precious material waste could be reduced, and a greater number of relevant compounds could be delivered to advance the design, make, test, analyze (DMTA) cycle. In this work, we detail the evaluation of AbbVie's medicinal chemistry library data set to build machine learning models for the prediction of Suzuki coupling reaction yields.

View Article and Find Full Text PDF

SMARTS is a widely used language in cheminformatics for defining substructural queries for database lookups, reaction templates for chemical transformations, and other applications. As an extension to SMILES, many SMARTS patterns can represent the same query. Despite this, no canonicalization algorithm invariant of the line notation sequence or atomic numbering is publicly available.

View Article and Find Full Text PDF

The accurate prediction of tandem mass spectra from molecular structures has the potential to unlock new metabolomic discoveries by augmenting the community's libraries of experimental reference standards. Cheminformatic spectrum prediction strategies use a "bond-breaking" framework to iteratively simulate mass spectrum fragmentations, but these methods are (a) slow due to the need to exhaustively and combinatorially break molecules and (b) inaccurate as they often rely upon heuristics to predict the intensity of each resulting fragment; neural network alternatives mitigate computational cost but are black-box and not inherently more accurate. We introduce a physically grounded neural approach that learns to predict each breakage event and score the most relevant subset of molecular fragments quickly and accurately.

View Article and Find Full Text PDF

Models can codify our understanding of chemical reactivity and serve a useful purpose in the development of new synthetic processes via, for example, evaluating hypothetical reaction conditions or in silico substrate tolerance. Perhaps the most determining factor is the composition of the training data and whether it is sufficient to train a model that can make accurate predictions over the full domain of interest. Here, we discuss the design of reaction datasets in ways that are conducive to data-driven modeling, emphasizing the idea that training set diversity and model generalizability rely on the choice of molecular or reaction representation.

View Article and Find Full Text PDF
Article Synopsis
  • Scientists are trying to create small molecules that help certain proteins stick together, which can change how cells work.
  • They made about 1 million special compounds using DNA to find out which ones can connect two chosen proteins, specifically VHL and bromodomains.
  • By testing these compounds, they discovered some that could make the bromodomains disappear in cells and even got to see how one of the best compounds interacted with the proteins in a crystal structure.
View Article and Find Full Text PDF

We unveil a unified view on the effect of side chains on the glass transition temperatures () in polymer melts by using molecular dynamics simulations, density functional theory calculations, and available experimental data. We use acrylates as a model system and evaluate the effect of -alkyl side chains on . We find that backbone dihedral angle fluctuations follow established patterns due to sterics, as expected.

View Article and Find Full Text PDF

Chemical formula annotation for tandem mass spectrometry (MS/MS) data is the first step toward structurally elucidating unknown metabolites. While great strides have been made toward solving this problem, the current state-of-the-art method depends on time-intensive, proprietary, and expert-parametrized fragmentation tree construction and scoring. In this work, we extend our previous spectrum Transformer methodology into an energy-based modeling framework, MIST-CF: Metabolite Inference with Spectrum Transformers for Chemical Formula prediction, for learning to rank chemical formula and adduct assignments given an unannotated MS/MS spectrum.

View Article and Find Full Text PDF

Diversity-oriented synthesis (DOS) is a powerful strategy to prepare molecules with underrepresented features in commercial screening collections, resulting in the elucidation of novel biological mechanisms. In parallel to the development of DOS, DNA-encoded libraries (DELs) have emerged as an effective, efficient screening strategy to identify protein binders. Despite recent advancements in this field, most DEL syntheses are limited by the presence of sensitive DNA-based constructs.

View Article and Find Full Text PDF
Article Synopsis
  • Artificial intelligence is revolutionizing scientific discovery by enhancing research processes such as hypothesis generation, experiment design, and data interpretation.
  • Recent advances like self-supervised learning and geometric deep learning are improving model accuracy by utilizing vast amounts of unlabelled data and incorporating the structure of scientific data.
  • While generative AI is helping create innovations like drugs and proteins, challenges like data quality and the need for better understanding among AI developers and users persist, highlighting areas for further progress in AI research.
View Article and Find Full Text PDF

Molecular quantum mechanical modeling, accelerated by machine learning, has opened the door to high-throughput screening campaigns of complex properties, such as the activation energies of chemical reactions and absorption/emission spectra of materials and molecules; in silico. Here, we present an overview of the main principles, concepts, and design considerations involved in such hybrid computational quantum chemistry/machine learning screening workflows, with a special emphasis on some recent examples of their successful application. We end with a brief outlook of further advances that will benefit the field.

View Article and Find Full Text PDF
Article Synopsis
  • Liquid handling robots often face compatibility issues due to proprietary programming interfaces, hindering method sharing and development.
  • PyLabRobot is an open-source Python interface that works across various liquid-handling robots like Hamilton STARs and Tecan EVOs, offering a universal command set for controlling different labware and devices.
  • The system has been validated with unit tests and practical applications like a browser-based simulator and calibration tools, creating a flexible environment for lab automation.
View Article and Find Full Text PDF

The past decade has seen a number of impressive developments in predictive chemistry and reaction informatics driven by machine learning applications to computer-aided synthesis planning. While many of these developments have been made even with relatively small, bespoke data sets, in order to advance the role of AI in the field at scale, there must be significant improvements in the reporting of reaction data. Currently, the majority of publicly available data is reported in an unstructured format and heavily imbalanced toward high-yielding reactions, which influences the types of models that can be successfully trained.

View Article and Find Full Text PDF