Data-Efficient, Chemistry-Aware Machine Learning Predictions of Diels-Alder Reaction Outcomes.

Angus Keto, Taicheng Guo, Morgan Underdue, Thijs Stuyver, Connor W Coley, Xiangliang Zhang, Elizabeth H Krenske, Olaf Wiest

J Am Chem Soc

Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, Indiana 46556, United States.

Published: June 2024

The application of machine learning models to the prediction of reaction outcomes currently needs large and/or highly featurized data sets. We show that a chemistry-aware model, NERF, which mimics the bonding changes that occur during reactions, allows for highly accurate predictions of the outcomes of Diels-Alder reactions using a relatively small training set, with no pretraining and no additional features. We establish a diverse data set of 9537 intramolecular, hetero-, aromatic, and inverse electron demand Diels-Alder reactions. This data set is used to train a NERF model, and the performance is compared against state-of-the-art classification and generative machine learning models across low- and high-data regimes, with and without pretraining. The predictive accuracy (regio- and site selectivity in the major product) achieved by NERF exceeds 90% when as little as 40% of the data set is used for training. Another high-performing model, Chemformer, requires a larger training data set (>45%) and pretraining to reach 90% Top-1 accuracy. Accurate predictions of less-represented reaction subclasses, such as those involving heteroatomic or aromatic substrates, require higher percentages of training data. We also show how NERF can use small amounts of additional training data to quickly learn new systems and improve its overall understanding of reactivity. Synthetic chemists stand to benefit as this model can be rapidly expanded and tailored to areas of chemistry corresponding to the low-data regime.
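The Top-1 accuracy quoted in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `top_k_accuracy`, the placeholder product labels, and the toy data are all assumptions made for illustration. In practice, predicted and reference products would more likely be compared as canonicalized structure strings.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of reactions whose reference product is among the first k ranked predictions."""
    if len(ranked_predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one reaction is required")
    hits = sum(
        1
        for preds, ref in zip(ranked_predictions, references)
        if ref in preds[:k]
    )
    return hits / len(references)


# Hypothetical toy data: labels such as "P1" are placeholders, not real model output.
predictions = [["P1", "P2"], ["P4", "P3"], ["P5", "P6"], ["P8", "P7"]]
references = ["P1", "P3", "P5", "P7"]

top1 = top_k_accuracy(predictions, references, k=1)
top2 = top_k_accuracy(predictions, references, k=2)
print(f"Top-1: {top1:.2f}, Top-2: {top2:.2f}")
```

Under this definition, a model "reaches 90% Top-1 accuracy" when its highest-ranked product matches the reported major product for at least 90% of the held-out reactions.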

DOI: http://dx.doi.org/10.1021/jacs.4c03131

Publication Analysis

Top Keywords

data set, machine learning, training data, reaction outcomes, learning models, accurate predictions, diels-alder reactions, data, training, set

Similar Publications

ADELLE: A global testing method for trans-eQTL mapping.

PLoS Genet

January 2025

Department of Human Genetics, The University of Chicago, Chicago, Illinois, United States of America.

Takintayo Akinbiyi, Mary Sara McPeek, Mark Abney

Understanding the genetic regulatory mechanisms of gene expression is an ongoing challenge. Genetic variants that are associated with expression levels are readily identified when they are proximal to the gene (i.e.

Fitting of Soil-Water Characteristic Curves (SWCC) of Bukit Mewah, Malaysia soil using field monitoring dataset.

PLoS One

January 2025

Department of Civil and Environmental Engineering, Nazarbayev University, Nur-Sultan, Kazakhstan.

Faris Shazani Suhaizan, Aizat Mohd Taib, Mohd Raihan Taha, Dayang Zulaika Abang Hasbollah, Aniza Ibrahim

Rainfall-induced landslides are a frequent geohazard for tropical regions with prevalent residual soils and year-round rainy seasons. The water infiltration into unsaturated soil can be analyzed using the soil-water characteristic curve (SWCC) and permeability function which can be used to monitor and predict incoming landslides, showing the necessity of selecting the appropriate model parameter while fitting the SWCC model. This paper presents a set of data from six different sections of the studied slope at varying depths that are used to test the performance of three SWCC models, the van Genuchten-Mualem (vG-M), Fredlund-Xing (F-X) and Gardner (G).

Evaluation of four different standard addition approaches with respect to trueness and precision.

Anal Bioanal Chem

January 2025

Institute of Chemistry, Analytical Chemistry, University of Graz, Graz, Austria.

Gerhard Gössler, Vera Hofer, Walter Goessler

This work provides a statistical analysis of four different approaches suggested in the literature for the estimation of an unknown concentration based on data collected using the standard addition method. These approaches are the conventional extrapolation approach, the interpolation approach, inverse regression, and the normalization approach. These methods are compared under the assumption that the measurement errors are normally distributed and homoscedastic.

The influence of environmental factors on the detection and quantification of SARS-CoV-2 variants in dormitory wastewater at a primarily undergraduate institution.

Microbiol Spectr

January 2025

Department of Biology, Appalachian State University, Boone, North Carolina, USA.

Chequita Brooks, Sebrina Brooks, Josie Beasley, Jenna Valley, Michael Opata

Testing for the causative agent of coronavirus disease 2019 (COVID-19), severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has been crucial in tracking disease spread and informing public health decisions. Wastewater-based epidemiology has helped to alleviate some of the strain of testing through broader, population-level surveillance, and has been applied widely on college campuses. However, questions remain about the impact of various sampling methods, target types, environmental factors, and infrastructure variables on SARS-CoV-2 detection.

Comparing two-sample log-linear exposure estimation with Bayesian model-informed precision dosing of tobramycin in adult patients with cystic fibrosis.

Antimicrob Agents Chemother

January 2025

InsightRX, San Francisco, California, USA.

Dominic M H Tong, Maria-Stephanie A Hughes, Jasmine Hu, Jeffrey C Pearson, David W Kubiak

Tobramycin dosing in patients with cystic fibrosis (CF) is challenged by its high pharmacokinetic (PK) variability and narrow therapeutic window. Doses are typically individualized using two-sample log-linear regression (LLR) to quantify the area under the concentration-time curve (AUC). Bayesian model-informed precision dosing (MIPD) may allow dose individualization with fewer samples; however, the relative performance of these methods is unknown.
