The application of machine learning models to the prediction of reaction outcomes currently needs large and/or highly featurized data sets. We show that a chemistry-aware model, NERF, which mimics the bonding changes that occur during reactions, allows for highly accurate predictions of the outcomes of Diels-Alder reactions using a relatively small training set, with no pretraining and no additional features. We establish a diverse data set of 9537 intramolecular, hetero-, aromatic, and inverse electron demand Diels-Alder reactions. This data set is used to train a NERF model, and the performance is compared against state-of-the-art classification and generative machine learning models across low- and high-data regimes, with and without pretraining. The predictive accuracy (regio- and site selectivity in the major product) achieved by NERF exceeds 90% when as little as 40% of the data set is used for training. Another high-performing model, Chemformer, requires a larger training data set (>45%) and pretraining to reach 90% Top-1 accuracy. Accurate predictions of less-represented reaction subclasses, such as those involving heteroatomic or aromatic substrates, require higher percentages of training data. We also show how NERF can use small amounts of additional training data to quickly learn new systems and improve its overall understanding of reactivity. Synthetic chemists stand to benefit as this model can be rapidly expanded and tailored to areas of chemistry corresponding to the low-data regime.
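
The comparison above rests on Top-1 accuracy measured at different training-set fractions. A minimal sketch of that metric follows; it is not the authors' evaluation code. It assumes each product is an already-canonicalized string (for example a canonical SMILES), so exact string equality means "same major product". The function names `top1_accuracy` and `smallest_fraction_reaching` are hypothetical, and the numbers in the usage note are invented for illustration only.

```python
from typing import Dict, Optional, Sequence


def top1_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of reactions whose top-ranked prediction equals the recorded major product.

    Assumption: both sequences hold canonicalized product strings in the same order.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if len(expected) == 0:
        raise ValueError("need at least one reaction")
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return hits / len(expected)


def smallest_fraction_reaching(
    curve: Dict[float, float], threshold: float
) -> Optional[float]:
    """Smallest training fraction whose accuracy meets the threshold, or None if none does.

    `curve` maps training fraction (0 to 1) to Top-1 accuracy (0 to 1).
    """
    for fraction in sorted(curve):
        if curve[fraction] >= threshold:
            return fraction
    return None
```

For instance, with placeholder labels, `top1_accuracy(["A", "B", "C", "D"], ["A", "B", "X", "D"])` counts three matches out of four, and `smallest_fraction_reaching({0.1: 0.62, 0.3: 0.88, 0.5: 0.93}, 0.9)` picks the first fraction on that invented curve whose accuracy is at least 0.9.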

Source
http://dx.doi.org/10.1021/jacs.4c03131
