Melting point prediction employing k-nearest neighbor algorithms and genetic parameter optimization.

J Chem Inf Model

Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, United Kingdom.

Published: February 2007

We have applied the k-nearest neighbor (kNN) modeling technique to the prediction of melting points. A data set of 4119 diverse organic molecules (data set 1) and an additional set of 277 drugs (data set 2) were used to compare performance in different regions of chemical space, and we investigated the influence of the number of nearest neighbors using different types of molecular descriptors. To compute the prediction on the basis of the melting temperatures of the nearest neighbors, we used four different methods (arithmetic and geometric average, inverse distance weighting, and exponential weighting), of which the exponential weighting scheme yielded the best results. We assessed our model via a 25-fold Monte Carlo cross-validation (with approximately 30% of the total data as a test set) and optimized it using a genetic algorithm. Predictions for drugs based on drugs (separate training and test sets each taken from data set 2) were found to be considerably better [root-mean-squared error (RMSE) = 46.3 °C, r² = 0.30] than those based on nondrugs (prediction of data set 2 based on the training set from data set 1, RMSE = 50.3 °C, r² = 0.20). The optimized model yields an average RMSE as low as 46.2 °C (r² = 0.49) for data set 1, and an average RMSE of 42.2 °C (r² = 0.42) for data set 2. It is shown that the kNN method inherently introduces a systematic error in melting point prediction. Much of the remaining error can be attributed to the lack of information about interactions in the liquid state, which are not well captured by molecular descriptors.
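The four neighbor-combination schemes and the Monte Carlo cross-validation named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything specific here is an assumption: Euclidean distance in descriptor space, inverse-distance weights of the form 1/d, exponential weights of the form exp(-d), temperatures in kelvin (so the geometric mean is defined), and the hypothetical function names `knn_predict`, `rmse`, and `monte_carlo_cv`. The genetic-algorithm optimization is not shown.

```python
import numpy as np


def knn_predict(X_train, y_train, x, k=5, scheme="exponential", eps=1e-9):
    """Predict a melting point for descriptor vector x from its k nearest neighbors.

    All weighting formulas below are illustrative assumptions, not the
    paper's exact definitions.
    """
    d = np.linalg.norm(X_train - x, axis=1)  # assumed metric: Euclidean
    idx = np.argsort(d)[:k]                  # indices of the k closest molecules
    dk, yk = d[idx], y_train[idx]

    if scheme == "arithmetic":
        return float(np.mean(yk))
    if scheme == "geometric":
        # Geometric mean; needs positive values, hence kelvin is assumed.
        return float(np.exp(np.mean(np.log(yk))))
    if scheme == "inverse":
        w = 1.0 / (dk + eps)                 # eps avoids division by zero
    elif scheme == "exponential":
        w = np.exp(-dk)                      # closer neighbors count more
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return float(np.sum(w * yk) / np.sum(w))


def rmse(y_true, y_pred):
    """Root-mean-squared error between observed and predicted values."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


def monte_carlo_cv(X, y, k=5, scheme="exponential",
                   n_splits=25, test_frac=0.3, seed=0):
    """Average test RMSE over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_frac * n))
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        preds = [knn_predict(X[train], y[train], X[i], k, scheme) for i in test]
        scores.append(rmse(y[test], preds))
    return float(np.mean(scores))
```

Calling `monte_carlo_cv(X, y, k=5, scheme="exponential")` on a descriptor matrix `X` and a vector of melting points `y` returns the RMSE averaged over 25 random splits with roughly 30% of the data held out each time, mirroring the validation protocol described above.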

Source
http://dx.doi.org/10.1021/ci060149f

Publication Analysis

Top Keywords

data set: 32
set: 11
data: 9
melting point: 8
point prediction: 8
k-nearest neighbor: 8
nearest neighbors: 8
molecular descriptors: 8
weighting exponential: 8
exponential weighting: 8

Similar Publications

Background: In data-sparse areas such as health care, computer scientists aim to leverage as much available information as possible to increase the accuracy of their machine learning models' outputs. As a standard, categorical data, such as patients' gender, socioeconomic status, or skin color, are used to train models in fusion with other data types, such as medical images and text-based medical information. However, the effects of including categorical data features for model training in such data-scarce areas are underexamined, particularly regarding models intended to serve individuals equitably in a diverse population.

Introduction: High-Flow Nasal Therapy (HFNT) is an innovative non-invasive form of respiratory support. Compared to standard oxygen therapy (SOT), there is an equipoise regarding the effect of HFNT on patient-centred outcomes among those at high risk of developing postoperative pulmonary complications after undergoing cardiac surgery. The NOTACS trial aims to determine the clinical and cost-effectiveness of HFNT compared to SOT within 90 days of surgery in the United Kingdom, Australia, and New Zealand.

QUEST#4X: An Extension of QUEST#4 for Benchmarking Multireference Wave Function Methods.

J Chem Theory Comput

January 2025

Qingdao Institute for Theoretical and Computational Sciences and Center for Optics Research and Engineering, Shandong University, Qingdao 266237, China.

Given a number of data sets for evaluating the performance of single reference methods for the low-lying excited states of closed-shell molecules, a comprehensive data set for assessing the performance of multireference methods for the low-lying excited states of open-shell systems is still lacking. For this reason, we propose an extension (QUEST#4X) of the radical subset of QUEST#4 ( , , 3720) to cover 110 doublet and 39 quartet excited states. Near-exact results obtained by iterative configuration interaction with selection and second-order perturbation correction (iCIPT2) are taken as benchmark to calibrate static-dynamic-static configuration interaction (SDSCI) and static-dynamic-static second-order perturbation theory (SDSPT2), which are minimal MRCI and CI-like perturbation theory, respectively.

Improving Bond Dissociations of Reactive Machine Learning Potentials through Physics-Constrained Data Augmentation.

J Chem Inf Model

January 2025

Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States.

In the field of computational chemistry, predicting bond dissociation energies (BDEs) presents well-known challenges, particularly due to the multireference character of reactive systems. Many chemical reactions involve configurations where single-reference methods fall short, as the electronic structure can significantly change during bond breaking. As generating training data for partially broken bonds is a challenging task, even state-of-the-art reactive machine learning interatomic potentials (MLIPs) often fail to predict reliable BDEs and smooth dissociation curves.

Neuropathic pain is a debilitating complication following spinal cord injury (SCI). Currently, effective treatments for SCI-induced neuropathic pain are highly lacking. This clinical trial aimed to investigate the efficacy of combined intrathecal injection of Schwann cells (SCs) and bone marrow-derived mesenchymal stem cells (BMSCs) in improving SCI-induced neuropathic pain.
