Treating Semiempirical Hamiltonians as Flexible Machine Learning Models Yields Accurate and Interpretable Results.

J Chem Theory Comput

Department of Chemistry, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States.

Published: September 2023

Quantum chemistry provides chemists with invaluable information, but the high computational cost limits the size and type of systems that can be studied. Machine learning (ML) has emerged as a means to dramatically lower the cost while maintaining high accuracy. However, ML models often sacrifice interpretability by using components such as the artificial neural networks of deep learning that function as black boxes. These components impart the flexibility needed to learn from large volumes of data but make it difficult to gain insight into the physical or chemical basis for the predictions. Here, we demonstrate that semiempirical quantum chemical (SEQC) models can learn from large volumes of data without sacrificing interpretability. The SEQC model is that of density-functional-based tight binding (DFTB) with fixed atomic orbital energies and interactions that are one-dimensional functions of the interatomic distance. This model is trained to data in a manner that is analogous to that used to train deep learning models. Using benchmarks that reflect the accuracy of the training data, we show that the resulting model maintains a physically reasonable functional form while achieving an accuracy, relative to coupled cluster energies with a complete basis set extrapolation (CCSD(T)*/CBS), that is comparable to that of density functional theory (DFT). This suggests that trained SEQC models can achieve a low computational cost and high accuracy without sacrificing interpretability. Use of a physically motivated model form also substantially reduces the amount of data needed to train the model compared to that required for deep learning models.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10536991PMC
http://dx.doi.org/10.1021/acs.jctc.3c00491DOI Listing

Publication Analysis

Top Keywords

learning models
12
deep learning
12
machine learning
8
computational cost
8
high accuracy
8
learn large
8
large volumes
8
volumes data
8
seqc models
8
sacrificing interpretability
8

Similar Publications

Background: Pancreatic cancer is characterized by a complex tumor microenvironment that hinders effective immunotherapy. Identifying key factors that regulate the immunosuppressive landscape is crucial for improving treatment strategies.

Methods: We constructed a prognostic and risk assessment model for pancreatic cancer using 101 machine learning algorithms, identifying OSBPL3 as a key gene associated with disease progression and prognosis.

View Article and Find Full Text PDF

Background: Urinary tract infection (UTI) is a frequent health-threatening condition. Early reliable diagnosis of UTI helps to prevent misuse or overuse of antibiotics and hence prevent antibiotic resistance. The gold standard for UTI diagnosis is urine culture which is a time-consuming and also an error prone method.

View Article and Find Full Text PDF

A machine learning model accurately identifies glycogen storage disease Ia patients based on plasma acylcarnitine profiles.

Orphanet J Rare Dis

January 2025

Laboratory of Metabolic Diseases, Department of Laboratory Medicine, University Medical Center Groningen, University of Groningen, Hanzeplein 1, Postbus, Groningen, 30001 - 9700 RB, the Netherlands.

Background: Glycogen storage disease (GSD) Ia is an ultra-rare inherited disorder of carbohydrate metabolism. Patients often present in the first months of life with fasting hypoketotic hypoglycemia and hepatomegaly. The diagnosis of GSD Ia relies on a combination of different biomarkers, mostly routine clinical chemical markers and subsequent genetic confirmation.

View Article and Find Full Text PDF

Background: Patients supported by extracorporeal membrane oxygenation (ECMO) are at a high risk of brain injury, contributing to significant morbidity and mortality. This study aimed to employ machine learning (ML) techniques to predict brain injury in pediatric patients ECMO and identify key variables for future research.

Methods: Data from pediatric patients undergoing ECMO were collected from the Chinese Society of Extracorporeal Life Support (CSECLS) registry database and local hospitals.

View Article and Find Full Text PDF

Explainable unsupervised anomaly detection for healthcare insurance data.

BMC Med Inform Decis Mak

January 2025

Department of Electrical Engineering, ESAT-STADIUS, KU Leuven, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium.

Background: Waste and fraud are important problems for health insurers to deal with. With the advent of big data, these insurers are looking more and more towards data mining and machine learning methods to help in detecting waste and fraud. However, labeled data is costly and difficult to acquire as it requires expert investigators and known care providers with atypical behavior.

View Article and Find Full Text PDF

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!