We have developed an algorithm to generate a new spectra-based descriptor, called SpectraFP, to digitize the chemical shifts of 13C NMR spectra, as well as potentially important data from other spectroscopic techniques. This descriptor is a fingerprint vector of defined size with values of 0 and 1, and it can correct for chemical shift fluctuations. To explore the applicability of SpectraFP, we outlined two application scenarios: (1) the prediction of six functional groups by machine learning (ML) models and (2) the search for structures based on the similarity between the query spectrum and spectra in an experimental database, both in the SpectraFP format. For each functional group, five ML models were built and validated following the OECD principles: internal and external validation, applicability domain, and mechanistic interpretation. All the models showed high goodness-of-fit for the training and test sets, with MCC between 0.626 and 0.909 and between 0.653 and 0.917, respectively, and ranging from 0.812 to 0.957 and 0.825 to 0.961. The mechanistic interpretations of the models were explored using the SHAP (SHapley Additive exPlanations) approach; the results indicated that the variables most important to the models' decisions were consistent with the expected chemical shifts for each functional group. Several metrics, including Tanimoto, geometric, arithmetic, and Tversky, can be used to calculate similarity in the search algorithm. This algorithm can also incorporate additional variables, such as the correction parameter and the difference between the number of signals in the query spectrum and in the database spectra, while remaining fast. We hope that our descriptor can link information from spectroscopic/spectrometric techniques with ML models to expand the possibilities for understanding in the field of cheminformatics.
All databases and algorithms developed for this work are open source and freely accessible.
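
To make the idea concrete, the following is a minimal Python sketch of a binary chemical-shift fingerprint and a similarity search over it. It is not the SpectraFP implementation: the function names (`spectra_fp`, `tanimoto`, `tversky`, `search`), the 0 to 240 ppm range, the 1 ppm bin width, and the interpretation of the correction parameter as "also switch on neighbouring bins" are all assumptions made for illustration, and the chemical shifts in the usage example are approximate illustrative values.

```python
def spectra_fp(shifts, start=0.0, stop=240.0, step=1.0, correction=0):
    """Binary fingerprint of a peak list (hypothetical helper).

    Bit i is set to 1 when a peak falls in [start + i*step, start + (i+1)*step).
    ASSUMPTION: `correction` widens each peak by that many bins on either
    side, as one possible way to tolerate chemical shift fluctuations.
    """
    n_bits = int(round((stop - start) / step))
    bits = [0] * n_bits
    for shift in shifts:
        idx = int((shift - start) // step)
        if not 0 <= idx < n_bits:
            continue  # peaks outside the covered range are ignored
        lo = max(0, idx - correction)
        hi = min(n_bits - 1, idx + correction)
        for j in range(lo, hi + 1):
            bits[j] = 1
    return bits


def tanimoto(a, b):
    """Tanimoto coefficient: shared on-bits / on-bits in either vector."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky index; alpha = beta = 1 reduces to the Tanimoto coefficient."""
    both = sum(x & y for x, y in zip(a, b))
    only_a = sum(x & (1 - y) for x, y in zip(a, b))
    only_b = sum((1 - x) & y for x, y in zip(a, b))
    denom = both + alpha * only_a + beta * only_b
    return both / denom if denom else 0.0


def search(query_shifts, database, correction=0, max_signal_diff=None,
           metric=tanimoto):
    """Rank database entries by fingerprint similarity to the query.

    ASSUMPTION: `max_signal_diff` skips entries whose number of signals
    differs from the query by more than the given amount.
    """
    query_fp = spectra_fp(query_shifts, correction=correction)
    hits = []
    for name, shifts in database.items():
        if (max_signal_diff is not None
                and abs(len(shifts) - len(query_shifts)) > max_signal_diff):
            continue
        fp = spectra_fp(shifts, correction=correction)
        hits.append((name, metric(query_fp, fp)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Usage with a tiny illustrative database (approximate shifts, in ppm).
database = {
    "ethyl acetate": [14.2, 21.0, 60.4, 171.1],
    "ethanol": [18.1, 57.8],
}
print(search([14.0, 21.3, 60.1, 171.4], database))
```

In this sketch the query and the ethyl acetate entry land in the same four 1 ppm bins, so that entry ranks first with a similarity of 1.0 even though no shift matches exactly. Binning alone absorbs small fluctuations only when both peaks fall in the same bin; a nonzero `correction` is one way to cover peaks that straddle a bin boundary, at the cost of more false matches.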

Source: http://dx.doi.org/10.1039/d3cp00734k

Similar Publications

Advanced Mass-Spectra-Based Machine Learning for Predicting the Toxicity of Traditional Chinese Medicines.

Anal Chem

December 2024

Institute of Environmental Research at Greater Bay Area, Key Laboratory for Water Quality and Conservation of the Pearl River Delta, Ministry of Education, Guangzhou University, Guangzhou 510006, China.

Traditional Chinese medicine (TCM) has been a cornerstone of health care for centuries, valued for its preventive and therapeutic properties. However, recent decades have revealed significant toxicological concerns associated with TCMs due to their complex chemical compositions. Traditional QSAR (quantitative structure-activity relationships) models, which predict toxicity based on chemical structures, face challenges with the intricate nature of TCM compounds.

Differential mass spectrometry correlated with quantum chemical calculations (QCC-ΔMS) has been shown to be an efficient tool for the chemical structure identification (CSI) of isomers with similar mass spectra. For this type of analysis, we report here a new strategy based on ordering (ORD), linear correlation (LCOR) algorithms, and their coupling, to filter the most probable structures corresponding to similar mass spectra belonging to a group with dozens of isomers (e.g.

On the use of 1H and 13C 1D NMR spectra as QSPR descriptors.

J Chem Inf Model

September 2006

Institute for Molecules and Materials, Radboud University Nijmegen, Toernooiveld 1, NL-6525 ED Nijmegen, The Netherlands.

Recently, 1D NMR and IR spectra have been proposed as descriptors containing 3D information and, as such, are said to be suitable for building QSAR and QSPR models where 3D molecular geometries matter, for example, in binding affinities. This paper presents a study on the predictive power of 1D NMR spectra-based QSPR models using simulated proton and carbon 1D NMR spectra.
