A cautionary tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement.

PLoS One

Markey Cancer Center, University of Kentucky, Lexington, Kentucky, United States of America.

Published: May 2024

The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several previously published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset, because copies of the same entry can land in both the training and the testing portion of a fold. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting previously published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
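The leakage mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy records, the pathway labels, and the helper names `deduplicate`, `k_fold_indices`, and `leaked_fraction` are all hypothetical, and duplicates are detected only by exact match of the SMILES string and labels (chemically equivalent but differently written SMILES strings would additionally need canonicalization, which is not shown here).

```python
import random

# Hypothetical toy dataset: (SMILES string, pathway labels). Not real KEGG data.
records = [
    ("CCO", ("pathway_A",)),
    ("CC(=O)O", ("pathway_A", "pathway_B")),
    ("CCO", ("pathway_A",)),          # duplicate entry
    ("NCC(=O)O", ("pathway_B",)),
    ("OCC(O)CO", ("pathway_C",)),
    ("CCO", ("pathway_A",)),          # duplicate entry
    ("CC(C)O", ("pathway_C",)),
    ("c1ccccc1", ("pathway_B",)),
    ("CCN", ("pathway_A",)),
    ("CC(=O)C", ("pathway_C",)),
]


def deduplicate(entries):
    """Drop exact repeats, keeping the first occurrence of each entry in order."""
    return list(dict.fromkeys(entries))


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def leaked_fraction(entries, k=5, seed=0):
    """Fraction of test entries that also occur in the matching training set."""
    leaked = 0
    for test_idx in k_fold_indices(len(entries), k, seed):
        held_out = set(test_idx)
        train = {entries[i] for i in range(len(entries)) if i not in held_out}
        leaked += sum(entries[i] in train for i in test_idx)
    return leaked / len(entries)


unique = deduplicate(records)
print("entries before/after de-duplication:", len(records), len(unique))
print("leaked test fraction with duplicates:", leaked_fraction(records))
print("leaked test fraction after de-duplication:", leaked_fraction(unique))
```

In this toy setting, any test entry that also sits in the training set can be "predicted" by memorization, which is why a check of this kind is worth running before reporting cross-validation scores on a reused benchmark.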

Full text:
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11065254
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0299583

Publication Analysis

Top Keywords

machine learning

kegg-smiles dataset

learning approaches

k-fold cross-validation

cautionary tale

tale properly

properly vetting

metabolite-to-pathway mappings

training dataset

prior published

Similar Publications

Microfluidic and Computational Tools for Neurodegeneration Studies.

Annu Rev Chem Biomol Eng

January 2025

Department of Chemical & Biomolecular Engineering, North Carolina State University, Raleigh, North Carolina, USA.

Kin Gomez, Victoria R Yarmey, Hrishikesh Mane, Adriana San-Miguel

Understanding the molecular, cellular, and physiological components of neurodegenerative diseases (NDs) is paramount for developing accurate diagnostics and efficacious therapies. However, the complexity of ND pathology and the limitations associated with conventional analytical methods undermine research. Fortunately, microfluidic technology can facilitate discoveries through improved biomarker quantification, brain organoid culture, and small animal model manipulation.

A Novel Preoperative Scoring System to Accurately Predict Cord-Level Intraoperative Neuromonitoring Data Loss During Spinal Deformity Surgery: A Machine-Learning Approach.

J Bone Joint Surg Am

November 2024

Department of Orthopedic Surgery, Columbia University Irving Medical Center, New York, NY.

Nathan J Lee, Lawrence G Lenke, Varun Arvind, Ted Shi, Alexandra C Dionne

Background: An accurate knowledge of a patient's risk of cord-level intraoperative neuromonitoring (IONM) data loss is important for an informed decision-making process prior to deformity correction, but no prediction tool currently exists.

Methods: A total of 1,106 patients with spinal deformity and 205 perioperative variables were included. A stepwise machine-learning (ML) approach using random forest (RF) analysis and multivariable logistic regression was performed.

Correction to: Circulating miRNAs and Machine Learning for Lateralizing Primary Aldosteronism.

Hypertension

February 2025

Improving early prediction of crop yield in Spanish olive groves using satellite imagery and machine learning.

PLoS One

January 2025

Department of Computer Science, University of Jaén, Jaén, Spain.

M Isabel Ramos, Juan J Cubillas, Ruth M Córdoba, Lidia M Ortega

In the production sector, the usefulness of predictive systems as a tool for management and decision-making is well known. In the agricultural sector, a correct economic balance of the farm depends on making the right decisions. For this purpose, having information in advance on crop yields is an extraordinary help.

A novel multi-user collaborative cognitive radio spectrum sensing model: Based on a CNN-LSTM model.

PLoS One

January 2025

School of Electronic Information Engineering, Inner Mongolia University, Hohhot, Inner Mongolia, China.

Kai Wang, Yangyang Chen, Dan Bo, Shubin Wang

Cognitive Radio (CR) technology enables wireless devices to learn about their surrounding spectrum environment through sensing capabilities, thereby facilitating efficient spectrum utilization without interfering with the normal operation of licensed users. This study aims to enhance spectrum sensing in multi-user cooperative cognitive radio systems by leveraging a hybrid model that combines Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. A novel multi-user cooperative spectrum sensing model is developed, utilizing CNN's local feature extraction capability and LSTM's advantage in handling sequential data to optimize sensing accuracy and efficiency.
