Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites.

Erik D Huckvale, Christian D Powell, Huan Jin, Hunter N B Moseley

Metabolites

Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA.

Published: November 2023

Metabolic pathways are human-defined groupings of life-sustaining biochemical reactions, with metabolites serving as both the reactants and products of these reactions. However, many public datasets include identified metabolites whose pathway involvement is unknown, which hinders metabolic interpretation. To address this shortcoming, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites from their chemical descriptions. However, these prior models are based on old KEGG-derived metabolite datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from KEGG, following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models. The best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, achieved by XGBoost binary classification models for 11 KEGG-defined pathway categories.
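
The two metrics reported above, F1 score and Matthews correlation coefficient (MCC), can be illustrated with a minimal sketch for a single binary pathway-category classifier. This is not the authors' code: the function names, the toy labels, and the choice of weights for the weighted average are assumptions made only for illustration, since the abstract does not state how the weights were defined.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def weighted_average(scores, weights):
    """Weighted mean of per-category scores (the weights are an assumption)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical toy labels for one pathway category: 1 = metabolite is involved.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f1_score(tp, fp, fn), matthews_corrcoef(tp, tn, fp, fn))
```

In a setting like the one described, each of the 11 pathway categories would have its own binary classifier, and `weighted_average` would combine the per-category scores using whatever weights the evaluation defines (for example, the number of positive entries per category, which is an assumption here).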

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10673125
DOI: http://dx.doi.org/10.3390/metabo13111120

Top Keywords: benchmark dataset; pathway involvement; machine learning; learning models; predict pathway; involvement metabolites; benchmark; models; dataset; dataset training

Similar Publications

EMS3D-KITTI: Synthetic 3D dataset in KITTI format with a fair distribution of Emergency Medical Services vehicles for autodrive AI model training.

Data Brief

February 2025

North Carolina Agricultural and Technical State University, 1601 E Market St, Greensboro, NC 27411, United States.

Chandra Jaiswal, Sally Acquaah, Christopher Nenebi, Issa AlHmoud, Akm Kamrul Islam

Contemporary research in 3D object detection for autonomous driving primarily focuses on identifying standard entities like vehicles and pedestrians. However, the need for large, precisely labelled datasets limits the detection of specialized and less common objects, such as Emergency Medical Service (EMS) and law enforcement vehicles. To address this, we leveraged the Car Learning to Act (CARLA) simulator to generate and fairly distribute rare EMS vehicles, automatically labelling these objects in 3D point cloud data.

A comprehensive image dataset for the identification of lemon leaf diseases and computer vision applications.

Data Brief

February 2025

Department of CSE, Daffodil International University, Bangladesh.

A K M Fazlul Kobir Siam, Prayma Bishshash, Md Asraful Sharker Nirob, Sajib Bin Mamun, Md Assaduzzaman

A comprehensive dataset on lemon leaf disease can support agricultural research and improve disease management strategies. This dataset was developed from 1354 raw images taken under the guidance of professional agricultural specialists from July to September 2024 in Charpolisha, Jamalpur, and was further expanded with augmentation techniques, adding 9000 images. The augmentation process applies a set of techniques (flipping, rotation, zooming, shifting, adding noise, shearing, and brightening) to increase the variety of lemon leaf conditions represented.

A high-resolution three-year dataset supporting rooftop photovoltaics (PV) generation analytics.

Sci Data

January 2025

Sustainability/Net-Zero Office, The Hong Kong University of Science and Technology, Hong Kong SAR, China.

Zinan Lin, Qi Zhou, Zhe Wang, Ce Wang, Davis Boyd Bookhart

This paper presents an open-source dataset intended to enhance the analysis and optimization of photovoltaic (PV) power generation in urban environments, serving as a resource for various applications in solar energy research and development. The dataset comprises measured PV power generation data and corresponding on-site weather data gathered from 60 grid-connected rooftop PV stations in Hong Kong over a three-year period (2021-2023). The PV power generation data were collected at 5-minute intervals at the inverter level.

A graph neural network-based model with out-of-distribution robustness for enhancing antiretroviral therapy outcome prediction for HIV-1.

Comput Med Imaging Graph

January 2025

Sapienza University of Rome, Department of Computer Control and Management Engineering Antonio Ruberti, 00185, Rome, Italy.

Giulia Di Teodoro, Federico Siciliano, Valerio Guarrasi, Anne-Mieke Vandamme, Valeria Ghisetti

Predicting the outcome of antiretroviral therapies (ART) for HIV-1 is a pressing clinical challenge, especially when the ART includes drugs with limited effectiveness data. This scarcity of data can arise either from the introduction of a new drug to the market or from limited use in clinical settings, resulting in clinical datasets with highly unbalanced therapy representation. To tackle this issue, we introduce a novel joint fusion model, which combines features from a Fully Connected (FC) Neural Network and a Graph Neural Network (GNN) in a multi-modality fashion.

A multi-level feature fusion artificial neural network for classification of acoustic emission signals.

Ann N Y Acad Sci

January 2025

Hainan Institute, Zhejiang University, Sanya, China.

Jinliang Huang, Zhaolin Zhu, Zhihao Chen, Haotian Lu, Zijin Yang

In this paper, we introduce FUSION-ANN, a novel artificial neural network (ANN) designed for acoustic emission (AE) signal classification. FUSION-ANN comprises four distinct ANN branches, each housing an independent multilayer perceptron. To represent AE signals, we extract denoised features commonly used in speech recognition, such as linear predictive coding, Mel-frequency cepstral coefficients, and gammatone cepstral coefficients.
