Metabolic pathways are human-defined groupings of life-sustaining biochemical reactions, with metabolites serving as both the reactants and the products of these reactions. However, many public datasets include identified metabolites whose pathway involvement is unknown, which hinders metabolic interpretation. To address this gap, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites from their chemical descriptions. However, these prior models are based on outdated KEGG-derived metabolite datasets, including one benchmark dataset that is invalid because it contains over 1500 duplicate entries. We have therefore developed a new benchmark dataset derived from KEGG, following optimal standards of computational scientific reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. Using this new benchmark dataset with our atom coloring methodology, we trained and compared Random Forest, XGBoost, and multilayer perceptron with autoencoder models. The best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, achieved by XGBoost binary classification models for 11 KEGG-defined pathway categories.
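For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how an F1 score and a Matthews correlation coefficient can be computed from the confusion counts of one binary classifier, and how per-category scores might be combined into a weighted average. The function names are hypothetical, and the weighting scheme (for example, weighting each pathway category by its number of entries) is an assumption; the abstract does not specify how the authors computed their weighted average.

```python
import math

def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def weighted_average(scores, weights):
    """Weighted mean of per-category scores.

    Assumption: the weights are chosen by the caller (e.g. entries per
    pathway category); this is not taken from the paper.
    """
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total
```

Under these definitions, one binary model per pathway category yields one F1 and one MCC value, and `weighted_average` collapses the per-category values into a single summary figure.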

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10673125
DOI: http://dx.doi.org/10.3390/metabo13111120
