Background: For the recruitment and monitoring of subjects in therapy studies, it is important to predict whether subjects with mild cognitive impairment (MCI) will go on to develop Alzheimer's disease (AD). Machine learning (ML) can improve early AD prediction. The etiology of AD is heterogeneous, which leads to high variability in disease patterns. Further variability originates from multicentric study designs, varying acquisition protocols, and errors in the preprocessing of magnetic resonance imaging (MRI) scans. This high variability makes it difficult to separate signal from noise and may lead to overfitting. This article examines whether an automatic and fair data valuation method based on Shapley values can identify the most informative subjects and thereby improve ML classification.

Methods: An ML workflow was developed and trained on a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Validation was performed on an independent ADNI test set and on the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric MRI feature extraction, feature selection, sample selection using Data Shapley, random forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation.
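
The idea behind sample selection with Data Shapley can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain Monte Carlo permutation estimate, a toy one-dimensional data set, and a 1-nearest-neighbour classifier standing in for the RF model. All names (`knn_accuracy`, `monte_carlo_data_shapley`, `train`, `valid`) are hypothetical.

```python
import random


def knn_accuracy(train, valid):
    """Accuracy of a 1-nearest-neighbour classifier on `valid`.

    Assumption: an empty training set scores the chance level 0.5.
    """
    if not train:
        return 0.5
    correct = 0
    for x, y in valid:
        nearest = min(train, key=lambda point: abs(point[0] - x))
        correct += int(nearest[1] == y)
    return correct / len(valid)


def monte_carlo_data_shapley(train, valid, n_permutations=200, seed=0):
    """Estimate each training point's value as its average marginal
    contribution to validation accuracy over random orderings."""
    rng = random.Random(seed)
    values = [0.0] * len(train)
    for _ in range(n_permutations):
        order = list(range(len(train)))
        rng.shuffle(order)
        subset = []
        previous = knn_accuracy(subset, valid)
        for index in order:
            subset.append(train[index])
            score = knn_accuracy(subset, valid)
            values[index] += score - previous
            previous = score
    return [total / n_permutations for total in values]


# Toy data: (feature, label). The last training point sits in the
# class-0 region but carries label 1, i.e. it is deliberately mislabelled.
train = [(0.0, 0), (0.2, 0), (0.4, 0), (1.0, 1), (1.2, 1), (1.4, 1), (-0.5, 1)]
valid = [(-0.4, 0), (-0.3, 0), (0.1, 0), (0.3, 0),
         (0.9, 1), (1.1, 1), (1.3, 1), (1.5, 1)]

values = monte_carlo_data_shapley(train, valid)

# Rank training points from least to most valuable; the lowest-ranked
# points are the candidates for exclusion before retraining.
ranking = sorted(range(len(train)), key=lambda index: values[index])
print(ranking, [round(value, 3) for value in values])
```

In this toy setup the mislabelled point can only leave validation accuracy unchanged or lower it, so its estimated value is negative and it is ranked first for exclusion. How many low-valued points to drop is a separate choice; the abstract states that the cutoffs were calculated on an independent validation set.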

Results: The RF models, which excluded 134 of the 467 training subjects based on their RF Data Shapley values, outperformed the base models, which reached a mean accuracy of 62.64%, by 5.76% (3.61 percentage points) on the independent ADNI test set. The XGBoost base models reached a mean accuracy of 60.00% on the AIBL data set. Excluding the 133 subjects with the smallest RF Data Shapley values improved this classification accuracy by 2.98% (1.79 percentage points). The cutoff values were calculated using an independent validation set.
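
The results report each gain twice, as a relative improvement and as an absolute gain in percentage points. The short check below shows how the two figures relate, using only the numbers given above; the helper name `relative_gain` is hypothetical.

```python
def relative_gain(base_accuracy, gain_points):
    """Relative improvement in percent, given a base accuracy in percent
    and an absolute gain in percentage points."""
    return 100.0 * gain_points / base_accuracy


adni_relative = relative_gain(62.64, 3.61)  # RF models, independent ADNI test set
aibl_relative = relative_gain(60.00, 1.79)  # XGBoost models, AIBL data set

print(f"ADNI: {adni_relative:.2f}%  AIBL: {aibl_relative:.2f}%")
# prints "ADNI: 5.76%  AIBL: 2.98%"
```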

Conclusion: The Data Shapley method was able to improve the mean accuracies for the test sets. The most informative subjects were associated with the number of ApolipoproteinE ε4 (ApoE ε4) alleles, cognitive test results, and volumetric MRI measurements.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8444618
DOI: http://dx.doi.org/10.1186/s13195-021-00879-4
