Data analysis with Shapley values for automatic subject selection in Alzheimer's disease data sets using interpretable machine learning.

Alzheimers Res Ther

Department of Computer Science, University of Applied Sciences and Arts Dortmund, Dortmund, 44227, Germany.

Published: September 2021

Background: When recruiting and monitoring subjects for therapy studies, it is important to predict whether subjects with mild cognitive impairment (MCI) will go on to develop Alzheimer's disease (AD). Machine learning (ML) can improve early AD prediction. The etiology of AD is heterogeneous, which leads to high variability in disease patterns. Further variability originates from multicentric study designs, varying acquisition protocols, and errors in the preprocessing of magnetic resonance imaging (MRI) scans. This high variability makes it difficult to separate signal from noise and may lead to overfitting. This article examines whether an automatic and fair data valuation method based on Shapley values can identify the most informative subjects and thereby improve ML classification.

Methods: An ML workflow was developed and trained on a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Validation was performed on an independent ADNI test set and on the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric MRI feature extraction, feature selection, sample selection using Data Shapley, model training with random forest (RF) and eXtreme Gradient Boosting (XGBoost), and model interpretation with Kernel SHapley Additive exPlanations (SHAP) values.
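The sample selection step can be illustrated with a minimal Monte Carlo sketch of Data Shapley. It is not the authors' implementation. The toy data, the nearest-centroid classifier (a stand-in for the RF models named above), the function names, and the choice of k are all assumptions made for illustration. Each training subject is credited with its average marginal contribution to validation accuracy over random orderings of the training set.

```python
import random

def nearest_centroid_accuracy(train, val):
    """Stand-in utility: fit class centroids on `train`, return accuracy on `val`."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    if len(counts) < 2:
        # Assumption: with fewer than two classes seen, score chance level (binary task).
        return 0.5
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

    return sum(predict(x) == y for x, y in val) / len(val)

def monte_carlo_data_shapley(train, val, n_permutations=200, seed=0):
    """Average marginal gain in validation accuracy of each training point."""
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset = []
        previous = nearest_centroid_accuracy(subset, val)
        for idx in order:
            subset.append(train[idx])
            current = nearest_centroid_accuracy(subset, val)
            values[idx] += current - previous
            previous = current
    return [v / n_permutations for v in values]

# Hypothetical toy data: (features, label); the last training point is mislabeled.
train = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((1.0, 0.9), 1),
         ((0.9, 1.1), 1), ((1.0, 1.0), 0)]
val = [((0.1, 0.1), 0), ((0.0, 0.2), 0), ((1.0, 1.1), 1), ((0.8, 0.9), 1)]

values = monte_carlo_data_shapley(train, val)

# Hypothetical exclusion step: drop the k lowest-valued subjects.
# k is arbitrary here; the abstract derives its cutoff from a validation set.
k = 1
ranked = sorted(range(len(train)), key=lambda i: values[i])
kept = [train[i] for i in ranked[k:]]
```

By construction, the values sum to the difference between the utility of the full training set and the utility of the empty set, which is a useful sanity check on any implementation.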

Results: On the independent ADNI test set, the base models reached a mean accuracy of 62.64%. The RF models that excluded 134 of the 467 training subjects based on their RF Data Shapley values outperformed these base models by 5.76% (3.61 percentage points). On the AIBL data set, the XGBoost base models reached a mean accuracy of 60.00%. Excluding the 133 subjects with the smallest RF Data Shapley values improved the classification accuracy by 2.98% (1.79 percentage points). The cutoff values were calculated using an independent validation set.
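The relative and absolute improvements reported above can be cross-checked with a short calculation. The function name is hypothetical, and the resulting accuracies after exclusion are derived here from the reported figures, not stated in the abstract.

```python
def summarize_gain(base_accuracy_pct, gain_pct_points):
    """Return (accuracy after exclusion, relative improvement), both in percent."""
    new_accuracy = base_accuracy_pct + gain_pct_points
    relative_gain = gain_pct_points / base_accuracy_pct * 100
    return round(new_accuracy, 2), round(relative_gain, 2)

# Figures taken from the Results paragraph.
adni_new, adni_relative = summarize_gain(62.64, 3.61)  # RF, independent ADNI test set
aibl_new, aibl_relative = summarize_gain(60.00, 1.79)  # XGBoost, AIBL data set

print(adni_new, adni_relative)
print(aibl_new, aibl_relative)
```

The relative figures match the reported 5.76% and 2.98%, so the percentages in the text are relative gains and the parenthesized values are the absolute differences in accuracy.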

Conclusion: The Data Shapley method was able to improve the mean accuracies for the test sets. The most informative subjects were associated with the number of ApolipoproteinE ε4 (ApoE ε4) alleles, cognitive test results, and volumetric MRI measurements.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8444618
DOI: http://dx.doi.org/10.1186/s13195-021-00879-4

Publication Analysis

Top Keywords: shapley values; data shapley; alzheimer's disease; data; machine learning; high variability; informative subjects; independent adni; adni test; test set

Similar Publications

Prognostic value of multi-PLD ASL radiomics in acute ischemic stroke.

Front Neurol

January 2025

Department of Radiology, Affiliated Hospital 6 of Nantong University, Yancheng Third People's Hospital, Yancheng, Jiangsu, China.

Zhenyu Wang, Yuan Shen, Xianxian Zhang, Qingqing Li, Congsong Dong

Introduction: Early prognosis prediction of acute ischemic stroke (AIS) can support clinicians in choosing personalized treatment plans. The aim of this study is to develop a machine learning (ML) model that uses multiple post-labeling delay times (multi-PLD) arterial spin labeling (ASL) radiomics features to achieve early and precise prediction of AIS prognosis.

Methods: This study enrolled 102 AIS patients admitted between December 2020 and September 2024.

An endoscopic ultrasound-based interpretable deep learning model and nomogram for distinguishing pancreatic neuroendocrine tumors from pancreatic cancer.

Sci Rep

January 2025

Gastroenterology Department, The First Affiliated Hospital of Guangxi Medical University, Nanning, China.

Nan Yi, Shuangyang Mo, Yan Zhang, Qi Jiang, Yingwei Wang

This study aimed to retrospectively develop and validate an interpretable deep learning model and nomogram that use endoscopic ultrasound (EUS) images to predict pancreatic neuroendocrine tumors (PNETs). Following confirmation by pathological examination, a retrospective analysis was performed on a cohort of 266 patients, comprising 115 individuals diagnosed with PNETs and 151 with pancreatic cancer. These patients were randomly assigned to the training or test group in a 7:3 ratio.

InterDIA: Interpretable Prediction of Drug-induced Autoimmunity through Ensemble Machine Learning Approaches.

Toxicology

January 2025

Department of Clinical Pharmacy, Jieyang People's Hospital, 522000, China.

Lina Huang, Peineng Liu, Xiaojie Huang

Drug-induced autoimmunity (DIA) is a non-IgE immune-related adverse drug reaction that poses substantial challenges in predictive toxicology due to its idiosyncratic nature, complex pathogenesis, and diverse clinical manifestations. To address these challenges, we developed InterDIA, an interpretable machine learning framework for predicting DIA toxicity based on molecular physicochemical properties. Multi-strategy feature selection and advanced ensemble resampling approaches were integrated to enhance prediction accuracy and overcome data imbalance.

COSIME: Cooperative multi-view integration and Scalable and Interpretable Model Explainer.

bioRxiv

January 2025

Jerome J Choi, Noah Cohen Kalafut, Tim Gruenloh, Corinne D Engelman, Tianyuan Lu

Single-omics approaches often provide a limited view of complex biological systems, whereas multiomics integration offers a more comprehensive understanding by combining diverse data views. However, integrating heterogeneous data types and interpreting the intricate relationships between biological features, both within and across different data views, remain a bottleneck. To address these challenges, we introduce COSIME (Cooperative Multi-view Integration and Scalable Interpretable Model Explainer).

Effectiveness of three machine learning models for prediction of daily streamflow and uncertainty assessment.

Water Res X

May 2025

Institute for Artificial Intelligence R&D of Serbia, Fruškogorska 1, Novi Sad 21000, Serbia.

Luka Vinokić, Milan Dotlić, Veljko Prodanović, Slobodan Kolaković, Slobodan P Simonovic

This study evaluates three Machine Learning (ML) models, namely Temporal Kolmogorov-Arnold Networks (TKAN), Long Short-Term Memory (LSTM), and Temporal Convolutional Networks (TCN), focusing on their capabilities to improve prediction accuracy and efficiency in streamflow forecasting. We adopt a data-centric approach, utilizing large, validated datasets to train the models, and apply SHapley Additive exPlanations (SHAP) to enhance the interpretability and reliability of the ML models. The results show that TKAN outperforms LSTM but slightly lags behind TCN in streamflow forecasting.
