Beyond XGBoost and SHAP: Unveiling true feature importance.

J Hazard Mater

Faculty of Data Science, Musashino University, 3-3-3 Ariake Koto-ku, Tokyo 135-8181, Japan.

Published: January 2025

This paper outlines key machine learning principles, focusing on the use of XGBoost and SHAP values to assist researchers in avoiding analytical pitfalls. XGBoost builds models by incrementally adding decision trees, each addressing the errors of the previous one, which can result in inflated feature importance scores due to the method's emphasis on misclassified examples. While SHAP values provide a theoretically robust way to interpret predictions, their dependence on model structure and feature interactions can introduce biases. The lack of ground truth values complicates model evaluation, as biased feature importance can obscure real relationships with target variables. Ground truth values, representing the actual labels used in model training and validation, are crucial for improving predictive accuracy, serving as benchmarks for comparing model outcomes to true results. However, they do not ensure real associations between features and targets. Instead, they help gauge the model's effectiveness in achieving high accuracy. This paper underscores the necessity for researchers to recognize biases in feature importance and model evaluation, advocating for the use of rigorous statistical methods to enhance the reliability of analyses in machine learning research.
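The contrast the paper draws, between a model's built-in importance scores and more rigorous statistical checks, can be illustrated with a short sketch. This is an illustrative example, not the paper's own experiment: it uses scikit-learn's gradient boosting as a stand-in for XGBoost, and permutation importance on held-out data as one example of the kind of statistical validation the paper advocates. The dataset, feature counts, and parameters are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 informative features plus 5 pure-noise features.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_redundant=0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Impurity-based importances (analogous to XGBoost's gain-based scores)
# are computed from the training process itself, so features the boosting
# procedure leaned on can receive inflated scores.
builtin = model.feature_importances_

# Permutation importance on held-out data: shuffle one feature at a time
# and measure the drop in test accuracy. Noise features should score near
# zero here even if the built-in scores rank them highly.
perm = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

for i in range(X.shape[1]):
    print(f"feature {i}: builtin={builtin[i]:.3f} "
          f"permutation={perm.importances_mean[i]:.3f}")
```

Comparing the two columns for the noise features shows the gap the paper warns about: a high built-in score with a near-zero permutation score suggests the importance reflects the model's internal structure rather than a real association with the target.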


Source: http://dx.doi.org/10.1016/j.jhazmat.2025.137382


Similar Publications

Background: Acute Stanford Type A aortic dissection (AAD-type A) and acute myocardial infarction (AMI) present with similar symptoms but require distinct treatments. Efficient differentiation is critical due to limited access to radiological equipment in many primary healthcare settings. This study develops a multimodal deep learning model integrating electrocardiogram (ECG) signals and laboratory indicators to enhance diagnostic accuracy for AAD-type A and AMI.


Background: Skip lymph node metastasis (SLNM) in papillary thyroid cancer (PTC) involves cancer cells bypassing central nodes to directly metastasize to lateral nodes, often undetected by standard preoperative ultrasonography. Although multiple models exist to identify SLNM, they are inadequate for clinically node-negative (cN0) patients, resulting in underestimated metastatic risks and compromised treatment effectiveness. Our study aims to develop and validate a machine learning (ML) model that combines elastography radiomics with clinicopathological data to predict pre-surgical SLNM risk in cN0 PTC patients with increased risk of lymph node metastasis (LNM), improving their treatment strategies.


Background: Spontaneous intracerebral hemorrhage (SICH) is the second most common cause of cerebrovascular disease after ischemic stroke, with high mortality and disability rates, imposing a significant economic burden on families and society. This retrospective study aimed to develop and evaluate an interpretable machine learning model to predict functional outcomes 3 months after SICH.

Methods: A retrospective analysis was conducted on clinical data from 380 patients with SICH who were hospitalized at three different centers between June 2020 and June 2023.

