Background: Machine learning (ML) approaches are a crucial component of modern data analysis in many fields, including epidemiology and medicine. Nonlinear ML methods often achieve accurate predictions, for instance, in personalized medicine, as they are capable of modeling complex relationships between features and the target. Problematically, ML models and their predictions can be biased by confounding information present in the features. To remove this spurious signal, researchers often employ featurewise linear confound regression (CR). While this is considered a standard approach for dealing with confounding, possible pitfalls of using CR in ML pipelines are not fully understood.

Results: We provide new evidence that, contrary to general expectations, linear confound regression can increase the risk of confounding when combined with nonlinear ML approaches. Using a simple framework in which the target itself serves as the confound, we show that information leaked via CR can inflate null or moderate effects to near-perfect prediction. By shuffling the features, we provide evidence that this increase is indeed due to confound-leakage rather than to information revealed by CR. We then demonstrate the danger of confound-leakage in a real-world clinical application: when depression is used as a confound, the accuracy of predicting attention-deficit/hyperactivity disorder from speech-derived features is overestimated.
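The target-as-confound setup described above can be illustrated with a small synthetic sketch. Everything in it is an assumption made for illustration and is not taken from the study: the data are simulated, the helper names (`fit_line`, `residualize`, `nearest_value_accuracy`) are hypothetical, and a one-nearest-neighbour rule stands in for the nonlinear learners. A single binary feature is drawn independently of a binary target, so there is no true signal. CR is fitted on the training split only and then applied to both splits, with the target supplied as the confound. Whenever the fitted slope is nonzero, each (feature value, target) pair maps to its own residual value, which a nonlinear learner can memorize.

```python
import random
from collections import Counter, defaultdict


def fit_line(c, x):
    """Ordinary least squares of feature x on one confound c: x ~ a + b * c."""
    n = len(c)
    mean_c = sum(c) / n
    mean_x = sum(x) / n
    cov = sum((ci - mean_c) * (xi - mean_x) for ci, xi in zip(c, x))
    var = sum((ci - mean_c) ** 2 for ci in c)
    slope = cov / var
    return mean_x - slope * mean_c, slope


def residualize(c, x, intercept, slope):
    """Subtract the fitted linear confound effect from the feature."""
    return [xi - (intercept + slope * ci) for ci, xi in zip(c, x)]


def nearest_value_accuracy(f_train, y_train, f_test, y_test):
    """1-nearest-neighbour on one feature; ties resolved by majority label."""
    votes = defaultdict(Counter)
    for f, label in zip(f_train, y_train):
        votes[f][label] += 1
    hits = 0
    for f, label in zip(f_test, y_test):
        nearest = min(votes, key=lambda v: abs(v - f))
        hits += votes[nearest].most_common(1)[0][0] == label
    return hits / len(y_test)


rng = random.Random(0)
n_train, n_test = 700, 300

# Binary feature and binary target drawn independently: no true signal.
x = [float(rng.randint(0, 1)) for _ in range(n_train + n_test)]
y = [rng.randint(0, 1) for _ in range(n_train + n_test)]
x_tr, x_te = x[:n_train], x[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

# Baseline: nonlinear learner on the raw feature.
acc_raw = nearest_value_accuracy(x_tr, y_tr, x_te, y_te)

# Linear CR with the target as confound, fitted on the training split only.
c_tr = [float(v) for v in y_tr]
c_te = [float(v) for v in y_te]
intercept, slope = fit_line(c_tr, x_tr)
r_tr = residualize(c_tr, x_tr, intercept, slope)
r_te = residualize(c_te, x_te, intercept, slope)

# Same learner on the residualized feature.
acc_cr = nearest_value_accuracy(r_tr, y_tr, r_te, y_te)

print(f"accuracy without CR: {acc_raw:.2f}")
print(f"accuracy with CR:    {acc_cr:.2f}")
```

Because the feature is generated independently of the target, the simulation plays a role similar to the feature-shuffling control: any gain after CR cannot come from real feature-target information. Note that fitting CR on the training split alone does not prevent the effect here, since the leak enters through the confound values used to residualize the test data.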

Conclusions: As shown, mishandling or even amplifying confounding effects through confound-leakage when building ML models can lead to untrustworthy, biased, and unfair predictions. Our exposition of the confound-leakage pitfall, together with the guidelines we provide for dealing with it, can help create more robust and trustworthy ML models.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10541796
DOI: http://dx.doi.org/10.1093/gigascience/giad071


