Background: Machine learning (ML) approaches are a crucial component of modern data analysis in many fields, including epidemiology and medicine. Nonlinear ML methods often achieve accurate predictions, for instance, in personalized medicine, as they are capable of modeling complex relationships between features and the target. Problematically, ML models and their predictions can be biased by confounding information present in the features. To remove this spurious signal, researchers often employ featurewise linear confound regression (CR). While this is considered a standard approach for dealing with confounding, possible pitfalls of using CR in ML pipelines are not fully understood.
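To make the term concrete, featurewise linear CR can be sketched as follows: each feature is regressed on the confound by ordinary least squares, and only the residuals are kept. This is a minimal illustration in plain Python, not the authors' implementation; the function names and the toy numbers are assumptions introduced here.

```python
from statistics import mean


def fit_confound_regression(feature, confound):
    """Fit feature ~ intercept + slope * confound by ordinary least squares."""
    c_mean, f_mean = mean(confound), mean(feature)
    var_c = sum((c - c_mean) ** 2 for c in confound)
    cov = sum((c - c_mean) * (f - f_mean) for c, f in zip(confound, feature))
    slope = cov / var_c if var_c else 0.0
    intercept = f_mean - slope * c_mean
    return intercept, slope


def remove_confound(feature, confound, intercept, slope):
    """Subtract the fitted confound contribution, leaving the residuals."""
    return [f - (intercept + slope * c) for f, c in zip(feature, confound)]


def confound_regress(features, confound):
    """Apply the regression separately to every feature column (featurewise)."""
    cleaned = []
    for column in features:
        intercept, slope = fit_confound_regression(column, confound)
        cleaned.append(remove_confound(column, confound, intercept, slope))
    return cleaned


# Hypothetical toy data: one confound (e.g. age) and two feature columns.
confound = [20.0, 30.0, 40.0, 50.0, 60.0]
features = [
    [1.0, 1.6, 1.9, 2.6, 2.9],  # strongly related to the confound
    [5.0, 3.0, 4.0, 6.0, 2.0],  # weakly related to the confound
]
residuals = confound_regress(features, confound)
```

By construction, the residuals of each column are linearly uncorrelated with the confound in the data used for fitting. That guarantee covers linear dependence only, which is why combining CR with nonlinear learners deserves scrutiny.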
Results: We provide new evidence that, contrary to general expectations, linear confound regression can increase the risk of confounding when combined with nonlinear ML approaches. Using a simple framework that treats the target as a confound, we show that information leaked via CR can inflate null or moderate effects to near-perfect prediction. By shuffling the features, we provide evidence that this increase is indeed due to confound-leakage and not due to information being revealed. We then demonstrate the danger of confound-leakage in a real-world clinical application, where the accuracy of predicting attention-deficit/hyperactivity disorder from speech-derived features is overestimated when depression is used as a confound.
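One way such leakage can arise is sketched below on a toy problem. It is an illustration under stated assumptions, not the authors' experimental setup: the feature is binary and constructed to carry almost no information about a binary target, the target itself serves as the confound, CR is fitted on the training split and applied to the test split, and a one-nearest-neighbour classifier stands in for a nonlinear learner. All names and counts are hypothetical.

```python
def make_data(n_per_class, ones_class0, ones_class1):
    """Binary feature x and binary target y with fixed counts of x == 1 per class."""
    x, y = [], []
    for label, ones in ((0, ones_class0), (1, ones_class1)):
        x += [1.0] * ones + [0.0] * (n_per_class - ones)
        y += [label] * n_per_class
    return x, y


def fit_ols(feature, confound):
    """Ordinary least squares fit of feature ~ intercept + slope * confound."""
    n = len(feature)
    c_mean = sum(confound) / n
    f_mean = sum(feature) / n
    var_c = sum((c - c_mean) ** 2 for c in confound)
    cov = sum((c - c_mean) * (f - f_mean) for c, f in zip(confound, feature))
    slope = cov / var_c
    return f_mean - slope * c_mean, slope


def residualize(feature, confound, intercept, slope):
    """Remove the fitted confound contribution from the feature."""
    return [f - (intercept + slope * c) for f, c in zip(feature, confound)]


def one_nn_accuracy(train_x, train_y, test_x, test_y):
    """1-nearest-neighbour accuracy; ties go to the first training sample."""
    correct = 0
    for x, y in zip(test_x, test_y):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        correct += train_y[nearest] == y
    return correct / len(test_y)


# The feature is almost independent of the target in both splits.
train_x, train_y = make_data(100, ones_class0=50, ones_class1=52)
test_x, test_y = make_data(100, ones_class0=51, ones_class1=50)

# Without CR the raw feature predicts the target at chance level.
baseline_acc = one_nn_accuracy(train_x, train_y, test_x, test_y)

# CR with the target as confound: fit on train, apply to both splits.
intercept, slope = fit_ols(train_x, train_y)
train_res = residualize(train_x, train_y, intercept, slope)
test_res = residualize(test_x, test_y, intercept, slope)
leaked_acc = one_nn_accuracy(train_res, train_y, test_res, test_y)

print(baseline_acc)  # 0.5
print(leaked_acc)    # 1.0
```

In this sketch the small fitted slope shifts the residuals of the two classes by slightly different amounts, so each residual value occurs in exactly one class. A learner that can memorise those values recovers the target perfectly even though the original feature was uninformative.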
Conclusions: Mishandling or even amplifying confounding effects when building ML models due to confound-leakage, as shown, can lead to untrustworthy, biased, and unfair predictions. Our exposé of the confound-leakage pitfall and the guidelines we provide for dealing with it can help create more robust and trustworthy ML models.
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10541796 | PMC |
http://dx.doi.org/10.1093/gigascience/giad071 | DOI Listing |