AI Article Synopsis

  • The recent release of large healthcare datasets has accelerated research on deep learning models, but raises concerns about interpretability, fairness, and bias, especially when human lives are at stake.
  • The study uses the MIMIC-IV dataset to analyze in-hospital mortality prediction models, examining model interpretability, dataset representation bias, and prediction fairness, with particular attention to demographic features.
  • Key findings indicate that while the best-performing interpretability method highlights the features most important for predictions, it also reveals that models rely on demographic factors, leading to disparate treatment and unfair predictions across patient groups defined by ethnicity, gender, and age (a quantification sketch follows below).
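
The bullet above mentions disparate treatment across demographic groups. Below is a minimal sketch of how such group-level disparities could be quantified from a classifier's outputs; the column names and toy records are illustrative assumptions, not data or code from the study.

```python
# Minimal sketch (not the study's code): quantifying group-level disparities
# in a binary in-hospital mortality classifier from its predictions.
# Column names and toy records below are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "ethnicity": ["White", "White", "Black", "Black", "Asian", "Asian"],
    "died":      [0, 1, 0, 1, 1, 0],   # ground-truth in-hospital mortality
    "predicted": [0, 1, 1, 0, 1, 1],   # binarized model output
})

for group, sub in df.groupby("ethnicity"):
    # Positive prediction rate per group (demographic parity view).
    ppr = sub["predicted"].mean()
    # True positive rate per group (equal opportunity view).
    positives = sub[sub["died"] == 1]
    tpr = (positives["predicted"] == 1).mean() if len(positives) else float("nan")
    print(f"{group}: positive prediction rate = {ppr:.2f}, TPR = {tpr:.2f}")
```

Large gaps in these per-group rates are the kind of disparity the study describes as disparate treatment.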

Article Abstract

The recent release of large-scale healthcare datasets has greatly propelled research on data-driven deep learning models for healthcare applications. However, due to the nature of such deep black-box models, concerns about interpretability, fairness, and bias in healthcare scenarios where human lives are at stake call for a careful and thorough examination of both datasets and models. In this work, we focus on MIMIC-IV (Medical Information Mart for Intensive Care, version IV), the largest publicly available healthcare dataset, and conduct comprehensive analyses of interpretability as well as dataset representation bias and prediction fairness of deep learning models for in-hospital mortality prediction. First, we analyze the interpretability of deep learning mortality prediction models and observe that (1) the best-performing interpretability method successfully identifies critical features for mortality prediction across various prediction models and also recognizes new important features that domain knowledge does not consider; (2) prediction models rely on demographic features, raising fairness concerns. Therefore, we then evaluate the fairness of the models and indeed observe unfairness: (1) there exists disparate treatment in prescribing mechanical ventilation among patient groups across ethnicity, gender, and age; (2) models often rely on racial attributes unequally across subgroups to generate their predictions. We further draw concrete connections between interpretability methods and fairness metrics by showing how feature importance from interpretability methods can help quantify potential disparities in mortality predictors. Our analysis demonstrates that prediction performance is not the only factor to consider when evaluating models for healthcare applications, since high prediction performance might be the result of unfair utilization of demographic features. Our findings suggest that future research on AI models for healthcare applications can benefit from adopting this interpretability and fairness analysis workflow, as well as from verifying whether models achieve superior performance at the cost of introducing bias.
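
The abstract connects interpretability to fairness by aggregating feature importance per subgroup. A minimal, hypothetical sketch of that idea follows: given per-patient attribution scores for a demographic feature (produced by any interpretability method), compare their average magnitude across subgroups; the subgroup labels and scores here are simulated, not taken from the paper or MIMIC-IV.

```python
# Minimal, hypothetical sketch of the idea: compare the magnitude of a
# demographic feature's attribution scores across patient subgroups.
# Scores here are simulated, not from the paper or MIMIC-IV.
import numpy as np

rng = np.random.default_rng(0)
n_patients = 1000

# Hypothetical subgroup labels and per-patient attribution scores that any
# interpretability method (e.g. a gradient-based attribution) might produce
# for a single demographic input feature.
subgroup = rng.choice(["group_a", "group_b"], size=n_patients)
attribution = rng.normal(loc=0.05, scale=0.02, size=n_patients)
attribution[subgroup == "group_b"] += 0.03  # simulate heavier reliance for one group

for g in ("group_a", "group_b"):
    mean_abs = np.abs(attribution[subgroup == g]).mean()
    print(f"{g}: mean |attribution| on demographic feature = {mean_abs:.3f}")
# A large gap between subgroups suggests the model leans on the demographic
# attribute unequally, which the paper flags as a fairness concern.
```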

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9065125
DOI: http://dx.doi.org/10.1038/s41598-022-11012-2

Publication Analysis

Top Keywords

deep learning (16)
models (13)
interpretability fairness (12)
learning models (12)
models healthcare (12)
healthcare applications (12)
mortality prediction (12)
prediction models (12)
interpretability (8)
prediction (8)
