AI Article Synopsis

  • Supervised machine learning models are used to predict diseases but face challenges with class imbalance in training, prompting the use of a conditional normalizing flow model for better predictions.
  • This study used health records from 706 South Korean individuals labeled for six chronic diseases, with particular attention to classifying diabetes, which had a low occurrence rate (about 2%).
  • For diabetes, the conditional normalizing flow model reached AUPRC 0.34 versus 0.16 for LightGBM, the best supervised model, at a comparable AUROC (0.83 versus 0.82). It also performed better than the supervised model on the other five chronic diseases when positive cases were undersampled, indicating that it can help address class imbalance in medical data.

Article Abstract

Background: Supervised machine learning models have been widely used to predict and gain insight into diseases by classifying patients based on personal health records. However, class imbalance is an obstacle that disrupts the training of these models. In this study, we aimed to address class imbalance with a conditional normalizing flow model, a deep-learning-based semi-supervised model for anomaly detection. This is the first introduction of the normalizing flow algorithm for tabular biomedical data.

Methods: We collected personal health records from South Korean citizens (n = 706), featuring genetic data obtained from direct-to-customer service (microarray chip), medical health check-ups, and lifestyle log data. Based on the health check-up data, six chronic diseases were labeled (obesity, diabetes, hypertriglyceridemia, dyslipidemia, liver dysfunction, and hypertension). After preprocessing, supervised classification models and semi-supervised anomaly detection models, including conditional normalizing flow, were evaluated for the classification of diabetes, which had extreme target imbalance (about 2%), based on AUROC and AUPRC. In addition, we evaluated their performance under the assumption of insufficient collection for patients with other chronic diseases by undersampling disease-affected samples.
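The positive undersampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `undersample_positives`, the seed, and the count of 200 positives in the toy cohort are all assumptions, and the abstract does not say how or on which data split the undersampling was applied.

```python
import random


def undersample_positives(labels, target_base_rate, seed=0):
    """Keep every negative and a random subset of positives so that
    positives make up roughly `target_base_rate` of the returned indices.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < target_base_rate < 1.0:
        raise ValueError("target_base_rate must be in (0, 1)")
    neg = [i for i, y in enumerate(labels) if y == 0]
    pos = [i for i, y in enumerate(labels) if y == 1]
    # Solve k / (k + n_neg) = r for k, the number of positives to keep.
    k = round(target_base_rate * len(neg) / (1.0 - target_base_rate))
    k = max(1, min(k, len(pos)))  # keep at least one, never more than exist
    rng = random.Random(seed)
    kept = rng.sample(pos, k)
    return sorted(neg + kept)


# Toy cohort: 706 people, of whom 200 are assumed positive (made-up number).
labels = [1] * 200 + [0] * 506
idx = undersample_positives(labels, target_base_rate=0.02)
sub = [labels[i] for i in idx]
print(len(sub), sum(sub), round(sum(sub) / len(sub), 3))
```

Only positives are dropped, so the negative class keeps all of its samples while the base rate falls to about the 0.02 level reported for diabetes.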

Results: Across fifty evaluations of diabetes classification, where the base rate was very low (0.02), LightGBM (the best-performing supervised classification model) achieved AUPRC 0.16 and AUROC 0.82, while conditional normalizing flow achieved AUPRC 0.34 and AUROC 0.83. Moreover, conditional normalizing flow performed better than the supervised model when few disease-affected samples were available for the other five chronic diseases: obesity, hypertriglyceridemia, dyslipidemia, liver dysfunction, and hypertension. For example, when predicting obesity after undersampling disease-affected samples (positive undersampling) to lower the base rate to 0.02, LightGBM achieved AUPRC 0.20 and AUROC 0.75, while conditional normalizing flow achieved AUPRC 0.30 and AUROC 0.74.
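To make the two metrics concrete, here is a minimal pure-Python sketch of AUROC and of average precision, one common estimator of AUPRC. The function names and toy scores are assumptions for illustration, the abstract does not state which AUPRC estimator was used, and the average-precision function assumes no tied scores. Because a classifier that scores at random has an expected AUPRC close to the base rate, the last lines also express the reported AUPRC values as multiples of the 0.02 base rate.

```python
def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive,
    with samples ranked by descending score (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data (made up): 2 positives among 10 samples.
y = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
s = [0.10, 0.40, 0.35, 0.20, 0.80, 0.30, 0.05, 0.15, 0.25, 0.12]
print(auroc(y, s), average_precision(y, s))

# Reported AUPRC values as multiples of the 0.02 base rate.
base_rate = 0.02
for name, auprc in [("LightGBM", 0.16), ("conditional normalizing flow", 0.34)]:
    print(name, round(auprc / base_rate, 1))
```

Read this way, the near-identical AUROC values (0.82 and 0.83) hide a gap that AUPRC exposes: roughly 8 times the base rate for LightGBM versus roughly 17 times for conditional normalizing flow.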

Conclusions: Our research suggests the utility of conditional normalizing flow, particularly when the available cases are limited, for predicting chronic diseases using personal health records. This approach offers an effective solution to deal with sparse data and extreme class imbalances commonly encountered in the biomedical context.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11127363
DOI: http://dx.doi.org/10.1186/s13040-024-00366-0
