AI Article Synopsis

  • The study compares how well four statistical and machine learning models classify chronic kidney disease (CKD) stages, using the health examination cohort database of the National Health Insurance Service in Korea.
  • The models are multinomial logistic regression, ordinal logistic regression, random forest, and autoencoder; CKD stages are defined from the estimated glomerular filtration rate (GFR).
  • The autoencoder is reported as the best model across accuracy, sensitivity, specificity, precision, and F1-measure, which matters because accuracy alone can mislead on highly imbalanced data.
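
The point about imbalanced data can be illustrated with a minimal sketch. The toy labels, the function names, and the choice of an unweighted (macro) average are assumptions made here for illustration; they are not taken from the study. The sketch computes one-vs-rest metrics per class and shows that a model predicting only the majority class still scores high on accuracy.

```python
from typing import Dict, List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of cases whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(
    y_true: List[int], y_pred: List[int], labels: List[int]
) -> Dict[int, Dict[str, float]]:
    """One-vs-rest sensitivity, specificity, precision and F1 for each class."""
    n = len(y_true)
    out: Dict[int, Dict[str, float]] = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        tn = n - tp - fp - fn
        # Undefined ratios (zero denominator) are reported as 0.0 here.
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
        out[c] = {
            "sensitivity": sens,
            "specificity": spec,
            "precision": prec,
            "f1": f1,
        }
    return out


def macro_average(metrics: Dict[int, Dict[str, float]], key: str) -> float:
    """Unweighted mean of one metric over all classes."""
    return sum(m[key] for m in metrics.values()) / len(metrics)


# Hypothetical toy data: 95 majority-class cases, 5 minority-class cases,
# and a "model" that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

metrics = per_class_metrics(y_true, y_pred, labels=[0, 1])
print("accuracy:", accuracy(y_true, y_pred))
print("minority sensitivity:", metrics[1]["sensitivity"])
print("macro sensitivity:", macro_average(metrics, "sensitivity"))
```

In this toy case the minority class is never detected, yet accuracy stays high. Per-class metrics and their average expose the failure that accuracy hides.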

Article Abstract

This study aims to compare the classification performance of statistical models on highly imbalanced kidney data. The health examination cohort database provided by the National Health Insurance Service in Korea is used to build models with various machine learning methods. The glomerular filtration rate (GFR) is used to diagnose chronic kidney disease (CKD). It is estimated using the Modification of Diet in Renal Disease (MDRD) method and classified into five stages (1, 2, 3, 4, and 5), with stage 3 split into 3A and 3B, so the CKD stages based on the estimated GFR form six classes of the response variable. This study uses two representative generalized linear models for classification, namely, multinomial logistic regression (multinomial LR) and ordinal logistic regression (ordinal LR), as well as two machine learning models, namely, random forest (RF) and autoencoder (AE). The classification performance of the four models is compared in terms of accuracy, sensitivity, specificity, precision, and F1-measure. To find the model that classifies CKD stages best, the data are split into 10 folds that preserve the proportion of each CKD stage. Results indicate that RF and AE achieve higher accuracy than the multinomial and ordinal LR models when classifying the response variable. However, on a highly imbalanced dataset, accuracy can misrepresent actual performance, because it remains high even when a model assigns minority-class cases to a majority class. To address this problem in interpreting performance, we consider not only accuracy from the confusion matrix but also sensitivity, specificity, precision, and F1-measure for each class. To present classification performance with a single value for each model, we calculate the macro-average and micro-weighted values for each model. We conclude that AE is the best model for classifying CKD stages across all performance indices.
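
As a rough illustration of how an estimated GFR maps to the six response classes, the sketch below uses the widely published four-variable MDRD equation (IDMS-traceable form with coefficient 175) and the conventional GFR cut-offs of 90, 60, 45, 30, and 15 mL/min/1.73 m². Both are assumptions: the abstract does not state which MDRD variant or which thresholds the authors applied. The function names are hypothetical, and the race coefficient of the original equation is omitted.

```python
def egfr_mdrd(scr_mg_dl: float, age_years: float, female: bool) -> float:
    """Estimated GFR in mL/min/1.73 m^2 from serum creatinine (mg/dL).

    Assumed form: 175 * Scr^-1.154 * age^-0.203 * 0.742 (if female).
    The study may have used a different variant of the MDRD equation.
    """
    if scr_mg_dl <= 0 or age_years <= 0:
        raise ValueError("creatinine and age must be positive")
    egfr = 175.0 * scr_mg_dl ** -1.154 * age_years ** -0.203
    if female:
        egfr *= 0.742
    return egfr


def ckd_stage(egfr: float) -> str:
    """Map an eGFR value to one of six classes: 1, 2, 3A, 3B, 4, 5.

    Cut-offs are the conventional GFR categories (assumed, not from the abstract).
    """
    if egfr >= 90:
        return "1"
    if egfr >= 60:
        return "2"
    if egfr >= 45:
        return "3A"
    if egfr >= 30:
        return "3B"
    if egfr >= 15:
        return "4"
    return "5"


# Hypothetical example values, not data from the cohort.
value = egfr_mdrd(scr_mg_dl=1.0, age_years=50, female=False)
print(round(value, 1), ckd_stage(value))
```

Clinical staging normally also considers how long the reduced GFR has persisted and whether there are markers of kidney damage. This sketch covers only the mapping from eGFR to class label described in the abstract.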


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7345590
DOI: http://dx.doi.org/10.3390/diagnostics10060415

Publication Analysis

Top Keywords

machine learning: 12
highly imbalanced: 12
classification performance: 12
ckd stages: 12
statistical models: 8
models machine: 8
learning methods: 8
kidney data: 8
performance: 8
response variable: 8
