A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data.

J Biomed Inform

School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China.

Published: July 2020

Imbalanced data classification is a common problem in medical diagnosis. Traditional classification algorithms usually assume that each class has a similar number of samples and that all misclassifications cost the same during training. However, misclassifying a patient sample costs more than misclassifying a healthy sample. Improving the identification of patients without degrading the classification of healthy individuals is therefore an urgent problem. To address imbalanced data classification in medical diagnosis, we propose a hybrid sampling algorithm called RFMSE, which combines the Misclassification-oriented Synthetic Minority Over-sampling Technique (M-SMOTE) and Edited Nearest Neighbor (ENN) based on Random Forest (RF). The algorithm has three main parts. First, M-SMOTE increases the number of minority-class samples, with its over-sampling rate set to the misclassification rate of RF. Then, ENN removes noisy samples from the majority class. Finally, RF performs classification prediction on the samples after hybrid sampling, and the stopping criterion for the iterations is determined by changes in a classification index, the Matthews Correlation Coefficient (MCC): when the value of MCC drops continuously, the iterations stop. Extensive experiments on ten UCI datasets demonstrate that RFMSE can effectively solve the problem of imbalanced data classification. Compared with traditional algorithms, our method improves F-value and MCC more effectively.
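
The ENN cleaning step and the MCC-based stopping rule described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mcc`, `enn_clean_majority`, `should_stop`), the neighbor count `k=3`, and the `patience=2` reading of "drops continuously" are all hypothetical choices. The M-SMOTE and RF steps are left out because the abstract does not describe them in enough detail.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0.0 when the denominator is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def enn_clean_majority(X, y, majority_label, k=3):
    """Edited nearest neighbor restricted to the majority class.

    A majority sample is kept only if a strict majority of its k nearest
    neighbors share its label. Minority samples are always kept.
    (k=3 is an assumed default, not a value taken from the paper.)
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)
            continue
        # Squared Euclidean distance to every other sample, nearest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(X)
            if j != i
        )
        neighbors = [y[j] for _, j in dists[:k]]
        agree = sum(1 for label in neighbors if label == yi)
        if agree * 2 > len(neighbors):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def should_stop(mcc_history, patience=2):
    """Stop once MCC has dropped for `patience` consecutive iterations.

    (`patience` is an assumed interpretation of "drops continuously".)
    """
    if len(mcc_history) <= patience:
        return False
    recent = mcc_history[-(patience + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))
```

In a loop of this kind, each iteration would resample the training data, retrain the classifier, append the new MCC to `mcc_history`, and exit when `should_stop(mcc_history)` returns true. MCC suits imbalanced data as a stopping signal because it uses all four confusion-matrix cells, so it does not reward a classifier for simply predicting the majority class.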

Source: http://dx.doi.org/10.1016/j.jbi.2020.103465

Publication Analysis

Top Keywords

imbalanced data: 16
hybrid sampling: 12
problem imbalanced: 12
data classification: 12
sampling algorithm: 8
enn based: 8
based random: 8
random forest: 8
medical diagnosis: 8
number samples: 8

Similar Publications

Marine pollution due to oil spills presents major risks to coastal areas and aquatic life, leading to serious environmental health concerns. Oil spill detection using SAR data has moved from traditional segmentation to a variety of machine learning and deep learning models, with UNET proving efficient for the task. This research paper proposes a GSCAT-UNET model for efficient oil spill detection and discrimination from lookalikes.

In the Imbalanced Multivariate Time Series Classification (ImMTSC) task, minority-class instances typically correspond to critical events, such as system faults in power grids or abnormal health occurrences in medical monitoring. Despite being rare and random, these events are highly significant. The dynamic spatial-temporal relationships between minority-class instances and other instances make them more prone to interference from neighboring instances during classification.

Machine Learning-Based Alzheimer's Disease Stage Diagnosis Utilizing Blood Gene Expression and Clinical Data: A Comparative Investigation.

Diagnostics (Basel)

January 2025

Department of Computer Science and Engineering, Faculty of Engineering and Technology, Technology Campus (Peenya Campus), Ramaiah University of Applied Sciences, Bengaluru 560058, India.

This study presents a comparative analysis of the multistage diagnosis of Alzheimer's disease (AD), including mild cognitive impairment (MCI), utilizing two distinct types of biomarkers: blood gene expression and clinical biomarker samples. Both of these samples, obtained from participants in the Alzheimer's Disease Neuroimaging Initiative (ADNI), were independently analyzed utilizing machine learning (ML)-based multiclassifiers. This study applied novel machine learning-based data augmentation techniques to gene expression profile data that are high-dimensional, low-sample-size (HDLSS) and inherently highly imbalanced.

The availability of drugs for stable COPD treatment in China: a cross-sectional survey.

NPJ Prim Care Respir Med

January 2025

Department of Pulmonary and Critical Care Medicine, West China Hospital, Sichuan University, Chengdu, 610041, China.

This survey aimed to investigate the availability of drugs for stable chronic obstructive pulmonary disease (COPD) treatment in Chinese hospitals and to determine whether drug availability significantly varied among hospitals with different characteristics. A well-constructed questionnaire was designed according to the Chinese Guidelines for the Diagnosis and Management of COPD (revised version 2021). Both inhaled drugs (monotherapy, double therapy and triple therapy) and oral drugs (expectorants, theophylline, antibiotics, and bacterial lysates) were included in this survey.

Unveiling diabetes onset: Optimized XGBoost with Bayesian optimization for enhanced prediction.

PLoS One

January 2025

Department of Computer Science and Information Systems, College of Applied Sciences, AlMaarefa University, Ad Diriyah, Riyadh, Kingdom of Saudi Arabia.

Diabetes, a chronic condition affecting millions worldwide, necessitates early intervention to prevent severe complications. While accurately predicting diabetes onset or progression remains challenging due to complex and imbalanced datasets, recent advancements in machine learning offer potential solutions. Traditional prediction models, often limited by default parameters, have been superseded by more sophisticated approaches.
