Background: Handling missing values is a crucial step in preprocessing data in machine learning. Most algorithms used for feature selection, classification, or estimation require complete datasets. Consequently, in many cases, the strategy for dealing with missing values is to use only instances with complete data or to replace missing values with a mean, mode, median, or constant value. Discarding incomplete samples or replacing missing values with such basic techniques usually introduces bias into subsequent analyses of the datasets.
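The distortion introduced by mean replacement can be illustrated with a small simulation. This is a minimal sketch, not taken from the study: the synthetic Gaussian data, the 40% missingness rate, the assumption that values are missing completely at random, and all variable names are illustrative choices. Correlation with a second variable stands in here for a simple filter-style feature score.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Synthetic data (assumption): y is strongly related to x.
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)
corr_full = np.corrcoef(x, y)[0, 1]

# Remove 40% of y completely at random (assumed missingness mechanism).
missing = rng.random(n) < 0.4
y_obs = np.where(missing, np.nan, y)

# Basic technique: replace every missing value with the observed mean.
y_mean_imputed = np.where(missing, np.nanmean(y_obs), y_obs)

# Mean imputation piles values up at a single point, so the variance
# shrinks and the correlation with x is attenuated.
corr_mean_imputed = np.corrcoef(x, y_mean_imputed)[0, 1]
var_ratio = y_mean_imputed.var() / y[~missing].var()

print(f"correlation, full data:      {corr_full:.3f}")
print(f"correlation, mean-imputed:   {corr_mean_imputed:.3f}")
print(f"variance ratio after impute: {var_ratio:.3f}")
```

In this setup the mean-imputed column has a smaller variance and a weaker correlation with x than the original data, which is one way a score-based feature ranking can be distorted.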

Aim: Demonstrate the positive impact of multivariate imputation in the feature selection process on datasets with missing values.

Results: We compared the outcomes of the feature selection process on complete datasets, on incomplete datasets with missingness rates between 5% and 50%, and on datasets imputed by basic techniques and by multivariate imputation. The feature selection algorithms used are well-known methods. The datasets imputed by multivariate imputation obtained the best feature selection results compared to datasets imputed by basic techniques and to non-imputed incomplete datasets.

Conclusions: The evaluation results indicate that applying multivariate imputation by chained equations (MICE) reduces bias in the feature selection process.
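The chained-equations idea behind MICE can be sketched in a few lines. The code below is a simplified illustration, not the authors' implementation: it performs a single deterministic imputation with linear regressions, whereas full MICE draws several imputations with added randomness and pools them. The function name `chained_regression_impute`, the synthetic correlated data, and the 20% missingness rate are all assumptions made for the example. Libraries such as scikit-learn offer a related estimator (`IterativeImputer`) for practical use.

```python
import numpy as np

def chained_regression_impute(X, n_iter=10):
    """Single-imputation sketch of the chained-equations idea (not full MICE)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means, then refine each column in turn.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any() or rows.all():
                continue  # nothing to impute, or nothing to fit on
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            # Fit a linear model for column j on the rows where it was observed.
            coef, *_ = np.linalg.lstsq(design[~rows], filled[~rows, j], rcond=None)
            # Overwrite only the originally missing entries with predictions.
            filled[rows, j] = design[rows] @ coef
    return filled

# Synthetic correlated features (assumption): 4 columns, pairwise correlation 0.7.
rng = np.random.default_rng(0)
cov = 0.7 * np.ones((4, 4)) + 0.3 * np.eye(4)
X_true = rng.multivariate_normal(np.zeros(4), cov, size=500)

# Knock out 20% of the cells in columns 1-3; column 0 stays fully observed.
mask = np.zeros(X_true.shape, dtype=bool)
mask[:, 1:] = rng.random((500, 3)) < 0.2
X_missing = X_true.copy()
X_missing[mask] = np.nan

mean_filled = np.where(mask, np.nanmean(X_missing, axis=0), X_missing)
chained = chained_regression_impute(X_missing)

def rmse(filled):
    # Error measured only on the cells that were actually missing.
    return np.sqrt(np.mean((filled[mask] - X_true[mask]) ** 2))

rmse_mean = rmse(mean_filled)
rmse_chained = rmse(chained)
print(f"RMSE, mean imputation:    {rmse_mean:.3f}")
print(f"RMSE, chained regression: {rmse_chained:.3f}")
```

Because each column is predicted from the others, the imputed values track the correlation structure of the data instead of collapsing to one constant, which is the property the comparison with basic techniques relies on.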


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8318311
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0254720

Publication Analysis

Top Keywords

feature selection: 28
multivariate imputation: 20
missing values: 16
selection process: 16
impact multivariate: 8
imputation mice: 8
datasets: 8
complete datasets: 8
imputation feature: 8
basic techniques: 8

Similar Publications

Background: To develop and validate a clinical-radiomics model for preoperative prediction of lymphovascular invasion (LVI) in rectal cancer.

Methods: This retrospective study included data from 239 patients with pathologically confirmed rectal adenocarcinoma from two centers, all of whom underwent MRI examinations. Cases from the first center (n = 189) were randomly divided into a training set and an internal validation set at a 7:3 ratio, while cases from the second center (n = 50) constituted the external validation set.


Summary: With the increased reliance on multi-omics data for bulk and single cell analyses, the availability of robust approaches to perform unsupervised learning for clustering, visualization, and feature selection is imperative. We introduce nipalsMCIA, an implementation of multiple co-inertia analysis (MCIA) for joint dimensionality reduction that solves the objective function using an extension to Non-linear Iterative Partial Least Squares (NIPALS). We applied nipalsMCIA to both bulk and single cell datasets and observed significant speed-up over other implementations for data with a large sample size and/or feature dimension.


A prediction study on the occurrence risk of heart disease in older hypertensive patients based on machine learning.

BMC Geriatr

January 2025

Department of Cardiology, The Second Hospital & Clinical Medical School, Lanzhou University, No. 82 Cuiyingmen, Lanzhou, 730000, China.

Objective: To construct a predictive model for the occurrence of heart disease in elderly hypertensive individuals, with the aim of providing early risk identification.

Methods: A total of 934 participants aged 60 and above from the China Health and Retirement Longitudinal Study with a 7-year follow-up (2011-2018) were included. Machine learning methods (logistic regression, XGBoost, DNN) were employed to build a model predicting heart disease risk in hypertensive patients.


Knee osteoarthritis (KOA) represents a progressive degenerative disorder characterized by the gradual erosion of articular cartilage. This study aimed to develop and validate biomarker-based predictive models for KOA diagnosis using machine learning techniques. Clinical data from 2594 samples were obtained and stratified into training and validation datasets in a 7:3 ratio.


The growing number of connected devices in smart home environments has amplified security risks, particularly from Man-in-the-Middle (MitM) attacks. These attacks allow cybercriminals to intercept and manipulate communication streams between devices, often remaining undetected. Traditional rule-based methods struggle to cope with the complexity of these attacks, creating a need for more advanced, adaptive intrusion detection systems.

