Advanced methods for missing values imputation based on similarity learning.

PeerJ Comput Sci

Faculty of Computers and Artificial Intelligence, Benha University, Benha, Qaliobia, Egypt.

Published: July 2021

Real-world data analysis with data mining techniques often encounters observations that contain missing values, and handling them is a main challenge of mining such datasets. Missing values should be imputed to improve the accuracy and performance of data mining methods. Some existing techniques use the k-nearest neighbors algorithm (kNN) to impute missing values, but determining an appropriate k value can be challenging. Other existing imputation techniques are based on hard clustering algorithms. When records are not well separated, as in the case of missing data, hard clustering often describes the data poorly. In general, imputation based on similar records is more accurate than imputation based on all of the dataset's records, so improving the similarity among records can improve imputation performance. This paper proposes two methods for imputing numerical missing data. The first, called KI, is a hybrid method that combines k-nearest neighbors and iterative imputation algorithms. The kNN algorithm uses record similarity to find the best set of nearest neighbors for each record with missing values. To improve the similarity, a suitable k value is estimated automatically for the kNN. The iterative imputation method then imputes the missing values of the incomplete records using the global correlation structure among the selected records. The second, called FCKI, is an enhanced hybrid method that extends KI. It integrates fuzzy c-means, k-nearest neighbors, and iterative imputation algorithms to impute the missing data in a dataset. The fuzzy c-means algorithm is selected because it allows a record to belong to multiple clusters at the same time, which can further improve similarity.
FCKI searches a cluster, instead of the whole dataset, to find the best k-nearest neighbors. It applies two levels of similarity to achieve higher imputation accuracy. The performance of the proposed imputation techniques is assessed on fifteen datasets of different sizes, with varying missing ratios, for three types of missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). These missing data types are generated in this work. The proposed imputation techniques are compared with other missing data imputation methods using three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than the other missing data imputation methods.
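
As a rough illustration of the evaluation described above, the Python sketch below hides values completely at random (MCAR), imputes each incomplete record from its k nearest complete records, and scores the result with RMSE, NRMSE, and MAE. It is not the authors' KI or FCKI implementation: k is fixed rather than estimated automatically, a single least-squares fit on the neighbors stands in for the iterative imputation step, fuzzy c-means clustering is omitted, and NRMSE is assumed to be RMSE divided by the range of the true values, which the abstract does not specify. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def make_mcar(X, ratio, rng):
    """Hide roughly a fraction `ratio` of entries completely at random (MCAR)."""
    mask = rng.random(X.shape) < ratio
    X_miss = X.copy()
    X_miss[mask] = np.nan
    return X_miss, mask


def knn_local_impute(X_miss, k=15):
    """Impute each incomplete record from its k nearest complete records.

    Assumption: a single least-squares fit on the neighbors replaces the
    iterative imputation step, and k is fixed instead of estimated.
    Requires at least one complete record in X_miss.
    """
    X_out = X_miss.copy()
    incomplete = np.isnan(X_miss).any(axis=1)
    complete = X_miss[~incomplete]
    for i in np.where(incomplete)[0]:
        row = X_miss[i]
        miss = np.isnan(row)
        obs = ~miss
        if not obs.any():
            # Nothing observed in this record: fall back to column means.
            X_out[i] = complete.mean(axis=0)
            continue
        # Similarity: Euclidean distance over the observed features only.
        dist = np.sqrt(((complete[:, obs] - row[obs]) ** 2).sum(axis=1))
        nbrs = complete[np.argsort(dist)[:k]]
        # Regress the missing features on the observed ones (with intercept).
        A = np.column_stack([np.ones(len(nbrs)), nbrs[:, obs]])
        coef, *_ = np.linalg.lstsq(A, nbrs[:, miss], rcond=None)
        X_out[i, miss] = np.concatenate([[1.0], row[obs]]) @ coef
    return X_out


def rmse(true, pred):
    return float(np.sqrt(np.mean((true - pred) ** 2)))


def mae(true, pred):
    return float(np.mean(np.abs(true - pred)))


def nrmse(true, pred):
    # Assumption: normalize by the range of the true values.
    return rmse(true, pred) / float(true.max() - true.min())


# Synthetic data: four strongly correlated numeric features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
X = np.hstack([latent + 0.1 * rng.normal(size=(300, 1)) for _ in range(4)])

X_miss, mask = make_mcar(X, ratio=0.1, rng=rng)
X_knn = knn_local_impute(X_miss, k=15)

# Baseline for comparison: fill each gap with the column mean.
X_mean = np.where(mask, np.nanmean(X_miss, axis=0), X)

for name, X_hat in [("kNN-local", X_knn), ("column mean", X_mean)]:
    t, p = X[mask], X_hat[mask]
    print(f"{name}: RMSE={rmse(t, p):.3f} NRMSE={nrmse(t, p):.3f} MAE={mae(t, p):.3f}")
```

The errors are computed only over the entries that were hidden, since the observed entries are left unchanged by both imputers. Generating MAR or MNAR masks would require making the probability of hiding a value depend on other observed features or on the value itself, which this sketch does not attempt.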

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8323724
DOI: http://dx.doi.org/10.7717/peerj-cs.619

