Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction.

BMC Med Res Methodol

Department of Biostatistics, Key Laboratory on Public Health Safety of the Ministry of Education, School of Public Health, Fudan University, Shanghai, China.

Published: July 2020

Background: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data, especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it remains unclear how they perform for non-normally distributed data or when there are non-linear relationships or interactions.

Methods: To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performances of the RF-based imputation methods missForest and CALIBERrfimpute were evaluated in comparison with predictive mean matching (PMM).
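
To make the term "outcome-dependent MAR" concrete, the following is a minimal sketch, not the authors' simulation design. It assumes a log-normal covariate, a linear outcome model, and a logistic missingness model; the function name, coefficients, and sample size are all hypothetical. The covariate x is masked with a probability that depends only on the fully observed outcome y.

```python
import math
import random


def simulate_outcome_dependent_mar(n=5000, seed=0):
    """Return (x_true, x_observed, y) rows; x_observed is None when masked.

    All distributions and coefficients here are illustrative assumptions,
    not values taken from the paper.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.lognormvariate(0.0, 1.0)          # highly skewed covariate
        y = 1.0 + 0.5 * x + rng.gauss(0.0, 1.0)   # outcome depends on x
        # Missingness probability is a function of y only, which is always
        # observed, so the mechanism is MAR (and outcome-dependent).
        p_miss = 1.0 / (1.0 + math.exp(-(y - 2.0)))
        x_obs = None if rng.random() < p_miss else x
        rows.append((x, x_obs, y))
    return rows


rows = simulate_outcome_dependent_mar()
true_mean = sum(x for x, _, _ in rows) / len(rows)
observed = [x_obs for _, x_obs, _ in rows if x_obs is not None]
cc_mean = sum(observed) / len(observed)
missing_fraction = 1.0 - len(observed) / len(rows)

print(f"missing fraction:   {missing_fraction:.2f}")
print(f"true mean of x:     {true_mean:.2f}")
print(f"complete-case mean: {cc_mean:.2f}")
```

Because large values of y are more likely to have x masked, and large x tends to produce large y, the complete-case mean of x falls below the true mean in this sketch. That gap is the kind of distortion an imputation method is expected to correct.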

Results: Both missForest and CALIBERrfimpute have high predictive accuracy, but missForest can produce severely biased regression coefficient estimates and confidence interval coverage below the nominal level, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than those of PMM for logistic regression relationships with interaction.
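
The two quantities discussed above, bias of a regression coefficient and confidence interval coverage, are usually computed across simulation replicates. The sketch below shows one common way to do so, assuming Wald-type 95% intervals; the helper names and the input numbers are made up for illustration and are not results from the paper.

```python
def relative_bias(estimates, truth):
    """Mean estimate minus the true value, as a fraction of the true value."""
    mean_estimate = sum(estimates) / len(estimates)
    return (mean_estimate - truth) / truth


def ci_coverage(estimates, std_errors, truth, z=1.96):
    """Share of replicates whose Wald interval est +/- z*se contains truth."""
    hits = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= truth <= est + z * se
    )
    return hits / len(estimates)


# Hypothetical coefficient estimates and standard errors from 4 replicates.
true_beta = 0.5
beta_hats = [0.48, 0.52, 0.40, 0.45]
ses = [0.03, 0.03, 0.03, 0.03]

bias = relative_bias(beta_hats, true_beta)
coverage = ci_coverage(beta_hats, ses, true_beta)
print(f"relative bias: {bias:+.3f}")
print(f"95% CI coverage: {coverage:.2f}")
```

Coverage well below 0.95 over many replicates indicates intervals that are too narrow, shifted by bias, or both.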

Conclusions: RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7382855
DOI: http://dx.doi.org/10.1186/s12874-020-01080-1

Publication Analysis

Top Keywords

missing data: 20
imputation methods: 12
rf-based imputation: 12
data: 8
missforest caliberrfimpute: 8
highly skewed: 8
imputation: 6
missing: 6
missforest: 5
accuracy random-forest-based: 4

Similar Publications

Metabolic syndrome (MetS) in adolescents is a growing public health issue linked to obesity, hypertension, and insulin resistance, increasing risks of cardiovascular disease and mental health problems. Early detection and intervention are crucial but often hindered by complex diagnostic requirements. This study aims to develop a predictive model using NHANES data, excluding biochemical indicators, to provide a simple, cost-effective tool for large-scale, non-medical screening and early prevention of adolescent MetS.

Diabetes is a growing health concern in developing countries, causing considerable mortality rates. While machine learning (ML) approaches have been widely used to improve early detection and treatment, several studies have shown low classification accuracies due to overfitting, underfitting, and data noise. This research employs parallel and sequential ensemble ML approaches paired with feature selection techniques to boost classification accuracy.

The characteristics of data produced by omics technologies are pivotal, as they critically influence the feasibility and effectiveness of computational methods applied in downstream analyses, such as data harmonization and differential abundance analyses. Furthermore, variability in these data characteristics across datasets plays a crucial role, leading to diverging outcomes in benchmarking studies, which are essential for guiding the selection of appropriate analysis methods in all omics fields. Additionally, downstream analysis tools are often developed and applied within specific omics communities due to the presumed differences in data characteristics attributed to each omics technology.

[Solid, endometrial-like and transitional growth patterns of ovarian high-grade serous carcinoma: a clinicopathological analysis of 25 cases].

Zhonghua Bing Li Xue Za Zhi

February 2025

Department of Pathology, the Affiliated Suzhou Hospital of Nanjing Medical University, Suzhou Municipal Hospital, Gusu School, Nanjing Medical University, Suzhou 215002, China.

To investigate the clinicopathological characteristics of solid, endometrial-like and transitional (SET) cell growth subtype in high-grade serous ovarian carcinoma (HGSC). Clinical data of 25 cases of HGSC-SET were collected from January 2020 to March 2024 at the Affiliated Suzhou Hospital of Nanjing Medical University, and their histological features were analyzed. Immunohistochemical stains were used to analyze the expression of ER, PR, PAX8, WT-1, p16, p53 and Ki-67.

Background Context: Recumbent MRI is the most widely used imaging modality in people with low back pain (LBP). However, it has been proposed that upright (standing) MRI has advantages over recumbent MRI because of its ability to assess the effects of weight-bearing. It has been suggested that this produces systematic differences in MRI parameters and differences in the correlation between MRI parameters and pain or disability in patients, thus potentially adding clinically helpful information.

Purpose: This paper aims to review and summarize the available empirical evidence for or against these two hypotheses.
