Distance Correlation-Based Feature Selection in Random Forest.

Entropy (Basel)

Department of Mathematics, California State University, San Bernardino, CA 92407, USA.

Published: August 2023

The Pearson correlation coefficient (ρ) is a commonly used measure of correlation, but it is limited in that it measures only the linear relationship between two numerical variables. Distance correlation, by contrast, measures all types of dependence, not just linear ones, between random vectors of arbitrary dimension. In this paper, we propose a filter method that uses distance correlation as the criterion for feature selection in Random Forest regression. We conduct extensive simulation studies to compare its performance with that of existing methods under various data settings, in terms of prediction mean squared error. The results show that the proposed method is competitive with existing methods and outperforms all of them on high-dimensional (p≥300), nonlinearly related data sets. The applicability of the proposed method is also illustrated by two real data applications.
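As an illustration of the idea in the abstract, the following is a minimal sketch of a distance-correlation filter, not the authors' implementation. It assumes the standard biased (V-statistic) sample distance correlation computed from double-centered pairwise distance matrices, and a simple "keep the k highest-scoring features" rule; the function names `distance_correlation` and `select_top_k` and the parameter `k` are hypothetical and do not come from the paper.

```python
import numpy as np


def _centered_distances(x):
    """Double-centered pairwise Euclidean distance matrix of the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D array as n observations of one variable
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation(x, y):
    """Biased sample distance correlation between x and y, a value in [0, 1]."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()       # squared sample distance covariance
    dvar_x = (a * a).mean()      # squared sample distance variance of x
    dvar_y = (b * b).mean()      # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:             # a constant variable carries no dependence
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def select_top_k(X, y, k):
    """Hypothetical filter rule: indices of the k features scoring highest."""
    scores = np.array([distance_correlation(X[:, j], y)
                       for j in range(X.shape[1])])
    order = np.argsort(-scores)  # descending by distance correlation
    return order[:k], scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(300, 5))
    y = X[:, 2] ** 2             # purely nonlinear dependence on feature 2
    top, scores = select_top_k(X, y, k=1)
    print("Pearson r with feature 2:", np.corrcoef(X[:, 2], y)[0, 1])
    print("distance correlations:", np.round(scores, 3))
    print("selected feature index:", top[0])
```

In a pipeline of the kind the abstract describes, the selected columns `X[:, top]` would then be passed to a random forest regressor (for example scikit-learn's `RandomForestRegressor`); how the paper chooses the number of retained features is not stated in the abstract, so `k` here is purely illustrative.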


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10528294
DOI: http://dx.doi.org/10.3390/e25091250


