Splitting on categorical predictors in random forests.

PeerJ

Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany.

Published: February 2019

One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical methods and machine learning approaches, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors is to consider all 2^(k-1) - 1 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k - 1 splits have to be considered for a nominal predictor with k categories. For multiclass classification and survival prediction, no ordering method producing equivalent splits exists. We therefore propose a heuristic that orders the categories according to the first principal component of the weighted covariance matrix in multiclass classification and by log-rank scores in survival prediction. This ordering of categories can be done either in every split or a priori, that is, just once before growing the forest. With this approach, the nominal predictor can be treated as ordinal in the entire RF procedure, speeding up the computation and avoiding category limits. We compare the proposed methods with the standard approach, dummy coding, and simply ignoring the nominal nature of the predictors, in several simulation settings and on real data, in terms of prediction performance and computational efficiency. We show that ordering the categories a priori is at least as good as the standard approach of considering all 2-partitions in all datasets considered, while being computationally faster. We recommend using this approach as the default in RFs.
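
The split counts and the ordering idea for the regression case can be illustrated with a small sketch. It is not the authors' implementation: all function names are hypothetical, the split criterion is assumed to be the summed squared error of the two child nodes, and categories are assumed to be ordered by their mean response. Under those assumptions, the best of the k - 1 ordered splits should have the same cost as the best of the 2^(k-1) - 1 exhaustive 2-partitions.

```python
from itertools import combinations
from statistics import mean


def n_partition_splits(k):
    """Number of 2-partitions of k categories: 2^(k-1) - 1."""
    return 2 ** (k - 1) - 1


def n_ordered_splits(k):
    """Number of split points once the k categories are ordered: k - 1."""
    return k - 1


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty node)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def split_cost(x, y, left):
    """Assumed criterion: squared error summed over both child nodes."""
    left_y = [yi for xi, yi in zip(x, y) if xi in left]
    right_y = [yi for xi, yi in zip(x, y) if xi not in left]
    return sse(left_y) + sse(right_y)


def best_exhaustive_split(x, y):
    """Evaluate every 2-partition of the categories (standard approach)."""
    cats = sorted(set(x))
    first, rest = cats[0], cats[1:]
    best_cost, best_left = None, None
    # Pin the first category to the left node so mirrored partitions
    # are not counted twice; r is the number of further categories on the left.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            cost = split_cost(x, y, left)
            if best_cost is None or cost < best_cost:
                best_cost, best_left = cost, left
    return best_cost, best_left


def best_ordered_split(x, y):
    """Order categories by mean response, then try only the k - 1 cut points."""
    def category_mean(c):
        return mean([yi for xi, yi in zip(x, y) if xi == c])

    cats = sorted(set(x), key=category_mean)
    best_cost, best_left = None, None
    for i in range(1, len(cats)):
        left = set(cats[:i])
        cost = split_cost(x, y, left)
        if best_cost is None or cost < best_cost:
            best_cost, best_left = cost, left
    return best_cost, best_left


# Toy data (made up for illustration): one nominal predictor, numeric response.
x = ["a", "a", "b", "b", "c", "c", "d", "d"]
y = [1.0, 2.0, 9.0, 10.0, 1.5, 2.5, 8.0, 9.5]

print(n_partition_splits(4), n_ordered_splits(4))
print(best_exhaustive_split(x, y))
print(best_ordered_split(x, y))
```

The exhaustive search grows exponentially in k, while the ordered search needs one sort plus k - 1 evaluations, which is why the ordered approach avoids the category limits mentioned in the abstract. The sketch covers only regression; the multiclass and survival orderings described above are heuristics and are not shown here.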

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6368971
DOI: http://dx.doi.org/10.7717/peerj.6339
