Robustness of Random Forest-based gene selection methods.

BMC Bioinformatics

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Pawinskiego 5A, 02-106 Warsaw, Poland.

Published: January 2014

Background: Gene selection is an important part of microarray data analysis because it provides information that can lead to a better mechanistic understanding of an investigated phenomenon. At the same time, gene selection is very difficult because of the noisy nature of microarray data. As a consequence, gene selection is often performed with machine learning methods. The Random Forest method is particularly well suited for this purpose. In this work, four state-of-the-art Random Forest-based feature selection methods were compared in a gene selection context. The analysis focused on the stability of selection, which is necessary for determining the significance of results but is often ignored in similar studies.

Results: A comparison of the post-selection accuracy of Random Forest classifiers, as assessed by validation, revealed that all investigated methods were equivalent in this respect. However, the methods differed substantially with respect to the number of selected genes and the stability of selection. Of the analysed methods, the Boruta algorithm predicted the most genes as potentially important.

Conclusions: The post-selection classifier error rate, which is a frequently used measure, was found to be a potentially deceptive measure of gene selection quality. When the number of consistently selected genes was considered, the Boruta algorithm was clearly the best. Although it was also the most computationally intensive method, the Boruta algorithm's computational demands could be reduced to levels comparable to those of other algorithms by replacing the Random Forest importance with a comparable measure from Random Ferns (a similar but simplified classifier). Despite their design assumptions, the minimal optimal selection methods were found to select a high fraction of false positives.
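
As an illustration of what "consistently selected genes" and "stability of selection" can mean in practice, the following minimal sketch counts how often each gene is selected across repeated selection runs and summarises the agreement between runs. It is not the procedure used in the paper: the function names, the 0.75 threshold, the toy gene labels, and the use of the mean pairwise Jaccard index as a stability score are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def selection_frequency(selections):
    """Return, for each gene, the fraction of runs in which it was selected."""
    counts = Counter(gene for run in selections for gene in set(run))
    n_runs = len(selections)
    return {gene: count / n_runs for gene, count in counts.items()}


def consistently_selected(selections, min_fraction=0.75):
    """Genes selected in at least `min_fraction` of the runs (hypothetical threshold)."""
    freq = selection_frequency(selections)
    return sorted(gene for gene, f in freq.items() if f >= min_fraction)


def mean_pairwise_jaccard(selections):
    """Mean Jaccard index over all pairs of runs; 1.0 means identical selections."""
    sets = [set(run) for run in selections]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0
    scores = [len(a & b) / len(a | b) if (a | b) else 1.0 for a, b in pairs]
    return sum(scores) / len(scores)


# Toy data: genes selected in four resampled runs of some selection method.
runs = [
    ["g1", "g2", "g3"],
    ["g1", "g2", "g4"],
    ["g1", "g2"],
    ["g1", "g5"],
]

stable_genes = consistently_selected(runs, min_fraction=0.75)
stability = mean_pairwise_jaccard(runs)
print(stable_genes, round(stability, 3))
```

In this toy example, two methods could produce classifiers of equal accuracy while differing widely in `stable_genes` and `stability`, which is the kind of discrepancy the conclusions warn about when the error rate alone is used to judge selection quality.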

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3897925
DOI: http://dx.doi.org/10.1186/1471-2105-15-8

Publication Analysis

Top Keywords

gene selection: 24
selection methods: 12
random forest: 12
selection: 10
random forest-based: 8
microarray data: 8
stability selection: 8
selected genes: 8
boruta algorithm: 8
methods: 7

Similar Publications

Objective: Heavy metal pollution is one of the more recent problems of environmental degradation caused by rapid industrialization and human activity. The objective of this study was to isolate, screen, and characterize heavy metal-resistant bacteria from solid waste disposal sites.

Methods: In this study, a total of 18 soil samples were randomly selected from mechanical sites, metal workshops, and agricultural land that received wastewater irrigation.

Insular species are usually endemic and prone to long-term population reduction, low genetic diversity, and inbreeding depression, which results in difficulties in species conservation. The situation is even more challenging for the glacial relict species whose habitats are usually fragmented in the mountainous regions. is an endangered and endemic relict tree species in Taiwan.

The wall-associated kinase (WAK) gene family encodes functional cell wall-related proteins. These genes are widely presented in plants and serve as the receptors of plant cell membranes, which perceive the external environment changes and activate signaling pathways to participate in plant growth, development, defense, and stress response. However, the WAK gene family and the encoded proteins in soybean (Glycine max (L.

Abraham Patchornik was born in 1926 in Ness Ziona, a town in Palestine founded by his great-grandfather Reuben Lehrer in 1883. He started to study chemistry as an undergraduate at the Hebrew University. However, this was interrupted by the war, and he completed his studies in various locations in West Jerusalem.

Polyketide synthases (PKSs) are multidomain enzymatic assembly lines that biosynthesize a wide selection of bioactive natural products from simple building blocks. In contrast to their cis-acyltransferase (AT) counterparts, trans-AT PKSs rely on stand-alone ATs to load extender units onto acyl carrier protein (ACP) domains embedded in the core PKS machinery. trans-AT PKS gene clusters also encode stand-alone acyl hydrolases (AHs), which are predicted to share the overall fold of ATs but function like type II thioesterases (TEs), hydrolyzing aberrant acyl chains from ACP domains to promote biosynthetic efficiency.
