Iterative feature removal yields highly discriminative pathways.

Stephen O'Hara, Kun Wang, Richard A Slayden, Alan R Schenkel, Greg Huber, Corey S O'Hern, Mark D Shattuck, Michael Kirby

BMC Genomics

Department of Mathematics, Colorado State University, Fort Collins, CO, USA.

Published: November 2013

Background: We introduce Iterative Feature Removal (IFR) as an unbiased approach for selecting features with diagnostic capacity from large data sets. The algorithm is based on recently developed tools in machine learning that are driven by sparse feature selection goals. When applied to genomic data, our method is designed to identify genes that can provide deeper insight into complex interactions while remaining directly connected to diagnostic utility. We contrast this approach with the search for a minimal best set of discriminative genes, which can provide only an incomplete picture of the biological complexity.

Results: Microarray data sets typically contain far more features (genes) than samples. For this type of data, we demonstrate that there are many equivalently predictive subsets of genes. We iteratively train a classifier using features identified via a sparse support vector machine. At each iteration, we remove all the features that were previously selected. We found that we could iterate many times before a sustained drop in accuracy occurred, with each iteration removing approximately 30 genes from consideration. The classification accuracy on test data remains essentially flat even as hundreds of top genes are removed. Our method identifies sets of genes that are highly predictive, even when composed of genes that individually are not. Through automated and manual analysis of the selected genes, we demonstrate that the selected features expose relevant pathways that other approaches would have missed.
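The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: scikit-learn's L1-penalized `LinearSVC` stands in for the sparse support vector machine, the data are synthetic (`make_classification`), the sparse model itself is scored on the test split rather than a separately trained classifier, and the function name `iterative_feature_removal` and the values of `C`, `tol`, and `max_iter` are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def iterative_feature_removal(X_train, y_train, X_test, y_test,
                              max_iter=10, C=0.1, tol=1e-6):
    """Repeatedly fit a sparse linear SVM, record the features it uses,
    and remove those features before the next fit.

    Returns a list of (selected_feature_indices, test_accuracy) tuples,
    one per iteration. All parameter defaults are illustrative assumptions.
    """
    remaining = np.arange(X_train.shape[1])  # feature indices still available
    history = []
    for _ in range(max_iter):
        if remaining.size == 0:
            break
        # The L1 penalty drives most weights to zero, giving a sparse model.
        clf = LinearSVC(penalty="l1", dual=False, C=C, max_iter=10000)
        clf.fit(X_train[:, remaining], y_train)
        # A feature counts as selected if any class gives it a nonzero weight.
        used = np.any(np.abs(clf.coef_) > tol, axis=0)
        if not used.any():
            break
        accuracy = clf.score(X_test[:, remaining], y_test)
        history.append((remaining[used], accuracy))
        # Remove the selected features from all later iterations.
        remaining = remaining[~used]
    return history


# Synthetic stand-in for microarray data: many more features than samples.
X, y = make_classification(n_samples=100, n_features=500, n_informative=50,
                           n_redundant=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Standardize using statistics from the training split only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

history = iterative_feature_removal(X_train, y_train, X_test, y_test)
for i, (selected, accuracy) in enumerate(history):
    print(f"iteration {i}: removed {selected.size} features, "
          f"test accuracy {accuracy:.2f}")
```

Plotting the recorded accuracy against the iteration index is one way to look for the sustained drop described above. The number of features selected per iteration depends on `C` and on the data, so the figure of roughly 30 genes per iteration should not be expected from this synthetic example.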

Conclusions: Our results challenge the paradigm of using feature selection techniques to design parsimonious classifiers from microarray and similar high-dimensional, small-sample-size data sets. The fact that there are many subsets of genes that work equally well to classify the data provides a strong counter-result to the notion that there is a small number of "top genes" that should be used to build classifiers. In our results, the best classifiers were formed using genes with limited univariate power, thus illustrating that deeper mining of features using multivariate techniques is important.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879090
DOI: http://dx.doi.org/10.1186/1471-2164-14-832
