Finding a good predictive model for a high-dimensional data set can be challenging. For genetic data, it is important not only to find a model with high predictive accuracy but also that the model uses only a few features and that the selection of these features is stable. This is because, in bioinformatics, models are used not only for prediction but also for drawing biological conclusions, which makes the interpretability and reliability of the model crucial. We suggest using three target criteria when fitting a predictive model to a high-dimensional data set: the classification accuracy, the stability of the feature selection, and the number of chosen features. As it is unclear which measure is best for evaluating the stability, we first compare a variety of stability measures. We conclude that the Pearson correlation has the best theoretical and empirical properties. Also, we find that, for the behaviour of the stability assessment, it is most important that a measure contains a correction for chance or for large numbers of chosen features. Then, we analyse Pareto fronts and conclude that it is possible to find models with a stable selection of few features without losing much predictive accuracy.
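
The two ideas in the abstract, a Pearson-correlation stability measure and a Pareto front over the three target criteria, can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so this assumes one common formulation: stability is the mean pairwise Pearson correlation between binary selection vectors from repeated runs (for example, resampling iterations). The function names `pearson`, `selection_stability`, and `pareto_front`, and the tuple layout `(accuracy, stability, n_features)`, are hypothetical choices made for illustration, not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Undefined when a run selects no features or all features.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def selection_stability(selected_sets, n_features):
    """Mean pairwise Pearson correlation between binary selection vectors.

    Assumed formulation: each run's selected feature indices become a 0/1
    vector of length n_features, and all pairs of runs are compared.
    """
    vectors = [
        [1 if j in selected else 0 for j in range(n_features)]
        for selected in selected_sets
    ]
    pairs = list(combinations(vectors, 2))
    return sum(pearson(u, v) for u, v in pairs) / len(pairs)


def pareto_front(models):
    """Return the models that no other model dominates.

    Each model is a tuple (accuracy, stability, n_features); accuracy and
    stability are maximised, the number of features is minimised.
    """
    def dominates(a, b):
        no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
        better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
        return no_worse and better

    return [m for m in models if not any(dominates(o, m) for o in models)]


# Illustrative, made-up inputs: three runs over 6 features, three candidate models.
runs = [{0, 1, 2}, {0, 1, 3}, {0, 1, 2}]
stability = selection_stability(runs, n_features=6)
candidates = [(0.90, 0.8, 10), (0.91, 0.5, 50), (0.85, 0.7, 20)]
front = pareto_front(candidates)
```

Because the correlation is computed on centred vectors, agreement that arises only from selecting many features is discounted, which matches the abstract's point that a correction for chance or for large numbers of chosen features matters. In the made-up candidates, the third model is dominated by the first on all three criteria and is therefore excluded from the front.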

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5556617
DOI: http://dx.doi.org/10.1155/2017/7907163

Similar Publications

scSMD: a deep learning method for accurate clustering of single cells based on auto-encoder.

BMC Bioinformatics

January 2025

Department of Surgery, Shanghai Key Laboratory of Gastric Neoplasms, Shanghai Institute of Digestive Surgery, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.

Background: Single-cell RNA sequencing (scRNA-seq) has transformed biological research by offering new insights into cellular heterogeneity, developmental processes, and disease mechanisms. As scRNA-seq technology advances, its role in modern biology has become increasingly vital. This study explores the application of deep learning to single-cell data clustering, with a particular focus on managing sparse, high-dimensional data.

Air pollution is a critical global environmental issue, further exacerbated by rapid industrialization and urbanization. Accurate prediction of air pollutant concentrations is essential for effective pollution prevention and control measures. The complex nature of pollutant data, which is influenced by fluctuating meteorological conditions, diverse pollution sources, and propagation processes, underscores the crucial importance of spatial and temporal feature extraction for accurately predicting air pollutant concentrations.

CRAmed: a conditional randomization test for high-dimensional mediation analysis in sparse microbiome data.

Bioinformatics

January 2025

Department of Statistics, School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China.

Motivation: Numerous microbiome studies have revealed significant associations between the microbiome and human health and disease. These findings have motivated researchers to explore the causal role of the microbiome in human complex traits and diseases. However, the complexities of microbiome data pose challenges for statistical analysis and interpretation of causal effects.

Tensor neural networks for high-dimensional Fokker-Planck equations.

Neural Netw

January 2025

Division of Applied Mathematics, Brown University, Providence, RI 02912, USA; Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, Richland, WA, United States. Electronic address:

We solve high-dimensional steady-state Fokker-Planck equations on the whole space by applying tensor neural networks. The tensor networks are a linear combination of tensor products of one-dimensional feedforward networks or a linear combination of several selected radial basis functions. The use of tensor feedforward networks allows us to efficiently exploit auto-differentiation (in physical variables) in major Python packages while using radial basis functions can fully avoid auto-differentiation, which is rather expensive in high dimensions.

Understanding how the collective activity of neural populations relates to computation and ultimately behavior is a key goal in neuroscience. To this end, statistical methods which describe high-dimensional neural time series in terms of low-dimensional latent dynamics have played a fundamental role in characterizing neural systems. Yet, what constitutes a successful method involves two opposing criteria: (1) methods should be expressive enough to capture complex nonlinear dynamics, and (2) they should maintain a notion of interpretability often only warranted by simpler linear models.
