When data exhibit an imbalance between a large number d of covariates and a small number n of samples, clinical outcome prediction is impaired by overfitting and prohibitive computational demands. Here we study two simple Bayesian prediction protocols that can be applied to data of any dimension and any number of outcome classes. Calculating the Bayesian integrals and optimal hyperparameters analytically leaves only a small number of numerical integrations, and CPU demands scale as O(nd). We compare the performance of these protocols on synthetic and genomic data with that of the mclustDA method of Fraley and Raftery. For small d, they perform as well as mclustDA or better. For d = 10,000 or more, mclustDA breaks down computationally, while the Bayesian methods remain efficient. This allows us to explore phenomena typical of classification in high-dimensional spaces, such as overfitting and the reduced discriminative effectiveness of signatures compared to intra-class variability.
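To give a feel for what O(nd) scaling means in practice, the following is a minimal sketch of a generic Gaussian classifier with independent covariates, where each class variance is shrunk toward a prior value so that it stays positive when a class has very few samples. It is not one of the two protocols studied in the paper, whose details the abstract does not give. The function names `fit` and `predict` and the parameters `prior_var` and `prior_weight` are hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict


def fit(X, y, prior_var=1.0, prior_weight=1.0):
    """Estimate per-class means and shrunken per-covariate variances.

    Illustrative only: every sample and covariate is visited a constant
    number of times, so the cost grows as O(n * d).
    """
    d = len(X[0])
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)

    model = {}
    for label, rows in groups.items():
        m = len(rows)
        mean = [sum(r[j] for r in rows) / m for j in range(d)]
        # Shrink the sample variance toward prior_var (an assumed prior
        # value) so the estimate is well defined even for tiny classes.
        var = [
            (sum((r[j] - mean[j]) ** 2 for r in rows) + prior_weight * prior_var)
            / (m + prior_weight)
            for j in range(d)
        ]
        log_class_prior = math.log(m / len(X))
        model[label] = (log_class_prior, mean, var)
    return model


def predict(model, x):
    """Return the class with the highest log posterior score for sample x."""

    def log_score(params):
        log_class_prior, mean, var = params
        log_lik = -0.5 * sum(
            math.log(2 * math.pi * v) + (xj - mu) ** 2 / v
            for xj, mu, v in zip(x, mean, var)
        )
        return log_class_prior + log_lik

    return max(model, key=lambda label: log_score(model[label]))


if __name__ == "__main__":
    # Toy data: two well-separated classes in d = 2 dimensions.
    X = [[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [4.8, 4.9]]
    y = ["a", "a", "b", "b"]
    model = fit(X, y)
    print(predict(model, [0.1, 0.0]))
```

Because no d-by-d covariance matrix is ever formed or inverted, memory and time stay linear in d, which is the property that keeps this kind of model usable when d is in the tens of thousands.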

Source: http://dx.doi.org/10.1177/0962280216628901
