Clustering is a fundamental tool for exploratory data analysis and is ubiquitous across scientific disciplines. The Gaussian Mixture Model (GMM) is a popular probabilistic and interpretable model for clustering. In many practical settings, the true data distribution, which is unknown, may be non-Gaussian and may be contaminated by noise or outliers. In such cases, clustering may still be done with a misspecified GMM. However, this may lead to incorrect classification of the underlying subpopulations. In this paper, we identify and characterize the problem of inferior clustering solutions. Like the well-known spurious solutions, these inferior solutions have high likelihood and poor cluster interpretation; however, they differ from spurious solutions in other characteristics, such as asymmetry in the fitted components. We theoretically analyze this asymmetry and its relation to misspecification. We propose a new penalty term that is designed to avoid both inferior and spurious solutions. Using this penalty term, we develop a new model selection criterion and a new GMM-based clustering algorithm, SIA. We empirically demonstrate that, in cases of misspecification, SIA avoids inferior solutions and outperforms previous GMM-based clustering methods.
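To make the setting concrete, the following is a minimal sketch of fitting a misspecified GMM: plain expectation-maximization (EM) for a one-dimensional, two-component Gaussian mixture, applied to data drawn from two uniform (non-Gaussian) clusters. It is not the SIA algorithm and does not include the proposed penalty term, whose details are not given in the abstract. The function names `sample_uniform_clusters` and `fit_gmm_1d`, the cluster ranges, and the min/max initialization are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def sample_uniform_clusters(n_per_cluster, seed=0):
    """Hypothetical non-Gaussian data: two uniform clusters on (-3, -1) and (1, 3)."""
    rng = random.Random(seed)
    left = [rng.uniform(-3.0, -1.0) for _ in range(n_per_cluster)]
    right = [rng.uniform(1.0, 3.0) for _ in range(n_per_cluster)]
    return left + right


def fit_gmm_1d(data, k=2, n_iter=50):
    """Plain EM for a 1-D GMM with k >= 2 components (no penalty, no safeguards)."""
    n = len(data)
    lo, hi = min(data), max(data)
    # Assumed initialization: means evenly spaced between the data extremes.
    means = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    history = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history


if __name__ == "__main__":
    data = sample_uniform_clusters(100, seed=0)
    weights, means, variances, history = fit_gmm_1d(data)
    print("weights:", weights)
    print("means:", means)
    print("variances:", variances)
    print("final log-likelihood:", history[-1])
```

On this well-separated toy data the fitted Gaussians should still recover the two groups. The abstract's point is that with harder non-Gaussian or contaminated data, a high likelihood alone does not guarantee a meaningful clustering.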
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10628229 | PMC |
| http://dx.doi.org/10.1038/s41598-023-44608-3 | DOI Listing |
Nat Food
January 2025
School of Biological Sciences, University of Aberdeen, Aberdeen, UK.
Nutritional epidemiology aims to link dietary exposures to chronic disease, but the instruments for evaluating dietary intake are inaccurate. One way to identify unreliable data and the sources of errors is to compare estimated intakes with the total energy expenditure (TEE). In this study, we used the International Atomic Energy Agency Doubly Labeled Water Database to derive a predictive equation for TEE using 6,497 measures of TEE in individuals aged 4 to 96 years.
NAR Genom Bioinform
March 2025
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.
Evaluating the accuracy of protein-coding sequences in genome annotations is a challenging problem for which there is no broadly applicable solution. In this manuscript, we introduce PSAURON (Protein Sequence Assessment Using a Reference ORF Network), a novel software tool developed to help assess the quality of protein-coding gene annotations. Utilizing a machine learning model trained on a diverse dataset from over 1000 plant and animal genomes, PSAURON assigns a score to coding DNA or protein sequence that reflects the likelihood that the sequence is a genuine protein-coding region.
Sci Rep
December 2024
Department of Geophysics, Graduate School of Science, Tohoku University, Sendai, 980-8578, Japan.
Accurate characterisation of seismic source mechanisms in mining environments is crucial for effective hazard mitigation, but it is complicated by the presence of anisotropic geological conditions. Neglecting anisotropic effects during moment tensor (MT) inversion introduces significant distortions in the retrieved source characteristics. In this study, we investigated the impact of ignoring anisotropy during MT inversion on the reliability of hazard assessment.
Phys Chem Chem Phys
January 2025
Department of Physical Chemistry, University of Chemistry and Technology Prague, Technická 5, CZ-166 28 Prague 6, Praha, Czech Republic.
Poor aqueous solubility of crystalline active pharmaceutical ingredients (APIs) restricts their bioavailability. Amorphous solid dispersions with biocompatible polymer excipients offer a solution to overcome this problem, potentially enabling a broader use of many drug candidate molecules. This work addresses various aspects of the design of a suitable combination of an API and a polymer to form such a binary solid dispersion.
Can J Cardiol
December 2024
MAP Centre for Urban Health Solutions, St. Michael's Hospital, Toronto, Ontario, Canada; Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada.
Systematic error, often referred to as bias, is an inherent challenge in observational cardiovascular research and has the potential to profoundly influence the design, conduct, and interpretation of study results. If not carefully considered and managed, bias can lead to spurious results, which can misinform clinical practice or public health initiatives and compromise patient outcomes. This methodological primer offers a concise introduction to the identification, evaluation, and mitigation of bias in observational cardiovascular research studies assessing the causal association of an exposure (or treatment) on an outcome.