Background: Topic models are a class of unsupervised machine learning models, which facilitate summarization, browsing and retrieval from large unstructured document collections. This study reviews several methods for assessing the quality of unsupervised topic models estimated using non-negative matrix factorization. Techniques for topic model validation have been developed across disparate fields. We synthesize this literature, discuss the advantages and disadvantages of different techniques for topic model validation, and illustrate their usefulness for guiding model selection on a large clinical text corpus.
Design, Setting And Data: Using a retrospective cohort design, we curated a text corpus containing 382,666 clinical notes collected between 01/01/2017 and 12/31/2020 from primary care electronic medical records in Toronto, Canada.
Methods: Several topic model quality metrics have been proposed to assess different aspects of model fit. We explored the following metrics: reconstruction error, topic coherence, rank biased overlap, Kendall's weighted tau, partition coefficient, partition entropy and the Xie-Beni statistic. Depending on context, cross-validation and/or bootstrap stability analysis were used to estimate these metrics on our corpus.
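As an illustration of two of the metrics listed above, the following is a minimal sketch of the partition coefficient and partition entropy in their standard fuzzy-clustering form, assuming each document's non-negative topic weights are first rescaled to sum to 1. The helper names and the toy weight matrix are hypothetical and are not taken from the study.

```python
import math

def normalize_rows(weights):
    """Rescale each document's non-negative topic weights to sum to 1."""
    normalized = []
    for row in weights:
        total = sum(row)
        if total <= 0:
            raise ValueError("each document needs at least one positive topic weight")
        normalized.append([w / total for w in row])
    return normalized

def partition_coefficient(memberships):
    """Mean sum of squared memberships: 1/K for diffuse, 1 for crisp assignments."""
    return sum(sum(u * u for u in row) for row in memberships) / len(memberships)

def partition_entropy(memberships):
    """Mean Shannon entropy of memberships: 0 for crisp, log(K) for diffuse."""
    total = 0.0
    for row in memberships:
        # Zero memberships contribute nothing (0 * log 0 is taken as 0).
        total -= sum(u * math.log(u) for u in row if u > 0)
    return total / len(memberships)

# Hypothetical document-topic weights: 3 documents, K = 2 topics.
W = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
U = normalize_rows(W)
pc = partition_coefficient(U)
pe = partition_entropy(U)
print(pc, pe)
```

Under these definitions, a higher partition coefficient and a lower partition entropy both indicate that documents load on fewer topics.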
Results: Cross-validated reconstruction error favored large topic models (K ≥ 100 topics) on our corpus. Stability analysis using topic coherence and the Xie-Beni statistic also favored large models (K = 100 topics). Rank biased overlap and Kendall's weighted tau favored small models (K = 5 topics). Few model evaluation metrics suggested mid-sized topic models (25 ≤ K ≤ 75) as being optimal. However, human judgement suggested that mid-sized topic models produced expressive low-dimensional summarizations of the corpus.
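For readers unfamiliar with rank biased overlap, the sketch below computes its truncated (non-extrapolated) form for two ranked lists of top words. The persistence parameter `p`, the function name, and the example word lists are assumptions made for illustration; the abstract does not state which variant or settings the study used.

```python
def rank_biased_overlap(ranking_a, ranking_b, p=0.9):
    """Truncated RBO: (1 - p) * sum over depths d of p**(d - 1) * agreement at d.

    Assumes each ranking contains no duplicate items. The sum stops at the
    shorter list's length, so identical lists of depth k score 1 - p**k, not 1.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        # Agreement at depth d: fraction of the top-d items shared by both lists.
        agreement = len(seen_a & seen_b) / d
        score += (p ** (d - 1)) * agreement
    return (1 - p) * score

# Hypothetical top-word lists for two topics.
topic_1 = ["pain", "knee", "injury"]
topic_2 = ["knee", "pain", "injury"]
rbo_same = rank_biased_overlap(topic_1, topic_1, p=0.5)
rbo_swapped = rank_biased_overlap(topic_1, topic_2, p=0.5)
rbo_disjoint = rank_biased_overlap(topic_1, ["cough", "fever", "flu"], p=0.5)
```

Smaller values of `p` place more weight on agreement at the top of the lists, which is why swapping the first two words lowers the score even though both lists contain the same words.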
Conclusions: Topic model quality indices are transparent quantitative tools for guiding model selection and evaluation. Our empirical illustration demonstrated that different topic model quality indices favor models of different complexity and may not select models that align with human judgment. This suggests that different metrics capture different aspects of model goodness of fit. A combination of topic model quality indices, coupled with human validation, may be useful in appraising unsupervised topic models.
Download full-text PDF |
Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10362613 | PMC |
http://dx.doi.org/10.1186/s12911-023-02216-1 | DOI Listing |