Topic models are a class of unsupervised learning algorithms for detecting the semantic structure within a text corpus. Together with a subsequent dimensionality reduction algorithm, topic models can be used for deriving spatializations for text corpora as two-dimensional scatter plots, reflecting semantic similarity between the documents and supporting corpus analysis. Although the choice of the topic model, the dimensionality reduction, and their underlying hyperparameters significantly impact the resulting layout, it is unknown which particular combinations result in high-quality layouts with respect to accuracy and perception metrics. To investigate the effectiveness of topic models and dimensionality reduction methods for the spatialization of corpora as two-dimensional scatter plots (or basis for landscape-type visualizations), we present a large-scale, benchmark-based computational evaluation. Our evaluation consists of (1) a set of corpora, (2) a set of layout algorithms that are combinations of topic models and dimensionality reductions, and (3) quality metrics for quantifying the resulting layout. The corpora are given as document-term matrices, and each document is assigned to a thematic class. The chosen metrics quantify the preservation of local and global properties and the perceptual effectiveness of the two-dimensional scatter plots. By evaluating the benchmark on a computing cluster, we derived a multivariate dataset with over 45 000 individual layouts and corresponding quality metrics. Based on the results, we propose guidelines for the effective design of text spatializations that are based on topic models and dimensionality reductions. As a main result, we show that interpretable topic models are beneficial for capturing the structure of text corpora. We furthermore recommend the use of t-SNE as a subsequent dimensionality reduction.
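One way to picture a quality metric for such layouts is a neighborhood check against the thematic classes. The sketch below is illustrative only and is not the metric set used in the evaluation: it assumes a layout given as a list of 2D points and one class label per document, and it reports the mean fraction of each point's k nearest layout neighbors that share its class. The function name `knn_class_purity`, the parameter `k`, and the toy data are hypothetical.

```python
from math import dist


def knn_class_purity(points, labels, k=3):
    """Mean fraction of each point's k nearest layout neighbors sharing its class.

    Illustrative local-structure score (an assumption, not the paper's metric):
    1.0 means every point is surrounded only by documents of its own class.
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")

    total = 0.0
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to point i in the layout.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        # Count how many of the k closest neighbors carry the same class label.
        same = sum(labels[j] == labels[i] for j in others[:k])
        total += same / k
    return total / len(points)


# Hypothetical toy layout: two well-separated groups of three documents each.
layout = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
classes = ["a", "a", "a", "b", "b", "b"]
mixed = ["a", "b", "a", "b", "a", "b"]

separated_score = knn_class_purity(layout, classes, k=2)
mixed_score = knn_class_purity(layout, mixed, k=2)
```

With labels that match the two spatial groups, the score reaches its maximum; with labels that alternate within each group, it drops. A score of this kind only captures local structure, so a benchmark would pair it with measures of global structure and perceptual quality.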

Source
http://dx.doi.org/10.1109/TVCG.2023.3326569

