Experimental study on short-text clustering using transformer-based semantic similarity measure.

PeerJ Comput Sci

Department of Mechanical and Industrial Engineering, American University of Ras Al Khaimah, Ras Al Khaimah, United Arab Emirates.

Published: May 2024

Sentence clustering plays a central role in many text-processing activities, and measuring the semantic similarity between sentences has received extensive attention. However, relatively little attention has been paid to evaluating clustering performance with similarity measures that adopt low-dimensional continuous representations. Such representations are crucial in sentence clustering, where traditional word co-occurrence representations often perform poorly on semantically similar sentences that share no common words. This article presents a new implementation that incorporates a sentence similarity measure based on embedding representations and uses it to evaluate three types of text clustering methods (partitional, hierarchical, and fuzzy clustering) on standard textual datasets. The measure derives its semantic information from pre-training models designed to simulate human knowledge about words in natural language. The article also compares the performance of the similarity measure when trained on two state-of-the-art pre-training models to investigate which yields better results. We argue that the superior performance of the selected clustering methods stems from their more effective use of the semantic information offered by this embedding-based similarity measure. Furthermore, we apply hierarchical clustering, the best-performing method, to a text summarization task and report the results. The implementation demonstrates that incorporating the sentence embedding measure leads to significantly improved performance in both text clustering and text summarization.
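The contrast the abstract draws between word co-occurrence and embedding-based similarity can be illustrated with a minimal sketch. This is not the authors' implementation: the sentences, the three-dimensional `embeddings`, and the helper names `jaccard`, `cosine`, and `average_linkage` are all hypothetical, and the vectors are hand-picked stand-ins for what a pre-trained transformer model would produce. The sketch assumes cosine similarity as the embedding measure and average-linkage agglomeration as the hierarchical method, neither of which the abstract specifies.

```python
import math


def jaccard(s1, s2):
    """Word-overlap similarity: shared words divided by all distinct words."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_linkage(vectors, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    members have the highest mean pairwise cosine similarity."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None  # (score, index_a, index_b) of the best pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical example sentences: each pair is a paraphrase with no shared words.
sentences = [
    "The physician treated the patient",
    "A doctor cared for someone ill",
    "Stock markets fell sharply today",
    "Share prices dropped this morning",
]

# HYPOTHETICAL toy embeddings, chosen by hand for illustration only.
# In the article's setting these would come from a pre-trained model.
embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]

print("word overlap:", jaccard(sentences[0], sentences[1]))
print("embedding cosine:", round(cosine(embeddings[0], embeddings[1]), 3))
print("clusters:", average_linkage(embeddings, 2))
```

Under these toy vectors, the two medical sentences have zero word overlap yet a high cosine similarity, so the clustering groups them together. A partitional or fuzzy method could consume the same similarity values; only the grouping rule would change.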

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11157522
DOI: http://dx.doi.org/10.7717/peerj-cs.2078

Publication Analysis

Top Keywords

similarity measure: 16
clustering: 12
semantic similarity: 8
sentence clustering: 8
performance similarity: 8
text clustering: 8
clustering methods: 8
hierarchical clustering: 8
pre-training models: 8
text summarization: 8

Similar Publications

Traditional sedatives like Propofol can lead to adverse effects. This study compares the safety and efficacy of Ciprofol monotherapy versus combined Propofol for painless gastroscopy. Patients who underwent painless gastroscopy at our hospital from January 2023 to December 2023 were studied.

Identifying influential nodes in real networks is important for studying and analyzing both the structural and the functional aspects of networks. VoteRank is a simple and effective algorithm for identifying high-spreading nodes. However, its accuracy and monotonicity are poor because the network topology is not taken into account.

In recent years, lithium-ion batteries have played a significant role in the EV industry due to their high specific energy density, power density, low self-discharge rate, and prolonged lifespan. Modeling the battery precisely and estimating its State of Charge (SOC) with great precision is essential to improving the performance of lithium-ion batteries. Though numerous methods have been proposed for estimating the SOC, an accurate estimation approach has not yet been proposed, since all these approaches consider the discrete-time dynamics of the battery.

Background: Hookah tobacco smoking is prevalent among youth and young adults. While health warning labels play a critical role in communicating the health risks of tobacco product use to consumers, compliance with US Federal Regulation's nicotine warning requirements on hookah tobacco packaging is low. Some labelling suggests that consumers are exposed to 'only 0.

Background: Same-day emergency care (SDEC) is an expanding area of hospital acute medical care. It aims to minimize delays and manage medical emergency patients within the same day, enabling hospitalization to be avoided; the expectation is that the patients would have required inpatient hospitalization in the absence of the SDEC service. Venous thromboembolism (VTE) prevention is a key medical inpatient safety measure.
