A comprehensive dataset for Arabic word sense disambiguation.

Data Brief

Computing and Applied Technology, College of Technological Innovation, Zayed University, UAE.

Published: August 2024

This data paper introduces a dataset for word sense disambiguation that covers one hundred polysemous words frequently used in Modern Standard Arabic. Each word has between 3 and 8 senses, for a total of 367 unique senses. Each sense is accompanied by ten example sentences that use the polysemous word in context, giving 3670 samples in total. The dataset is in Arabic, a language known for its rich morphology, complex syntax, and extensive polysemy. The sentences were collected between 1 April 2023 and 1 May 2023 from web sources spanning news, medicine, finance, and other domains, which makes the dataset applicable across diverse fields of Arabic Natural Language Processing (NLP). The dataset includes every sense of each word, including rare senses that seldom appear in real-world text, which helps mitigate bias when training models. Collection proceeded in stages: initial web scraping, manual sorting of the scraped sentences by word sense, and targeted searches by a human expert to fill in missing contextual sentences. Where online data for a rare sense was still lacking or insufficient, synthetic sentences were generated with GPT-3.5-turbo. Beyond word sense disambiguation, the dataset is also of value to researchers in other areas, including sentiment analysis.
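The counts in the abstract follow from simple arithmetic: 367 senses with ten sentences each give 3670 samples. The sketch below illustrates one way a reader might check those properties on a loaded copy of the data. It is not the authors' code, and the abstract does not describe the file format, so the `Sample` record and its field names (`word`, `sense_id`, `sentence`, `synthetic`) and the `check_balance` helper are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Figures stated in the abstract.
SENTENCES_PER_SENSE = 10
TOTAL_SENSES = 367
EXPECTED_SAMPLES = SENTENCES_PER_SENSE * TOTAL_SENSES  # 3670


@dataclass(frozen=True)
class Sample:
    """Hypothetical record layout; the real column names may differ."""
    word: str        # polysemous target word
    sense_id: str    # identifier of the intended sense (assumed field)
    sentence: str    # context sentence containing the word
    synthetic: bool  # True if generated rather than scraped (assumed field)


def check_balance(samples, per_sense=SENTENCES_PER_SENSE,
                  min_senses=3, max_senses=8):
    """Return a list of human-readable problems; empty if none are found."""
    # Count sentences for each (word, sense) pair.
    per_sense_counts = Counter((s.word, s.sense_id) for s in samples)
    # Count distinct senses for each word.
    senses_per_word = Counter(word for word, _ in per_sense_counts)

    problems = []
    for (word, sense_id), n in per_sense_counts.items():
        if n != per_sense:
            problems.append(f"{word}/{sense_id}: {n} sentences, expected {per_sense}")
    for word, n in senses_per_word.items():
        if not min_senses <= n <= max_senses:
            problems.append(f"{word}: {n} senses, expected {min_senses}-{max_senses}")
    return problems


# Toy data with placeholder strings (not taken from the dataset):
# one word, three senses, ten sentences per sense.
toy = [
    Sample("word_a", f"sense_{k}", f"placeholder sentence {i}", False)
    for k in range(3)
    for i in range(SENTENCES_PER_SENSE)
]
```

On the toy data, `check_balance(toy)` returns an empty list. On the full dataset one would additionally expect `len(samples) == EXPECTED_SAMPLES`.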

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11222923
DOI: http://dx.doi.org/10.1016/j.dib.2024.110591
