Grammar-aware phrase dataset generated using a novel python package.

Data Brief

Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai 600062, Tamil Nadu, India.

Published: June 2023

Manual dataset preparation is time-consuming and labor-intensive. Web scraping is an alternative data acquisition method, but web scraping tools also introduce many data errors. For this reason, we developed "Oromo-grammar", a novel Python package that accepts a raw text file from the user, extracts every possible root verb from the text, and stores the verbs in a Python list. Our algorithm then iterates over the list of root verbs to form the corresponding list of stems. Finally, the algorithm synthesizes grammatical phrases using the appropriate affixes and personal pronouns. The generated phrase dataset can indicate grammatical features such as number, gender, and case. The output is a grammar-rich dataset applicable to modern NLP applications such as machine translation, sentence completion, and grammar and spell checking. The dataset also helps linguists and academics teach grammatical structures. The method can be reproduced for other languages through systematic analysis and slight modifications to the affix structures in the algorithm.
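The three stages described above (root verb extraction, stem formation, phrase synthesis) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Oromo-grammar package's actual API: the function names, the lexicon-based extraction, and the English toy affix table are all hypothetical, chosen because the abstract does not list the Oromo affixes or describe how verbs are detected. The sketch also reads from a string where the package reads a raw text file.

```python
import re

# Hypothetical configuration for a toy English example; the package's real
# Oromo verb detection and affix tables are not given in the abstract.
KNOWN_ROOTS = {"walk", "talk", "jump"}  # assumed verb lexicon
AFFIXES = [                             # (personal pronoun, verb suffix) pairs
    ("I", ""),
    ("she", "s"),
    ("they", ""),
]


def extract_root_verbs(text, lexicon):
    """Return lexicon verbs found in the text, first-seen order, no duplicates."""
    roots = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in lexicon and token not in roots:
            roots.append(token)
    return roots


def form_stems(roots, infinitive_ending=""):
    """Strip an optional infinitive ending from each root to obtain its stem."""
    stems = []
    for root in roots:
        if infinitive_ending and root.endswith(infinitive_ending):
            stems.append(root[: -len(infinitive_ending)])
        else:
            stems.append(root)
    return stems


def synthesize_phrases(stems, affixes):
    """Combine every stem with every pronoun/suffix pair."""
    return [
        f"{pronoun} {stem}{suffix}"
        for stem in stems
        for pronoun, suffix in affixes
    ]


raw_text = "They walk and talk; we walk home."
roots = extract_root_verbs(raw_text, KNOWN_ROOTS)
stems = form_stems(roots)  # English toy: stem equals root
phrases = synthesize_phrases(stems, AFFIXES)
print(phrases)
```

Keeping the lexicon, the infinitive ending, and the affix table as plain data mirrors the abstract's claim that adapting the method to another language mainly requires changing the affix structures rather than the control flow.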

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10293991
DOI: http://dx.doi.org/10.1016/j.dib.2023.109237

Similar Publications

Ghadeer-speech-crowd-corpus: Speech dataset.

Data Brief

February 2025

Computer Science Department, College of Science, University of Baghdad, Iraq.

Ghadeer Qasim Ali, Husam Ali Abdulmohsin

The availability of raw data is a considerable challenge across most branches of science. Without data, experiments cannot be conducted and development cannot be undertaken. Despite their importance, raw data are still lacking in many scientific fields.

A dataset of annotated free comments on the sensory perception of madeleines for benchmarking text mining techniques.

Data Brief

February 2025

Oniris, INRAE, StatSC, 44300 Nantes, France.

Michel Visalli, Ronan Symoneaux, Cécile Mursic, Margaux Touret, Flore Lourtioux

This dataset was created to investigate the impact of data collection modes and pre-processing techniques on the quality of free comment data related to consumers' sensory perceptions. A total of 200 consumers were recruited and divided into two groups of 100. Each group evaluated six madeleine samples (five distinct samples and one replicate) in a sensory analysis laboratory, using different free comment data collection modes.

M.I.N.I.-KID interviews with adolescents: a corpus-based language analysis of adolescents with depressive disorders and the possibilities of continuation using Chat GPT.

Front Psychiatry

December 2024

Department of Information Science, University of Regensburg, Regensburg, Germany.

Irina Jarvers, Angelika Ecker, Pia Donabauer, Katharina Kampa, Maximilian Weißenbacher

Background: Up to 13% of adolescents suffer from depressive disorders. Despite the high psychological burden, adolescents rarely decide to contact child and adolescent psychiatric services. To provide a low-barrier alternative, our long-term goal is to develop a chatbot for early identification of depressive symptoms.

Identification of Gender Differences in Acute Myocardial Infarction Presentation and Management at Aga Khan University Hospital-Pakistan: Natural Language Processing Application in a Dataset of Patients With Cardiovascular Disease.

JMIR Form Res

December 2024

Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, United States.

Christine Ngaruiya, Zainab Samad, Salma Tajuddin, Zarmeen Nasim, Rebecca Leff

Background: Ischemic heart disease is a leading cause of death globally with a disproportionate burden in low- and middle-income countries (LMICs). Natural language processing (NLP) allows for data enrichment in large datasets to facilitate key clinical research. We used NLP to assess gender differences in symptoms and management of patients hospitalized with acute myocardial infarction (AMI) at Aga Khan University Hospital-Pakistan.

Improving Automated Deep Phenotyping Through Large Language Models Using Retrieval Augmented Generation.

medRxiv

December 2024

Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.

Brandon T Garcia, Lauren Westerfield, Priya Yelemali, Nikhita Gogate, E Andres Rivera-Munoz

Background: Diagnosing rare genetic disorders relies on precise phenotypic and genotypic analysis, with the Human Phenotype Ontology (HPO) providing a standardized language for capturing clinical phenotypes. Traditional HPO tools, such as Doc2HPO and ClinPhen, employ concept recognition to automate phenotype extraction but struggle with incomplete phenotype assignment, often requiring intensive manual review. While large language models (LLMs) hold promise for more context-driven phenotype extraction, they are prone to errors and "hallucinations," making them less reliable without further refinement.
