Grammar-aware phrase dataset generated using a novel python package.

Data Brief

Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai 600062, Tamil Nadu, India.

Published: June 2023

The past technique of manual dataset preparation was time-consuming and needed much effort. Another attempt to the data acquisition method was using web scraping. Such web scraping tools also produce a bunch of data errors. For this reason, we developed "Oromo-grammar" a novel Python package that accepts a raw text file from the user, extracts every possible root verb from the text, and stores the verbs into a Python list. Our algorithm then iterates over list of root verbs to form their corresponding list of stems. Finally, our algorithm synthesizes grammatical phrases using the appropriate affixations and personal pronouns. The generated phrase dataset can indicate grammatical elements like numbers, gender, and cases. The output is a grammar-rich dataset, which is applicable to modern NLP applications like machine translation, sentence completion, and grammar and spell checker. The dataset also helps linguists and academia in teaching language grammar structures. The method can easily be reproducible to any other language with a systematic analysis and slight modifications to its affix structures in the algorithm.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10293991PMC
http://dx.doi.org/10.1016/j.dib.2023.109237DOI Listing

Publication Analysis

Top Keywords

phrase dataset
8
novel python
8
python package
8
web scraping
8
dataset
5
grammar-aware phrase
4
dataset generated
4
generated novel
4
package technique
4
technique manual
4

Similar Publications

Ghadeer-speech-crowd-corpus: Speech dataset.

Data Brief

February 2025

Computer Science Department, College of Science, University of Baghdad, Iraq.

The availability of raw data is a considerable challenge across most branches of science. In the absence of data, neither experiments can be conducted nor development can be undertaken. Despite their importance, raw data are still lacking across many scientific fields.

View Article and Find Full Text PDF

This dataset was created to investigate the impact of data collection modes and pre-processing techniques on the quality of free comment data related to consumers' sensory perceptions. A total of 200 consumers were recruited and divided into two groups of 100. Each group evaluated six madeleine samples (five distinct samples and one replicate) in a sensory analysis laboratory, using different free comment data collection modes.

View Article and Find Full Text PDF

Background: Up to 13% of adolescents suffer from depressive disorders. Despite the high psychological burden, adolescents rarely decide to contact child and adolescent psychiatric services. To provide a low-barrier alternative, our long-term goal is to develop a chatbot for early identification of depressive symptoms.

View Article and Find Full Text PDF

Background: Ischemic heart disease is a leading cause of death globally with a disproportionate burden in low- and middle-income countries (LMICs). Natural language processing (NLP) allows for data enrichment in large datasets to facilitate key clinical research. We used NLP to assess gender differences in symptoms and management of patients hospitalized with acute myocardial infarction (AMI) at Aga Khan University Hospital-Pakistan.

View Article and Find Full Text PDF

Background: Diagnosing rare genetic disorders relies on precise phenotypic and genotypic analysis, with the Human Phenotype Ontology (HPO) providing a standardized language for capturing clinical phenotypes. Traditional HPO tools, such as Doc2HPO and ClinPhen, employ concept recognition to automate phenotype extraction but struggle with incomplete phenotype assignment, often requiring intensive manual review. While large language models (LLMs) hold promise for more context-driven phenotype extraction, they are prone to errors and "hallucinations," making them less reliable without further refinement.

View Article and Find Full Text PDF

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!