This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, a resource-constrained language environment. Despite the growth of NLP applications, the Uzbek language is still underrepresented, which underscores the importance of our work. Our dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. For practical application and experiments, the authors have also developed two algorithms that, using this dictionary, identify named entities in Uzbek-language texts. The authors further describe the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. This study not only provides an important dataset for future NER tasks in the Uzbek language, but also offers a methodological basis for the use of vocabulary-based or machine-learning NER in other low-resource languages (e.g., Karakalpak). The dataset and algorithms we have developed can be used to create applications such as improved chatbot systems, text mining applications, and other analytical tools for the Uzbek language, contributing to the development of these areas in the region for which such solutions are built.
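As a rough illustration of what vocabulary-based NER can look like, the sketch below performs greedy longest-match lookup of token sequences in a small gazetteer. It is not the authors' algorithm: the `GAZETTEER` entries, the labels `PER`, `LOC`, and `ORG`, and the function name `find_entities` are all assumptions made for this example, and the sketch ignores Uzbek suffixation, which a practical system would need to handle.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer (an assumption for this sketch):
# lowercase entity phrase as a token tuple -> entity label.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("toshkent",): "LOC",
    ("alisher", "navoiy"): "PER",
    ("jizzax", "politexnika", "instituti"): "ORG",
}

# Longest phrase in the gazetteer, measured in tokens.
MAX_LEN = max(len(key) for key in GAZETTEER)


def find_entities(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) spans via greedy longest match."""
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so multi-word names win.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                match = (i, i + n, GAZETTEER[key])
                break
        if match is not None:
            spans.append(match)
            i = match[1]  # continue after the matched phrase
        else:
            i += 1
    return spans


# Illustrative token list, not taken from the dataset.
tokens = ["Alisher", "Navoiy", "ko'chasi", "Toshkent", "shahrida"]
print(find_entities(tokens))  # [(0, 2, 'PER'), (3, 4, 'LOC')]
```

Exact matching keeps the sketch short. Inflected forms such as a place name carrying a case suffix would be missed unless the tokens are first stemmed or the lookup is relaxed.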
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11067374 | PMC |
http://dx.doi.org/10.1016/j.dib.2024.110413 | DOI Listing |
Data Brief
February 2025
Tashkent institute of textile and light industry, 5, Shoxdjaxon str., Tashkent city 100100, Uzbekistan.
In this study, the authors present a dataset for named entity recognition in the Uzbek language. The dataset consists of 2,000 sentences and 25,865 words, drawn from legal documents and hand-crafted sentences and annotated using the BIOES scheme. The authors also demonstrate an application of the dataset by training a language model using the CNN + LSTM architecture, which achieves high accuracy in NER tasks, with an F1 score of 90.
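For readers unfamiliar with BIOES, the scheme tags each token as Beginning, Inside, or End of a multi-token entity, as a Single-token entity, or as Outside any entity. The sketch below converts entity spans to BIOES tags. The function name `spans_to_bioes`, the span format, and the labels are assumptions for illustration and are not taken from the dataset described above.

```python
from typing import List, Tuple


def spans_to_bioes(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) spans to one BIOES tag per token.

    Spans are assumed to be non-overlapping and within range.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


# Five tokens: a three-token ORG, one non-entity token, a one-token LOC.
print(spans_to_bioes(5, [(0, 3, "ORG"), (4, 5, "LOC")]))
# ['B-ORG', 'I-ORG', 'E-ORG', 'O', 'S-LOC']
```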
Data Brief
June 2024
Jizzakh polytechnic institute, 4, Islom Karimov str., Jizzakh city, 130100, Uzbekistan.
Data Brief
April 2024
National University of Uzbekistan named after Mirzo Ulugbek, Universitet Street, 4, Olmazor district, 100174, Tashkent city, Uzbekistan.
This paper presents a parallel corpus of raw texts in the Uzbek and Kazakh languages as a dataset for machine translation applications, focusing on the data collection process, dataset description, and its potential for reuse. The dataset-building process includes three separate stages. It starts with a small portion of already available parallel data. More data was then compiled from openly available resources such as literary books and web news texts, which were aligned using the sentence alignment method and encompass a wide range of topics and genres. Finally, the majority of the dataset was taken from a raw text corpus in Uzbek and manually translated into Kazakh by a group of experts who are fluent in both languages.
J Mother Child
June 2023
Department of Obstetrics and Gynecology, South Brooklyn Health, Brooklyn, New York, USA.
Introduction: Assessing intentions, attitudes, and knowledge about breastfeeding among different language groups is important because the languages reflect cultural differences. We compared attitudes, subjective norms, perceived behavioural control, intentions, and knowledge of breastfeeding among mothers with the five most common preferred languages spoken at a New York City hospital.
Materials And Methods: This cross-sectional study surveyed women (n = 448) in the prenatal clinic and the post-partum unit of a New York City hospital.
J Exp Child Psychol
March 2024
Department of Applied Psychology and Human Development, University of Toronto, Toronto, ON M5S 1A1, Canada.
This study aimed to investigate the development of audiovisual speech perception in monolingual Uzbek-speaking and bilingual Uzbek-Russian-speaking children, focusing on the impact of language experience on audiovisual speech perception and the role of visual phonetic (i.e., mouth movements corresponding to phonetic/lexical information) and temporal (i.