Parallel texts dataset for Uzbek-Kazakh machine translation.

Data Brief

National University of Uzbekistan named after Mirzo Ulugbek, Universitet Street, 4, Olmazor district, 100174, Tashkent city, Uzbekistan.

Published: April 2024

This paper presents an Uzbek-Kazakh parallel corpus of raw texts as a dataset for machine translation applications, covering the data collection process, the dataset description, and its potential for reuse. The dataset was built in three stages. The first stage used a small amount of parallel data that was already available. The second stage added texts compiled from openly available resources, such as literary books and web news covering a wide range of topics and genres, which were aligned with a sentence alignment method. In the third stage, which supplied the majority of the dataset, a raw text corpus in Uzbek was manually translated into Kazakh by a group of experts fluent in both languages. The resulting parallel corpus is a valuable resource for researchers and practitioners working on Kazakh and Uzbek language processing tasks, particularly machine translation: the data can be used to test rule-based machine translation models and to train statistical and neural machine translation models. The dataset is available on the Hugging Face platform, a widely used repository that supports collaboration on Natural Language Processing (NLP) applications. This combination of methods for obtaining a parallel corpus can serve as a model for other low-resource Turkic languages.
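As a rough illustration of how sentence-aligned pairs such as those described above might be handled before training or testing, the following minimal sketch reads tab-separated Uzbek-Kazakh pairs and drops pairs whose character lengths differ implausibly. The TSV layout, the names `SentencePair`, `read_pairs`, and `plausible`, the threshold of 2.5, and the sample sentences are all assumptions made for illustration; the abstract does not specify the dataset's file format or the alignment method the authors used.

```python
import csv
import io
from typing import List, NamedTuple


class SentencePair(NamedTuple):
    """One aligned sentence pair (field names are hypothetical)."""
    uz: str  # Uzbek side
    kk: str  # Kazakh side


def read_pairs(tsv_text: str) -> List[SentencePair]:
    """Parse 'uzbek<TAB>kazakh' lines; rows without exactly two columns are skipped."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [SentencePair(row[0].strip(), row[1].strip()) for row in reader if len(row) == 2]


def plausible(pair: SentencePair, max_ratio: float = 2.5) -> bool:
    """Crude length-ratio check: reject empty sides and very lopsided pairs.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    a, b = len(pair.uz), len(pair.kk)
    if a == 0 or b == 0:
        return False
    return max(a, b) / min(a, b) <= max_ratio


# Illustrative sample only; the third row is deliberately misaligned.
SAMPLE = (
    "Salom, dunyo!\tСәлем, әлем!\n"
    "Rahmat.\tРақмет.\n"
    "Bu juda uzun gap, lekin tarjimasi juda qisqa.\tИә\n"
)

pairs = read_pairs(SAMPLE)
kept = [p for p in pairs if plausible(p)]
```

A length-ratio filter of this kind is only a sanity check on already aligned data; it does not replace a sentence alignment method or the manual translation step described in the abstract.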

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10904177 (PMC)
http://dx.doi.org/10.1016/j.dib.2024.110194 (DOI)

Publication Analysis

Top Keywords

machine translation: 20
parallel corpus: 12
language processing: 8
neural machine: 8
translation models: 8
parallel: 5
dataset: 5
machine: 5
translation: 5
parallel texts: 4
