Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages.

Atabay Ziyaden, Amir Yelenov, Fuad Hajiyev, Samir Rustamov, Alexandr Pak

PeerJ Comput Sci

Kazakh-British Technical University, Almaty, Kazakhstan.

Published: March 2024

Background: In natural language processing (NLP), the development and success of advanced language models depend largely on the richness of available linguistic resources. Languages such as Azerbaijani, which is classified as low-resource, often face challenges arising from limited labeled datasets, which hinders effective model training.

Methodology: The primary objective of this study was to enhance the effectiveness and generalization capabilities of news text classification models using text augmentation techniques. We address the scarcity of labeled data for low-resource languages with translation-based augmentation, using the Facebook mBart50 model, the Google Translate API, and a combination of the two to expand the available training text.
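The translation-based augmentation described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' pipeline: the abstract does not state the translation direction or how the two engines are combined, so the code assumes round-trip translation through a pivot language and treats the "combination" as the union of the outputs of both engines. The `Translator` signature, the function names, and the toy engines are all hypothetical stand-ins; in practice each callable would wrap a real system such as mBart50 or the Google Translate API.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical signature: translate(text, src_lang, tgt_lang) -> translated text.
# A real callable would wrap mBart50 or the Google Translate API (assumption).
Translator = Callable[[str, str, str], str]


def round_trip(text: str, translate: Translator, src: str, pivot: str) -> str:
    """Translate src -> pivot -> src to obtain a paraphrase of `text`."""
    return translate(translate(text, src, pivot), pivot, src)


def augment(
    samples: Iterable[Tuple[str, str]],
    translators: List[Translator],
    src: str = "az",
    pivot: str = "en",
) -> List[Tuple[str, str]]:
    """Return the original (text, label) pairs plus new, de-duplicated paraphrases.

    Passing several translators yields the union of their paraphrases, which is
    one possible reading of "a combination" of engines (assumption).
    """
    originals = list(samples)
    augmented = list(originals)
    seen = {text for text, _ in originals}
    for text, label in originals:
        for translate in translators:
            paraphrase = round_trip(text, translate, src, pivot)
            if paraphrase not in seen:  # skip copies of existing samples
                seen.add(paraphrase)
                augmented.append((paraphrase, label))  # label is preserved
    return augmented


# Toy stand-ins so the sketch runs offline; they do not translate anything.
def toy_engine_a(text: str, src: str, tgt: str) -> str:
    return text.upper() if tgt == "en" else text.lower()


def toy_engine_b(text: str, src: str, tgt: str) -> str:
    return text  # identity: its round trip reproduces the input


if __name__ == "__main__":
    data = [("Salam Dunya", "news")]
    for text, label in augment(data, [toy_engine_a, toy_engine_b]):
        print(label, text)
```

Keeping the translation engines behind a plain callable means the same augmentation loop can be run with one engine, the other, or both, which mirrors the three configurations named in the abstract.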

Results: The experimental outcomes reveal a promising uptick in classification performance when models are trained on the augmented dataset compared with their counterparts using the original data. This investigation underscores the immense potential of combined data augmentation strategies to bolster the NLP capabilities of underrepresented languages. As a result of our research, we have published our labeled text classification dataset and pre-trained RoBERTa model for the Azerbaijani language.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11041965	PMC
http://dx.doi.org/10.7717/peerj-cs.1974	DOI Listing
