Augmented words to improve a deep learning-based Indonesian syllabification.

Suyanto Suyanto, Ade Romadhony, Febryanti Sthevanie, Rezza Nafi Ismail

Heliyon

School of Computing, Telkom University, Bandung, Indonesia.

Published: October 2021

Recent deep learning syllabification models perform well for high-resource languages but struggle with low-resource languages like Indonesian due to limited datasets.
The authors propose two procedures to improve model performance: massive data augmentation, which includes techniques such as transposing nuclei and swapping consonant-graphemes, and a phonotactic-based validation scheme.
Their findings show that data augmentation significantly enlarges the dataset and improves the model's accuracy, reducing the word error rate for both formal words and named entities in Indonesian.

Recent deep learning-based syllabification models generally give low error rates for high-resource languages with big datasets but sometimes produce high error rates for low-resource ones. This paper proposes two procedures, massive data augmentation and validation, to improve a deep learning-based syllabification model for Indonesian, a low-resource language. The model combines bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNN), and conditional random fields (CRF). The massive data augmentation comprises four methods: transposing nuclei, swapping consonant-graphemes, flipping onsets, and creating acronyms. The validation is implemented using a phonotactic-based scheme. A preliminary investigation on 50k Indonesian words shows that these augmentation methods significantly enlarge the dataset, by 12.8M words that are valid under the phonotactic rules. An evaluation using 5-fold cross-validation then shows that the augmentation methods significantly improve the BiLSTM-CNN-CRF model on the 50k formal-word and 100k named-entity datasets. A detailed investigation shows that augmenting the training set can reduce the word error rate (WER) arising from long formal words and named entities.
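
The evaluation described above (WER under 5-fold cross-validation) can be illustrated with a minimal Python sketch. It assumes that WER here means the fraction of words whose predicted syllabification differs from the reference in any way, and that folds are formed by a simple shuffled split; the abstract does not give these details. The function names `word_error_rate` and `k_fold_indices` and the four toy words are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
import random


def word_error_rate(predicted, reference):
    """Fraction of words whose predicted syllabification differs from the reference.

    Assumption: a word counts as one error if any syllable boundary is wrong.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    errors = sum(1 for p, r in zip(predicted, reference) if p != r)
    return errors / len(reference)


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so the folds are
    # disjoint and together cover all items.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train_idx, test_idx


# Toy data (illustrative only): syllable boundaries are marked with ".".
reference = ["ma.kan", "bu.ku", "se.ko.lah", "ber.ja.lan"]
predicted = ["ma.kan", "buk.u", "se.ko.lah", "ber.ja.lan"]

# One of the four words is syllabified incorrectly.
wer = word_error_rate(predicted, reference)
```

In a setup like the one the abstract describes, augmented words would presumably be added only to the training indices of each fold, so that the test folds contain original words only; this is an assumption about the protocol, not a detail stated in the abstract.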

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8511842
DOI: http://dx.doi.org/10.1016/j.heliyon.2021.e08115
