AI Article Synopsis

  • Recent deep learning syllabification models perform well for high-resource languages but struggle with low-resource languages like Indonesian due to limited datasets.
  • The authors propose two strategies to improve model performance: massive data augmentation, using techniques such as transposing nuclei and swapping consonant-graphemes, and phonotactic-based validation of the augmented words.
  • Starting from 50k Indonesian words, augmentation enlarges the dataset by 12.8M phonotactically valid words and improves the model, reducing the word error rate (WER) arising from long formal words and named entities.
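
The generate-then-validate idea in the synopsis can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the onset and coda inventories, the one-vowel syllable template (which ignores diphthongs), and the reading of "swapping consonant-graphemes" as replacing one consonant letter at a time. None of it is taken from the paper's actual rules.

```python
import re

CONSONANTS = "bcdfghjklmnprstwy"

# Assumed toy inventories, NOT the paper's phonotactic rules.
ALLOWED_ONSETS = {""} | set(CONSONANTS) | {"ng", "ny", "kh", "sy", "pr", "tr", "kr", "st"}
ALLOWED_CODAS = {""} | set(CONSONANTS) | {"ng"}

# Simplified template: onset consonants, exactly one vowel, coda consonants.
SYLLABLE = re.compile(r"^([^aiueo]*)([aiueo])([^aiueo]*)$")


def is_valid_syllable(syllable):
    """Accept a syllable only if its onset and coda are in the toy inventories."""
    match = SYLLABLE.match(syllable)
    return (
        bool(match)
        and match.group(1) in ALLOWED_ONSETS
        and match.group(3) in ALLOWED_CODAS
    )


def is_valid_word(syllables):
    """A syllabified word is valid if every one of its syllables is valid."""
    return all(is_valid_syllable(s) for s in syllables)


def swap_consonants(syllables):
    """Yield variants with one consonant letter replaced (hypothetical reading)."""
    word = ".".join(syllables)
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            for other in CONSONANTS:
                if other != ch:
                    yield (word[:i] + other + word[i + 1:]).split(".")


def augment(syllables):
    """Keep only the generated variants that pass the phonotactic check."""
    return [v for v in swap_consonants(syllables) if is_valid_word(v)]


variants = augment(["pra", "ja"])
print(len(variants), variants[:3])
```

In this sketch the validator does the filtering work: a variant such as `pba.ja` is generated but discarded because `pb` is not in the assumed onset inventory, while `tra.ja` is kept.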

Article Abstract

Recent deep learning-based syllabification models generally give low error rates for high-resource languages with large datasets but sometimes produce high error rates for low-resource ones. In this paper, two procedures, massive data augmentation and validation, are proposed to improve a deep learning-based syllabification model for Indonesian, a low-resource language. The model combines bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNN), and conditional random fields (CRF). The massive data augmentation comprises four methods: transposing nuclei, swapping consonant-graphemes, flipping onsets, and creating acronyms. The validation is implemented using a phonotactic-based scheme. A preliminary investigation on 50k Indonesian words shows that these augmentation methods significantly enlarge the dataset, adding 12.8M words that are valid under the phonotactic rules. An evaluation using 5-fold cross-validation then shows that the augmentation methods significantly improve the BiLSTM-CNN-CRF model on datasets of 50k formal words and 100k named entities. A detailed investigation shows that augmenting the training set can reduce the word error rate (WER) arising from long formal words and named entities.
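
The evaluation protocol mentioned in the abstract can be sketched as follows. This is a minimal Python illustration, assuming that WER counts a word as an error whenever its predicted syllabification differs anywhere from the reference; the abstract does not spell out the definition. The `train_and_predict` callable is a hypothetical stand-in for the BiLSTM-CNN-CRF model, and the dot-separated syllable notation is likewise an assumption.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def word_error_rate(predicted, reference):
    """Fraction of words whose predicted syllabification is not an exact match."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)


def cross_validate(words, gold, train_and_predict, k=5):
    """Average WER over k folds; each fold is held out once for testing."""
    rates = []
    for test_idx in k_fold_indices(len(words), k):
        held_out = set(test_idx)
        train = [(words[i], gold[i]) for i in range(len(words)) if i not in held_out]
        predictions = train_and_predict(train, [words[i] for i in test_idx])
        rates.append(word_error_rate(predictions, [gold[i] for i in test_idx]))
    return sum(rates) / k


# Hypothetical baseline that memorises training pairs and otherwise
# returns the word unsplit; a real model would replace this function.
def lookup_baseline(train, test_words):
    table = dict(train)
    return [table.get(w, w) for w in test_words]


words = ["makan", "minum", "tidur", "jalan", "bantu",
         "kursi", "lampu", "pintu", "rumah", "taman"]
gold = ["ma.kan", "mi.num", "ti.dur", "ja.lan", "ban.tu",
        "kur.si", "lam.pu", "pin.tu", "ru.mah", "ta.man"]
print(cross_validate(words, gold, lookup_baseline))
```

Under this setup, augmentation would be applied only to the training portion of each fold, so that the held-out words still measure performance on unseen data.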

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8511842
DOI: http://dx.doi.org/10.1016/j.heliyon.2021.e08115

Publication Analysis

Top Keywords

deep learning-based: 12
improve deep: 8
learning-based syllabification: 8
error rates: 8
massive data: 8
data augmentation: 8
augmentation methods: 8
augmented improve: 4
learning-based indonesian: 4
indonesian syllabification: 4
