AI Article Synopsis

  • n-gram syllabification gives a high syllable error rate (SER) for Bahasa Indonesia because of its high out-of-vocabulary (OOV) rate. The earlier BFO and CBSPS models reduce the OOV rate but apply the n-gram at the syllable level and depend on vowel and diphthong detection.
  • A new method, ASnGT, addresses both issues by applying an n-gram tagger at the grapheme level, without vowel or diphthong detection, and by enriching the dataset with three data augmentation methods. It performs significantly better than BFO and CBSPS.
  • Despite its advantages for formal words and named entities, ASnGT still struggles to distinguish derivative words from absorption words and foreign-language terms, and it fails on some foreign named entities.

Article Abstract

As a statistical model, n-gram syllabification commonly gives a high syllable error rate (SER) for Bahasa Indonesia, a low-resource language, because it fails when the out-of-vocabulary (OOV) rate is high. Two previous models, bigram syllabification with flipping onsets (BFO) and a combination of bigram with backoff smoothing based on phonological similarity (CBSPS), use augmentation methods to reduce the OOV rate. However, both BFO and CBSPS have two problems. First, they apply the n-gram at the syllable level instead of the grapheme level, so they suffer from n-gram sparsity. Second, they rely on a procedure that detects the positions of vowels and diphthongs. Both problems make them incapable of distinguishing diphthongs from derivative words and of syllabifying named entities, which have many ambiguities related to vowels and semi-vowels. In this paper, a syllabification model based on an n-gram tagger, which is applied at the grapheme level and relies on neither vowel nor diphthong detection, is developed to solve both problems. In addition, three data augmentation methods are used to enrich the dataset. Five-fold cross-validation (5-FCV) on two datasets, 50k words and 15k named entities, shows that the proposed augmented-syllabification of n-gram tagger (ASnGT) model is significantly better than both BFO and CBSPS. It is also significantly better than the fuzzy k-nearest neighbor in every class (FkNNC)-based model for formal words and named entities. However, it struggles with derivative words, which it cannot easily distinguish from absorption words and foreign-language terms. It also fails on some foreign named entities.
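The grapheme-level idea in the abstract can be illustrated with a minimal sketch. This is not the authors' ASnGT implementation: the B/I tag scheme, the symmetric grapheme window used as context, the simple backoff to narrower windows, and all names (`to_tags`, `window`, `GraphemeTagger`) are assumptions made for illustration. Each letter is treated as one grapheme, which ignores Indonesian digraphs such as "ng" and "ny", and no data augmentation is included.

```python
from collections import Counter, defaultdict


def to_tags(syllables):
    """Flatten a syllable list into graphemes and B/I tags (B = syllable start)."""
    graphemes, tags = [], []
    for syllable in syllables:
        for i, ch in enumerate(syllable):
            graphemes.append(ch)
            tags.append("B" if i == 0 else "I")
    return graphemes, tags


def window(graphemes, i, k):
    """Graphemes from position i-k to i+k, padded with '#' at the word edges."""
    padded = ["#"] * k + list(graphemes) + ["#"] * k
    return tuple(padded[i:i + 2 * k + 1])


class GraphemeTagger:
    """Hypothetical tagger: most frequent tag per grapheme window, with backoff."""

    def __init__(self, max_k=2):
        self.max_k = max_k
        # One count table per window radius k: context -> Counter of tags.
        self.tables = {k: defaultdict(Counter) for k in range(max_k, -1, -1)}

    def train(self, words):
        """words: iterable of syllable lists, e.g. [["ma", "kan"], ...]."""
        for syllables in words:
            graphemes, tags = to_tags(syllables)
            for i, tag in enumerate(tags):
                for k in self.tables:
                    self.tables[k][window(graphemes, i, k)][tag] += 1

    def tag(self, word):
        graphemes = list(word)
        tags = []
        for i in range(len(graphemes)):
            tag = "I"  # fallback when no context was ever seen
            # Try the widest window first, then back off to narrower ones.
            for k in range(self.max_k, -1, -1):
                counts = self.tables[k].get(window(graphemes, i, k))
                if counts:
                    tag = counts.most_common(1)[0][0]
                    break
            tags.append(tag)
        if tags:
            tags[0] = "B"  # the first grapheme always opens a syllable
        return tags

    def syllabify(self, word):
        syllables = []
        for ch, tag in zip(word, self.tag(word)):
            if tag == "B":
                syllables.append(ch)
            else:
                syllables[-1] += ch
        return syllables


tagger = GraphemeTagger(max_k=2)
tagger.train([["ma", "kan"], ["ba", "ca"], ["ma", "ta"], ["ka", "ki"]])
print(tagger.syllabify("makan"))  # ['ma', 'kan'] (a word seen in training)
```

Because the tagger decides on a boundary tag for each grapheme from its local context, it needs no separate step that locates vowels or diphthongs, and an unseen word can still be tagged by backing off to narrower windows.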

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9708824
DOI: http://dx.doi.org/10.1016/j.heliyon.2022.e11922

Publication Analysis

Top Keywords

  • n-gram tagger: 12
  • augmented-syllabification n-gram: 8
  • oov rate: 8
  • augmentation methods: 8
  • bfo cbsps: 8
  • named-entities: 5
  • tagger indonesian: 4
  • indonesian named-entities: 4
  • named-entities statistical-based: 4
  • statistical-based models: 4

Similar Publications

Captioning is the process of assembling a description for an image. Previous research on captioning has usually focused on foreground objects. Captioning involves two main kinds of objects: background objects and foreground objects.

Zika virus has caught the world's attention and has led people to share their opinions and concerns on social media such as Twitter. Using text-based features extracted with part-of-speech (POS) taggers and n-grams, a classifier was built to detect Zika-related tweets. With a simple logistic classifier, the system detected Zika-related tweets with 92% accuracy.

A large light-mass component of cosmic rays at 10^17-10^17.5 electronvolts from radio observations.

Nature

March 2016

Max-Planck-Institut für Radioastronomie, Auf dem Hügel 69, 53121 Bonn, Germany.

Cosmic rays are the highest-energy particles found in nature. Measurements of the mass composition of cosmic rays with energies of 10^17-10^18 electronvolts are essential to understanding whether they have galactic or extragalactic sources. It has also been proposed that the astrophysical neutrino signal comes from accelerators capable of producing cosmic rays of these energies.

Objective: In this study, presence of dentin infection in root canals, obturated with 4 techniques submitted to the bacterial leakage test, was evaluated using histologic methods.

Study Design: The canals of palatal roots of 160 molars were instrumented and divided into different groups, according to the obturation technique used (lateral condensation, MicroSeal system, Touch 'n Heat + Ultrafil, and Tagger's hybrid technique) and extent of the remaining obturation material (5 mm and 10 mm). Ten additional roots were used as control samples.
