Augmented-syllabification of n-gram tagger for Indonesian words and named-entities.

Suyanto Suyanto, Andi Sunyoto, Rezza Nafi Ismail, Ade Romadhony, Febryanti Sthevanie

Heliyon

School of Computing, Telkom University, Bandung, Indonesia.

Published: November 2022

Plain n-gram syllabification gives a high syllable error rate (SER) for Bahasa Indonesia because of its high out-of-vocabulary (OOV) rate; the earlier BFO and CBSPS models reduce the OOV rate through augmentation but apply the n-gram at the syllable level and depend on vowel and diphthong detection.
The proposed method, ASnGT, applies an n-gram tagger at the grapheme level, drops the reliance on vowel and diphthong detection, and is significantly better than both BFO and CBSPS in 5-fold cross-validation.
Despite its advantages for formal words and named-entities, ASnGT still has difficulty distinguishing derivative words from absorption words and foreign-language terms.

As a statistical model, n-gram syllabification commonly gives a high syllable error rate (SER) for Bahasa Indonesia, a low-resource language, because it fails when the out-of-vocabulary (OOV) rate is high. Two previous models, bigram-syllabification with flipping onsets (BFO) and a combination of bigram with backoff smoothing based on phonological similarity (CBSPS), use augmentation methods to reduce the OOV rate. However, BFO and CBSPS share two problems. First, they apply the n-gram at the syllable level instead of the grapheme level, so they suffer from the sparsity of n-grams. Second, they rely on a procedure to detect the positions of vowels and diphthongs. Both problems make them unable to distinguish diphthongs in derivative words or to syllabify named-entities, which have many ambiguities related to vowels and semi-vowels. In this paper, a syllabification model based on an n-gram tagger, which is applied at the grapheme level and does not rely on vowel or diphthong detection, is developed to solve both problems. In addition, three data augmentation methods are exploited to enrich the dataset. The 5-fold cross-validations (5-FCV) on two datasets, one of 50 k words and one of 15 k named-entities, show that the proposed augmented-syllabification of n-gram tagger (ASnGT) model is significantly better than both BFO and CBSPS. It is also significantly better than the fuzzy k-nearest neighbor in every class (FkNNC)-based model for formal words and named-entities. However, it struggles with derivative words, which it cannot easily distinguish from absorption words and foreign-language terms. It also fails on some foreign named-entities.
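To make the grapheme-level idea concrete, the following is a minimal sketch of syllabification framed as tagging: each grapheme receives a tag saying whether a syllable break follows it, and the tag is predicted from a window of surrounding graphemes, backing off to narrower windows when the wide one is unseen. The abstract does not specify the tagger's features, tag set, or backoff scheme, so the tag names (`B`, `N`), the symmetric window, the `max_n` parameter, and the class `GraphemeTagger` are all assumptions made for illustration, not the ASnGT implementation. Note that no vowel or diphthong detection is involved; the model only counts grapheme contexts.

```python
from collections import Counter, defaultdict


def to_tags(syllables):
    """Convert ['ma', 'kan'] into graphemes and per-grapheme tags.

    Tag 'B' means a syllable break follows the grapheme, 'N' means none.
    (Hypothetical tag set chosen for this sketch.)
    """
    graphemes, tags = [], []
    for i, syl in enumerate(syllables):
        for j, ch in enumerate(syl):
            graphemes.append(ch)
            ends_syllable = j == len(syl) - 1
            is_last_syllable = i == len(syllables) - 1
            tags.append("B" if ends_syllable and not is_last_syllable else "N")
    return graphemes, tags


def contexts(graphemes, i, max_n):
    """Yield grapheme windows centred on position i, widest first."""
    padded = ["#"] * max_n + graphemes + ["#"] * max_n
    c = i + max_n
    for k in range(max_n, -1, -1):
        yield tuple(padded[c - k : c + k + 1])


class GraphemeTagger:
    """Count-based grapheme tagger with backoff to narrower windows."""

    def __init__(self, max_n=2):
        self.max_n = max_n
        self.counts = defaultdict(Counter)

    def train(self, syllabified_words):
        for syllables in syllabified_words:
            graphemes, tags = to_tags(syllables)
            for i, tag in enumerate(tags):
                for ctx in contexts(graphemes, i, self.max_n):
                    self.counts[ctx][tag] += 1

    def syllabify(self, word):
        graphemes = list(word)
        syllables, current = [], ""
        for i, ch in enumerate(graphemes):
            current += ch
            tag = "N"  # default when no context has been seen at all
            for ctx in contexts(graphemes, i, self.max_n):
                if ctx in self.counts:
                    tag = self.counts[ctx].most_common(1)[0][0]
                    break
            if tag == "B" and i < len(graphemes) - 1:
                syllables.append(current)
                current = ""
        syllables.append(current)
        return syllables


# Toy training set of three Indonesian words (illustrative only).
tagger = GraphemeTagger(max_n=2)
tagger.train([["ma", "kan"], ["ba", "tu"], ["ma", "ta"]])
print(tagger.syllabify("bata"))  # ['ba', 'ta'], a word not in the training set
```

The word "bata" is out of vocabulary for this toy model, yet it is split correctly because its narrower grapheme windows were seen in "batu" and "mata". This is the intuition behind working at the grapheme level: grapheme windows recur across words far more often than whole syllables do, which eases the sparsity problem the abstract attributes to syllable-level n-grams.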

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9708824
DOI: http://dx.doi.org/10.1016/j.heliyon.2022.e11922



