Comparing neural- and N-gram-based language models for word segmentation.

J Assoc Inf Sci Technol

Grupo LYS, Departamento de Computación, Facultade de Informática, Universidade da Coruña, Campus de Elviña, A Coruña 15071, Spain.

Published: February 2019

Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n-gram model or a recurrent neural network. The resulting system analyzes input text that has no word boundaries one token at a time, where a token can be a character or a byte, and uses the information gathered by the language model to determine whether a boundary must be placed at the current position. Our aim is to use this system as a preprocessing step for a microtext normalization system. This means that it needs to cope effectively with the data sparsity present in this kind of text. We also strove to surpass the performance of two readily available word segmentation systems: the well-known and accessible Word Breaker by Microsoft, and the Python module WordSegment by Grant Jenks. The results show that we have met our objectives, and we hope to continue to improve both the precision and the efficiency of our system in the future.
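The general idea of combining a beam search with a character-level language model can be illustrated with a minimal sketch. This is not the authors' implementation: the class `CharBigramLM`, the function `segment`, the add-one smoothing, the toy corpus, and the default beam width are all assumptions made for illustration, and a simple bigram model stands in for the n-gram or recurrent neural network models the abstract mentions. The sketch also only inserts boundaries; it does not delete existing ones.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character bigram model with add-one smoothing (illustrative stand-in)."""

    def __init__(self, text):
        self.vocab = set(text)
        # Count each character as a context (every position except the last).
        self.unigrams = Counter(text[:-1])
        self.bigrams = Counter(zip(text, text[1:]))

    def logprob(self, prev, char):
        # Add-one smoothing; the extra +1 reserves mass for unseen characters.
        num = self.bigrams[(prev, char)] + 1
        den = self.unigrams[prev] + len(self.vocab) + 1
        return math.log(num / den)


def segment(text, lm, beam_width=8, boundary=" "):
    """Insert boundaries into `text` using beam search over a character LM."""
    if not text:
        return ""
    # Each hypothesis is (cumulative log-probability, output string so far).
    beams = [(lm.logprob(boundary, text[0]), text[0])]
    for char in text[1:]:
        candidates = []
        for score, out in beams:
            prev = out[-1]
            # Option 1: no boundary before the current character.
            candidates.append((score + lm.logprob(prev, char), out + char))
            # Option 2: place a boundary, then the current character.
            split_score = (score
                           + lm.logprob(prev, boundary)
                           + lm.logprob(boundary, char))
            candidates.append((split_score, out + boundary + char))
        # Keep only the best `beam_width` hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]


# Toy usage: train on a tiny spaced corpus, then segment unspaced input.
corpus = "the cat sat on the mat " * 3
lm = CharBigramLM(corpus)
print(segment("thecatsat", lm))
```

With such a small corpus and a bigram context, the segmentation quality will be poor; the sketch is only meant to show how each input token produces two candidate continuations per hypothesis (boundary or no boundary) and how the language model score prunes them.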

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6360409
DOI: http://dx.doi.org/10.1002/asi.24082

Publication Analysis

Top Keywords

word segmentation: 12
language model: 8
word: 6
comparing neural-: 4
neural- n-gram-based: 4
language: 4
n-gram-based language: 4
language models: 4
models word: 4
segmentation word: 4

Similar Publications

We aimed to develop and evaluate Explainable Artificial Intelligence (XAI) for fetal ultrasound using actionable concepts as feedback to end-users, using a prospective cross-center, multi-level approach. We developed, implemented, and tested a deep-learning model for fetal growth scans using both retrospective and prospective data. We used a modified Progressive Concept Bottleneck Model with pre-established clinical concepts as explanations (feedback on image optimization and presence of anatomical landmarks) as well as segmentations (outlining anatomical landmarks).

Background: Computed tomography pulmonary angiography (CTPA) is frequently performed in patients with pulmonary hypertension (PH) and may aid non-invasive estimation of pulmonary hemodynamics. We, therefore, investigated automated volumetry of intrapulmonary vasculature on CTPA, separated into core and peel fractions of the lung volume and its potential to differentially reflect pulmonary hemodynamics in patients with pre- and postcapillary PH.

Methods: A retrospective case-control study of 72 consecutive patients with PH according to the 2022 joint guidelines of the European Society of Cardiology and the European Respiratory Society who underwent right heart catheterization (RHC) and CTPA within 7 days between August 2013 and February 2016 at Thoraxklinik at Heidelberg University Hospital (Heidelberg, Germany) was conducted.

Artificial Intelligence-Powered Training Database for Clinical Thinking: App Development Study.

JMIR Form Res

January 2025

Centre for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, China.

Background: With the development of artificial intelligence (AI), medicine has entered the era of intelligent medicine, and various aspects, such as medical education and talent cultivation, are also being redefined. The cultivation of clinical thinking abilities poses a formidable challenge even for seasoned clinical educators, as offline training modalities often fall short in bridging the divide between current practice and the desired ideal. Consequently, there arises an imperative need for the expeditious development of a web-based database, tailored to empower physicians in their quest to learn and hone their clinical reasoning skills.

TExCNN: Leveraging Pre-Trained Models to Predict Gene Expression from Genomic Sequences.

Genes (Basel)

December 2024

Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China.

Background/objectives: Understanding the relationship between DNA sequences and gene expression levels is of significant biological importance. Recent advancements have demonstrated the ability of deep learning to predict gene expression levels directly from genomic data. However, traditional methods are limited by basic word encoding techniques, which fail to capture the inherent features and patterns of DNA sequences.

Simulating Early Phonetic and Word Learning Without Linguistic Categories.

Dev Sci

March 2025

Laboratoire de Sciences Cognitives et de Psycholinguistique, Département d'Études Cognitives, ENS, EHESS, CNRS, PSL University, Paris, France.

Before they even talk, infants become sensitive to the speech sounds of their native language and recognize the auditory form of an increasing number of words. Traditionally, these early perceptual changes are attributed to an emerging knowledge of linguistic categories such as phonemes or words. However, there is growing skepticism surrounding this interpretation due to limited evidence of category knowledge in infants.
