Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n-gram model or a recurrent neural network. The resulting system analyzes input text that has no word boundaries one token at a time, where a token can be a character or a byte, and uses the information gathered by the language model to determine whether a boundary must be placed at the current position. Our aim is to use this system as a preprocessing step for a microtext normalization system. This means that it needs to cope effectively with the data sparsity present in this kind of text. We also strove to surpass the performance of two readily available word segmentation systems: the well-known and accessible Word Breaker by Microsoft, and the Python module WordSegment by Grant Jenks. The results show that we have met our objectives, and we hope to continue to improve both the precision and the efficiency of our system in the future.
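The general idea described above (a beam search guided by a character-level language model) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `CharNGramLM`, the function `segment`, the add-k smoothing, the trigram order, the beam width, and the toy training sentence are all assumptions chosen for illustration. A recurrent neural network could stand in for the n-gram model as long as it exposes the same `logprob` interface.

```python
import math
from collections import defaultdict


class CharNGramLM:
    """Character n-gram model with add-k smoothing (illustrative assumption)."""

    def __init__(self, order=3, k=0.1):
        self.order = order
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def _context(self, history):
        # Pad on the left so that short histories still yield a full context.
        padded = "^" * (self.order - 1) + history
        return padded[len(padded) - (self.order - 1):]

    def train(self, text):
        self.vocab.update(text)
        for i, char in enumerate(text):
            self.counts[self._context(text[:i])][char] += 1

    def logprob(self, history, char):
        seen = self.counts.get(self._context(history), {})
        total = sum(seen.values())
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen characters
        return math.log((seen.get(char, 0) + self.k) / (total + self.k * vocab_size))


def segment(text, lm, beam_width=8):
    """Insert spaces into `text` using beam search over boundary decisions."""
    beams = [("", 0.0)]  # each hypothesis: (segmented prefix, log-probability)
    for char in text:
        candidates = []
        for hyp, score in beams:
            # Option 1: no boundary before this character.
            candidates.append((hyp + char, score + lm.logprob(hyp, char)))
            # Option 2: place a boundary first (never at the very start).
            if hyp:
                with_space = hyp + " "
                boundary_score = score + lm.logprob(hyp, " ")
                candidates.append(
                    (with_space + char, boundary_score + lm.logprob(with_space, char))
                )
        # Keep only the best hypotheses for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]


# Toy training text, far too small for real use.
lm = CharNGramLM(order=3)
lm.train("the cat sat on the mat and the dog sat on the log")
segmented = segment("thecatsat", lm)
print(segmented)
```

With a corpus this small the output is only indicative; the point of the sketch is the control flow, in which each input token spawns a "boundary" and a "no boundary" hypothesis and the language model score decides which ones survive in the beam.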
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6360409 | PMC |
| http://dx.doi.org/10.1002/asi.24082 | DOI Listing |
Sci Rep
January 2025
Department of Clinical Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.
We aimed to develop and evaluate Explainable Artificial Intelligence (XAI) for fetal ultrasound using actionable concepts as feedback to end-users, using a prospective cross-center, multi-level approach. We developed, implemented, and tested a deep-learning model for fetal growth scans using both retrospective and prospective data. We used a modified Progressive Concept Bottleneck Model with pre-established clinical concepts as explanations (feedback on image optimization and presence of anatomical landmarks) as well as segmentations (outlining anatomical landmarks).
Cardiovasc Diagn Ther
December 2024
Clinic for Diagnostic and Interventional Radiology, Heidelberg University Hospital, Heidelberg, Germany.
Background: Computed tomography pulmonary angiography (CTPA) is frequently performed in patients with pulmonary hypertension (PH) and may aid non-invasive estimation of pulmonary hemodynamics. We, therefore, investigated automated volumetry of intrapulmonary vasculature on CTPA, separated into core and peel fractions of the lung volume and its potential to differentially reflect pulmonary hemodynamics in patients with pre- and postcapillary PH.
Methods: A retrospective case-control study of 72 consecutive patients with PH according to the 2022 joint guidelines of the European Society of Cardiology and the European Respiratory Society who underwent right heart catheterization (RHC) and CTPA within 7 days between August 2013 and February 2016 at Thoraxklinik at Heidelberg University Hospital (Heidelberg, Germany) was conducted.
JMIR Form Res
January 2025
Centre for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, China.
Background: With the development of artificial intelligence (AI), medicine has entered the era of intelligent medicine, and various aspects, such as medical education and talent cultivation, are also being redefined. The cultivation of clinical thinking abilities poses a formidable challenge even for seasoned clinical educators, as offline training modalities often fall short in bridging the divide between current practice and the desired ideal. Consequently, there arises an imperative need for the expeditious development of a web-based database, tailored to empower physicians in their quest to learn and hone their clinical reasoning skills.
Genes (Basel)
December 2024
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China.
Background/objectives: Understanding the relationship between DNA sequences and gene expression levels is of significant biological importance. Recent advancements have demonstrated the ability of deep learning to predict gene expression levels directly from genomic data. However, traditional methods are limited by basic word encoding techniques, which fail to capture the inherent features and patterns of DNA sequences.
Dev Sci
March 2025
Laboratoire de Sciences Cognitives et de Psycholinguistique, Département d'Études Cognitives, ENS, EHESS, CNRS, PSL University, Paris, France.
Before they even talk, infants become sensitive to the speech sounds of their native language and recognize the auditory form of an increasing number of words. Traditionally, these early perceptual changes are attributed to an emerging knowledge of linguistic categories such as phonemes or words. However, there is growing skepticism surrounding this interpretation due to limited evidence of category knowledge in infants.