Symbolic sequential data are produced in huge quantities in numerous contexts, such as text and speech data, biometrics, genomics, financial market indexes, music sheets, and online social media posts. In this paper, an unsupervised approach for chunking idiomatic units of sequential text data is presented. Text chunking refers to the task of splitting a string of textual information into non-overlapping groups of related units. This is a fundamental problem in numerous fields where understanding the relations between raw units of symbolic sequential data is relevant. Existing methods are based primarily on supervised and semi-supervised learning; in this study, a novel unsupervised approach is proposed, based on the existing concept of n-grams, that requires no labeled text as input. The proposed methodology is applied to two natural language corpora: a Wall Street Journal corpus and a Twitter corpus. In both cases, the corpus length was increased gradually to measure accuracy across different numbers of unitary elements given as input. Both corpora show improvements in accuracy proportional to the increase in the number of tokens; for the Twitter corpus, the increase in accuracy follows a linear trend. The results show that the proposed methodology achieves higher accuracy with incremental usage. A future study will aim to design an iterative system for the proposed methodology.
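The abstract does not spell out the algorithm, but an n-gram-based unsupervised chunker can be sketched from bigram statistics alone: adjacent tokens that co-occur far more often than chance (for example, by pointwise mutual information) are merged into a single chunk, with no labeled text required. The sketch below is a minimal illustration of that idea, not the authors' exact method; the function name chunk_by_pmi, the PMI threshold, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter

def chunk_by_pmi(sentences, threshold=1.0):
    """Greedy unsupervised chunking: merge adjacent tokens whose
    bigram pointwise mutual information (PMI) exceeds a threshold.
    Illustrative sketch only; threshold is a hand-picked assumption."""
    # Collect unigram and bigram counts from the tokenized corpus.
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values()) or 1

    def pmi(a, b):
        # PMI = log( p(a,b) / (p(a) * p(b)) ); -inf if the bigram is unseen.
        p_ab = bigrams[(a, b)] / total_bi
        if p_ab == 0:
            return float("-inf")
        return math.log(p_ab / ((unigrams[a] / total_uni) * (unigrams[b] / total_uni)))

    # Scan each sentence left to right, extending the current chunk
    # whenever the bigram PMI clears the threshold.
    chunked = []
    for tokens in sentences:
        chunks = [[tokens[0]]] if tokens else []
        for prev, cur in zip(tokens, tokens[1:]):
            if pmi(prev, cur) > threshold:
                chunks[-1].append(cur)   # strongly associated: same chunk
            else:
                chunks.append([cur])     # weakly associated: new chunk
        chunked.append(chunks)
    return chunked

# Tiny hypothetical corpus: "new york" recurs and should be merged.
corpus = [
    "new york stock exchange closed higher".split(),
    "the new york times reported gains".split(),
    "stock exchange volume rose in new york".split(),
]
for chunks in chunk_by_pmi(corpus, threshold=0.5):
    print(chunks)
```

The threshold controls chunk granularity: lower values merge more aggressively, and the abstract's observation that accuracy grows with corpus size is consistent with this design, since PMI estimates sharpen as token counts increase.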

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7279599 (PMC)
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0234214 (PLOS)

Publication Analysis

Top Keywords

proposed methodology (12)
idiomatic units (8)
units symbolic (8)
natural language (8)
approach chunking (8)
symbolic sequential (8)
sequential data (8)
unsupervised approach (8)
twitter corpus (8)
unsupervised acquisition (4)

Similar Publications

The identification of neoantigens is crucial for advancing vaccines, diagnostics, and immunotherapies. Despite this importance, a fundamental question remains: how to model the presentation of neoantigens by major histocompatibility complex class I molecules and the recognition of the peptide-MHC-I (pMHC-I) complex by T cell receptors (TCRs). Accurate prediction of pMHC-I binding and TCR recognition remains a significant computational challenge in immunology due to intricate binding motifs and the long-tail distribution of known binding pairs in public databases.

A comprehensive benchmarking for evaluating TCR embeddings in modeling TCR-epitope interactions.

Brief Bioinform

November 2024

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong, 999077, China.

The complexity of T cell receptor (TCR) sequences, particularly within the complementarity-determining region 3 (CDR3), requires efficient embedding methods for applying machine learning to immunology. While various TCR CDR3 embedding strategies have been proposed, the absence of systematic evaluation has created confusion in the community. Here, we extracted CDR3 embedding models from 19 existing methods and benchmarked them on four curated datasets by assessing their impact on the performance of TCR downstream tasks, including TCR-epitope binding affinity prediction, epitope-specific TCR identification, TCR clustering, and visualization analysis.

Background: Nasal high flow (NHF) has been proposed to sustain high intensity exercise in people with COPD, but we have a poor understanding of its physiological effects in this clinical setting.

Research Question: What is the effect of NHF during exercise on dynamic respiratory muscle function and activation, cardiorespiratory parameters, endurance capacity, dyspnoea, and leg fatigue, as compared to a control intervention?

Study Design And Methods: Randomized single-blind crossover trial including COPD patients.

Background: The implementation of large language models (LLMs), such as BART (Bidirectional and Auto-Regressive Transformers) and GPT-4, has revolutionized the extraction of insights from unstructured text. These advancements have expanded into health care, allowing analysis of social media for public health insights. However, the detection of drug discontinuation events (DDEs) remains underexplored.

Retrosynthesis is a strategy to analyze the synthetic routes for target molecules in medicinal chemistry. However, traditional retrosynthesis predictions performed by chemists and rule-based expert systems struggle to adapt to the vast chemical space of real-world scenarios. Artificial intelligence (AI) has revolutionized retrosynthesis prediction in recent decades, significantly increasing the accuracy and diversity of predictions for target compounds.
