In this paper, we propose a compression-based anomaly detection method for time series and sequence data using a pattern dictionary. The proposed method learns complex patterns in a training data sequence and uses these learned patterns to detect potentially anomalous patterns in a test data sequence. The pattern dictionary method uses a measure of complexity of the test sequence as an anomaly score, which can be used to perform stand-alone anomaly detection. We also show that, when combined with a universal source coder, the proposed pattern dictionary yields a powerful atypicality detector that is equally applicable to anomaly detection. The pattern dictionary-based atypicality detector uses an anomaly score defined as the difference between the complexity of the test sequence when encoded by the trained pattern dictionary (typical) encoder and its complexity when encoded by the universal (atypical) encoder. We consider two complexity measures: the number of parsed phrases in the sequence and the length of the encoded sequence (codelength). Specializing to a particular type of universal encoder, the tree-structured Lempel-Ziv (LZ78) encoder, we obtain a novel non-asymptotic upper bound, in terms of the Lambert W function, on the number of distinct phrases resulting from the LZ78 parser. This non-asymptotic bound determines the range of the anomaly score. As a concrete application, we illustrate the pattern dictionary framework for constructing a baseline of health against which anomalous deviations can be detected.
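To make the phrase-count complexity measure concrete, the following Python sketch parses a sequence with the standard LZ78 rule and compares it against a frozen-dictionary parse. It is only an illustration under stated assumptions: the "typical" encoder here is a simplified stand-in (greedy longest match against phrases learned by LZ78 on the training sequence, with a single-symbol fallback), not the paper's pattern dictionary, and the function names `lz78_phrases`, `frozen_dictionary_phrase_count`, and `phrase_count_score` are hypothetical.

```python
def lz78_phrases(seq):
    """Parse seq with the standard LZ78 rule and return its phrases.

    Each phrase is the shortest prefix of the remaining input that is
    not yet in the dictionary. A trailing incomplete phrase (one that
    already exists in the dictionary) is kept as the last phrase.
    """
    dictionary = set()
    phrases = []
    current = ""
    for symbol in seq:
        current += symbol
        if current not in dictionary:
            dictionary.add(current)
            phrases.append(current)
            current = ""
    if current:
        phrases.append(current)
    return phrases


def frozen_dictionary_phrase_count(train, test):
    """Count phrases when test is parsed with a dictionary frozen after training.

    ASSUMPTION: a simplified stand-in for a trained 'typical' encoder.
    The dictionary is the set of LZ78 phrases of the training sequence;
    the test sequence is parsed greedily by longest match, and any
    symbol that starts no trained phrase is emitted as its own phrase.
    """
    trained = set(lz78_phrases(train))
    max_len = max(map(len, trained), default=1)
    count = 0
    i = 0
    while i < len(test):
        step = 1  # fallback: a single symbol forms one phrase
        for length in range(min(max_len, len(test) - i), 1, -1):
            if test[i:i + length] in trained:
                step = length
                break
        i += step
        count += 1
    return count


def phrase_count_score(train, test):
    """Difference between typical and universal phrase counts (illustrative)."""
    typical = frozen_dictionary_phrase_count(train, test)
    universal = len(lz78_phrases(test))
    return typical - universal


if __name__ == "__main__":
    train = "abc" * 6
    print(phrase_count_score(train, "abcabcabc"))  # resembles the training data
    print(phrase_count_score(train, "xyzzyxxzy"))  # unseen symbols
```

In this toy setting, a test sequence that resembles the training data is covered by a few long trained phrases and receives a lower score than a sequence built from unseen symbols, which the frozen dictionary can only parse one symbol at a time.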
Source | |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9407188 | PMC |
http://dx.doi.org/10.3390/e24081095 | DOI Listing |