Non-standardised early vernaculars pose a problem for search tools because titles, incipits, and explicits vary in orthography, syntax, and lexicon between manuscript copies of the same work. Traditional search methods that rely on exact string matching or regular expressions cannot address this variation comprehensively. This project presents a web-based search tool specifically designed to handle linguistic and textual variation. The software is made available as part of the IMEP. The search tool addresses variation by combining a database of incipits and explicits, character-based n-gram language models (LMs) built with the SRILM toolkit, and a fuzzy search script (IMEP: FSS) written in Python. The tool optimizes for recall: it retrieves multiple potential matches for a search string without attempting to identify the 'correct' one. A search looks up exact matches in the database while, in parallel, the fuzzy search script runs in two steps. First, it evaluates all incipits and explicits against a single LM built from the search string. Second, it matches the search string against LMs built from the incipits and explicits. Using SRILM to match the search string against a separate model for every incipit or explicit in the IMEP would take unreasonably long, so the first step, which needs only one LM, keeps processing time down. A web application, built using Django and Docker, combines the results of the direct database lookup and the fuzzy search script and presents them as a list, with exact matches first, followed by fuzzy matches ordered by increasing model perplexity. The tool is made available Open Access and can be adapted to other datasets.
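
The two-step procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the IMEP: FSS script and it does not call SRILM: it swaps in a small pure-Python character trigram model with crude add-one smoothing. The names `CharNgramModel`, `fuzzy_search`, and `shortlist_size` are hypothetical, and the assumption that step one produces a shortlist for step two is ours, not stated in the abstract.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a lowercased string into overlapping character n-grams, with padding."""
    padded = "^" * (n - 1) + text.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Toy character n-gram model (a stand-in for an SRILM-built LM)."""

    def __init__(self, text, n=3):
        self.n = n
        grams = char_ngrams(text, n)
        self.ngram_counts = Counter(grams)
        self.context_counts = Counter(g[:-1] for g in grams)
        # Characters seen in training, the end marker, and one slot for unseen characters.
        self.vocab_size = len(set(text.lower()) | {"$"}) + 1

    def perplexity(self, text):
        """Per-n-gram perplexity of `text` under this model (lower = closer match)."""
        grams = char_ngrams(text, self.n)
        log_prob = 0.0
        for g in grams:
            numerator = self.ngram_counts[g] + 1  # add-one smoothing
            denominator = self.context_counts[g[:-1]] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return math.exp(-log_prob / len(grams))


def fuzzy_search(query, entries, shortlist_size=50):
    """Return exact matches first, then fuzzy matches by increasing perplexity."""
    exact = [e for e in entries if e.lower() == query.lower()]
    candidates = [e for e in entries if e not in exact]

    # Step 1: build ONE model from the query and score every entry against it.
    query_model = CharNgramModel(query)
    shortlist = sorted(candidates, key=query_model.perplexity)[:shortlist_size]

    # Step 2: build one model per shortlisted entry and score the query against it.
    scored = sorted((CharNgramModel(e).perplexity(query), e) for e in shortlist)
    return exact + [e for _, e in scored]
```

Called as `fuzzy_search("here begynneth the tale of melibee", entries)`, the function returns spelling variants of the query ahead of unrelated entries. Step one is cheap because only one model is built; the per-entry models in step two are built only for the shortlist.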

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10808851
DOI: http://dx.doi.org/10.12688/openreseurope.16590.1
