Background: The ability to express the same meaning in different ways is a well-known property of natural language, and it is a major source of difficulty in natural language processing. Given the constant increase in published literature, curation and information extraction would benefit strongly from efficient automatic processes, for which corpora of sentences evaluated by experts are a valuable resource.
Results: Given our interest in applying such approaches to the curation of the biomedical literature, specifically the literature on gene regulation in microbial organisms, we decided to build a corpus with graded textual similarity, evaluated by curators and designed specifically for our purposes. Based on the predefined statistical power of future analyses, we defined features of the design, including sampling, selection criteria, balance, and size, among others. A non-fully crossed study design was applied: each pair of sentences was evaluated by 3 of a total of 7 annotators. The scale used in the semantic similarity assessment task within the Semantic Evaluation workshop (SEMEVAL) was adapted to our goals in four successive iterative sessions, with clear improvements in the agreed guidelines and in interrater reliability results. Alternatives for such a corpus evaluation have been widely discussed.
Conclusions: To the best of our knowledge, this is the first similarity corpus (a dataset of pairs of sentences for which human experts rate the semantic similarity of each pair) in this domain of knowledge. We have initiated its incorporation in our research towards high-throughput curation strategies based on natural language processing.
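The non-fully crossed design described in the Results, in which each sentence pair is rated by 3 of the 7 annotators, can be illustrated with a minimal sketch. The abstract does not say how annotators were allocated, so the rotation through all 3-annotator subsets below is an assumption, as are the annotator labels, the pair count, and the function names `assign_raters` and `rater_load`.

```python
from collections import Counter
from itertools import combinations, cycle


def assign_raters(n_pairs, raters, k=3):
    """Assign k raters to each sentence pair by cycling through all k-subsets.

    Hypothetical allocation scheme: the source only states that each pair
    was rated by 3 of 7 annotators, not how they were chosen.
    """
    subsets = cycle(combinations(raters, k))
    return [next(subsets) for _ in range(n_pairs)]


def rater_load(assignment):
    """Count how many sentence pairs each rater has been assigned."""
    return Counter(rater for group in assignment for rater in group)


# Hypothetical annotator labels and corpus size, for illustration only.
raters = [f"A{i}" for i in range(1, 8)]
assignment = assign_raters(70, raters)
load = rater_load(assignment)
```

With 7 annotators there are 35 distinct groups of 3, so any pair count that is a multiple of 35 gives every annotator the same workload under this rotation. A real allocation would also need to account for annotator availability and for the order in which pairs are presented.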
Download full-text PDF |
Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6532127 | PMC |
http://dx.doi.org/10.1186/s13326-019-0200-x | DOI Listing |