Protocol for a reproducible experimental survey on biomedical sentence similarity.

Alicia Lara-Clares, Juan J Lastra-Díaz, Ana Garcia-Serrano

PLoS One

NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain.

Published: October 2021

Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, sentence similarity methods for the biomedical domain have attracted a lot of attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced, for reasons that include the copying of previous results without confirmation, the lack of source code and data to replicate both methods and experiments, and the lack of a detailed definition of the experimental setup. As a consequence of this reproducibility gap, the state of the problem cannot be elucidated, nor can new lines of research be soundly set. There are also other significant gaps in the literature on biomedical sentence similarity: (1) several unexplored sentence similarity methods that deserve to be studied have not been evaluated; (2) an unexplored benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR), has not been evaluated; (3) the impact of the pre-processing stage and of Named Entity Recognition (NER) tools on the performance of the sentence similarity methods has not been studied; and finally, (4) software and data resources for the reproducibility of methods and experiments in this line of research are lacking. Having identified these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest, most up-to-date, and, for the first time, reproducible experimental survey on biomedical sentence similarity. The survey will be based on our own software replication and on the evaluation of all the methods under study on the same software platform, which will be developed specifically for this work and will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a very detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
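
To make the pre-processing question raised in gap (3) concrete, the following is a minimal sketch of how a pre-processing choice can change a sentence similarity score. It is an illustration under stated assumptions, not the authors' software or one of their evaluated configurations: token-set Jaccard overlap is used here only as a simple baseline, the function names (tokenize, jaccard_similarity) are hypothetical, and the two example sentences are invented for the demonstration.

```python
import re


def tokenize(sentence, lowercase=True, strip_punctuation=True):
    """Return the set of tokens in a sentence, with optional pre-processing.

    Both options are illustrative assumptions, not settings from the protocol.
    """
    if lowercase:
        sentence = sentence.lower()
    if strip_punctuation:
        # Replace every character that is not a word character,
        # whitespace, or a hyphen with a space.
        sentence = re.sub(r"[^\w\s-]", " ", sentence)
    return set(sentence.split())


def jaccard_similarity(s1, s2, **preprocessing):
    """Token-set Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a = tokenize(s1, **preprocessing)
    b = tokenize(s2, **preprocessing)
    if not a and not b:
        return 1.0  # two empty sentences are treated as identical
    return len(a & b) / len(a | b)


# Invented example sentences (not taken from any benchmark).
s1 = "The protein BRCA1 regulates DNA repair."
s2 = "DNA repair is regulated by the BRCA1 protein."

with_prep = jaccard_similarity(s1, s2)
without_prep = jaccard_similarity(
    s1, s2, lowercase=False, strip_punctuation=False
)

print(f"with pre-processing:    {with_prep:.3f}")
print(f"without pre-processing: {without_prep:.3f}")
```

On this pair, lowercasing and punctuation stripping raise the overlap from 2 shared tokens out of 12 to 5 out of 9, which suggests why the pre-processing stage has to be fixed and reported for results to be comparable across methods.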

Full text:
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7990182
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0248663

Top Keywords: sentence similarity, biomedical sentence, similarity methods, experimental survey, similarity, sentence, reproducible experimental, biomedical, survey biomedical, biomedical domain

Similar Publications

Development, Reliability, and Concurrent Validity of the American Sign Language Version of the Computerized Revised Token Test.

J Speech Lang Hear Res

January 2025

Department of Communication Science and Disorders, University of Pittsburgh, PA.

Emily B Goldberg, Sheila R Pratt, Malcolm R McNeil, Neil Szuminsky, Kenneth DeHaan

Purpose: The present study assessed the test-retest reliability of the American Sign Language (ASL) version of the Computerized Revised Token Test (CRTT-ASL) and compared the differences and similarities between ASL and English reading by Deaf and hearing users of ASL.

Method: Creation of the CRTT-ASL involved filming, editing, and validating CRTT instructions, sentence commands, and scoring. Deaf proficient (DP), hearing nonproficient (HNP), and hearing proficient sign language users completed the CRTT-ASL and the English self-paced, word-by-word reading CRTT (CRTT-Reading-Word Fade [CRTT-R-wf]).

Development and internal validation of a diagnostic prediction model for life-threatening events in callers with shortness of breath: a cross-sectional study in out-of-hours primary care.

Br J Gen Pract

January 2025

University Medical Centre Utrecht, Department of General Practice & Nursing Sciences, Julius Center for Health Sciences and Primary Care, Utrecht, Netherlands.

Michelle Spek, Roderick P Venekamp, Anne A H de Hond, Esther de Groot, Geert-Jan Geersing

Aim: To develop and internally validate a model predicting life-threatening events for out-of-hours primary care callers with shortness of breath.

Method: This cross-sectional study includes data from 1,952 patients with shortness of breath who called out-of-hours primary care between September 2020 and August 2021. Four logistic regression models were developed with life-threatening events as the outcome.

Automated compilation of Urdu poetry handwritten image datasets for optical character recognition.

MethodsX

June 2025

Computer Science Department, Information Technology University of Punjab, Lahore, Pakistan.

Irtaza Ijaz, Abdallah Namoun, Nasser Aljohani, Meshari Huwaytim Alanazi, Mohammad N Alanazi

Optical character recognition (OCR) is vital for converting printed data into a digital format that can be conveniently used for various purposes. A significant amount of work has been done on OCR for well-resourced languages like English. However, languages like Urdu, spoken by a large community, face limitations in OCR due to a lack of resources and the complexity and diversity of handwritten scripts.

Informal Caregivers Connecting on the Web: Content Analysis of Posts on Discussion Forums.

JMIR Form Res

January 2025

School of Health Studies, Northern Illinois University, DeKalb, IL, United States.

Michelle L Foster, Chinenye Egwuonwu, Erin Vernon, Mohammad Alarifi, M Courtney Hughes

Background: About 53 million adults in the United States offer informal care to family and friends with disease or disability. Such care has an estimated economic value of US $600 million. Most informal caregivers are neither paid nor trained in caregiving, and many experience higher-than-average levels of stress and depression and lower levels of physical health.

The Beijing Sentence Corpus II: A cross-script comparison between traditional and simplified Chinese sentence reading.

Behav Res Methods

January 2025

Department of Sport and Health Sciences, University of Potsdam, Potsdam, Germany.

Ming Yan, Jinger Pan, Reinhold Kliegl

We introduce a sentence corpus with eye-movement data in traditional Chinese (TC), based on the original Beijing Sentence Corpus (BSC) in simplified Chinese (SC). The most noticeable difference between TC and SC character sets is their visual complexity. There are reaction time corpora in isolated TC character/word lexical decision and naming tasks.
