Exploiting and assessing multi-source data for supervised biomedical named entity recognition.

Bioinformatics

Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK.

Published: July 2018

Motivation: Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such a scenario, the reported performance of these models may be optimistic, as the models may not necessarily generalize to independent corpora, resulting in potentially suboptimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed.

Results: Here we aggregated published corpora for the recognition of biomolecular entities (such as genes, RNA, proteins, variants, drugs and metabolites), identified entity class overlap and performed a leave-corpus-out cross-validation strategy to test the efficiency of existing models. We demonstrate that accuracies of models trained on individual corpora decrease substantially for recognition of the same biomolecular entity classes in independent corpora. This behavior is possibly due to limited generalizability of entity-class-related features captured by individual corpora (model 'overtraining'), which we investigated further at the orthographic level, as well as potential annotation standard differences. We show that the combined use of multi-source training corpora results in overall more generalizable models for named entity recognition, while achieving comparable individual performance. By performing learning-curve-based power analysis, we further identified that performance is often not limited by the quantity of the annotated data.
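The leave-corpus-out idea can be illustrated with a minimal Python sketch: each corpus is held out in turn, a model is trained on the remaining corpora, and it is scored on the held-out one. Everything below is an assumption made for illustration and is not taken from the paper: the corpus names, the toy sentences, the coarse orthographic "shape" feature, and the majority-label tagger, which is only a stand-in for a real supervised NER model.

```python
from collections import Counter
from typing import Dict, Iterator, List, Tuple

# A sentence is a list of (token, label) pairs; labels here are "GENE" or "O".
Sentence = List[Tuple[str, str]]


def leave_corpus_out(
    corpora: Dict[str, List[Sentence]],
) -> Iterator[Tuple[str, List[Sentence], List[Sentence]]]:
    """Yield (held_out_name, train_sentences, test_sentences) once per corpus."""
    for held_out in corpora:
        train = [
            sent
            for name, sents in corpora.items()
            if name != held_out
            for sent in sents
        ]
        yield held_out, train, corpora[held_out]


def orthographic_shape(token: str) -> str:
    """Map a token to a coarse shape: upper -> 'A', lower -> 'a', digit -> '0'."""
    return "".join(
        "A" if c.isupper() else "a" if c.islower() else "0" if c.isdigit() else "-"
        for c in token
    )


def train_shape_tagger(train: List[Sentence]) -> Dict[str, str]:
    """Toy stand-in model: remember the most frequent label for each token shape."""
    counts: Dict[str, Counter] = {}
    for sent in train:
        for token, label in sent:
            counts.setdefault(orthographic_shape(token), Counter())[label] += 1
    return {shape: c.most_common(1)[0][0] for shape, c in counts.items()}


def f1_score(model: Dict[str, str], test: List[Sentence], positive: str = "GENE") -> float:
    """Token-level F1 for the positive class; unseen shapes are tagged 'O'."""
    tp = fp = fn = 0
    for sent in test:
        for token, gold in sent:
            pred = model.get(orthographic_shape(token), "O")
            if pred == positive and gold == positive:
                tp += 1
            elif pred == positive:
                fp += 1
            elif gold == positive:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpora, invented purely for this example.
corpora = {
    "corpus_a": [[("BRCA1", "GENE"), ("is", "O"), ("mutated", "O")]],
    "corpus_b": [[("p53", "GENE"), ("binds", "O"), ("DNA", "O")]],
    "corpus_c": [[("TP53", "GENE"), ("and", "O"), ("p21", "GENE")]],
}

# One score per held-out corpus, each from a model that never saw that corpus.
scores = {
    held_out: f1_score(train_shape_tagger(train), test)
    for held_out, train, test in leave_corpus_out(corpora)
}
print(scores)
```

Comparing these held-out scores with scores obtained when the same corpus is used for both training and testing would expose the kind of generalization gap the abstract describes; a real experiment would substitute an actual NER model and entity-level evaluation.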

Availability and Implementation: Compiled primary and secondary sources of the aggregated corpora are available at https://github.com/dterg/biomedical_corpora/wiki and https://bitbucket.org/iAnalytica/bioner.

Supplementary Information: Supplementary data are available at Bioinformatics online.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6041968
DOI: http://dx.doi.org/10.1093/bioinformatics/bty152

Publication Analysis

Top Keywords

entity recognition: 16
named entity: 12
corpora: 8
training corpora: 8
independent corpora: 8
recognition biomolecular: 8
individual corpora: 8
recognition: 7
entity: 6
models: 5

