Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets.

Shikhar Vashishth, Denis Newman-Griffis, Rishabh Joshi, Ritam Dutt, Carolyn P Rosé

J Biomed Inform

Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA.

Published: September 2021

Objectives: Biomedical natural language processing tools are increasingly being applied for broad-coverage information extraction: extracting medical information of all types from a scientific document or a clinical note. In such broad-coverage settings, linking mentions of medical concepts to standardized vocabularies requires choosing the best candidate concepts from large inventories covering dozens of types. This study presents a novel semantic type prediction module for biomedical NLP pipelines and two automatically constructed, large-scale datasets with broad coverage of semantic types.

Methods: We experiment with five off-the-shelf biomedical NLP toolkits on four benchmark datasets for medical information extraction from scientific literature and clinical notes. All toolkits adopt a staged approach of mention detection followed by two stages of medical entity linking: (1) generating a list of candidate concepts, and (2) picking the best concept among them. We introduce a semantic type prediction module to alleviate the problem of overgeneration of candidate concepts by filtering out irrelevant candidate concepts based on the predicted semantic type of a mention. We present MedType, a fully modular semantic type prediction model which we integrate into the existing NLP toolkits. To address the dearth of broad-coverage training data for medical information extraction, we further present WikiMed and PubMedDS, two large-scale datasets for medical entity linking.
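The filtering step described above can be illustrated with a minimal sketch. This is not the MedType implementation: the `Candidate` class, the `filter_by_type` and `link` functions, the toy concept identifiers, and the scores are all hypothetical, and the fallback to the unfiltered list when no candidate matches is an assumption made here so the example always returns a result.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Candidate:
    """A candidate concept proposed by a toolkit's candidate generator (hypothetical)."""
    concept_id: str      # toy identifier, not a real vocabulary code
    semantic_type: str   # semantic type attached to the concept
    score: float         # the toolkit's own ranking score


def filter_by_type(candidates: Iterable[Candidate],
                   predicted_types: Set[str]) -> List[Candidate]:
    """Keep only candidates whose semantic type was predicted for the mention."""
    candidates = list(candidates)
    kept = [c for c in candidates if c.semantic_type in predicted_types]
    # Assumption: if the filter removes everything, fall back to the full list.
    return kept if kept else candidates


def link(candidates: Iterable[Candidate],
         predicted_types: Set[str]) -> Optional[Candidate]:
    """Stage 2 of linking: pick the highest-scoring candidate after filtering."""
    filtered = filter_by_type(candidates, predicted_types)
    return max(filtered, key=lambda c: c.score) if filtered else None


# Toy candidates for the ambiguous mention "cold".
cands = [
    Candidate("TOY:cold-temperature", "Natural Phenomenon or Process", 0.7),
    Candidate("TOY:cold-illness", "Disease or Syndrome", 0.6),
]

# Pretend the type predictor returned this set for the mention in context.
best = link(cands, {"Disease or Syndrome"})
print(best.concept_id)  # TOY:cold-illness
```

Without the filter, the higher-scoring but irrelevant temperature sense would be selected; with it, the candidate list is narrowed before the toolkit's ranking is applied.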

Results: Semantic type filtering improves medical entity linking performance across all toolkits and datasets, often by several percentage points of F1. Further, pretraining MedType on our novel datasets achieves state-of-the-art performance for semantic type prediction in biomedical text.

Conclusions: Semantic type prediction is a key part of building accurate NLP pipelines for broad-coverage information extraction from biomedical text. We make our source code and novel datasets publicly available to foster reproducible research.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8952339
DOI: http://dx.doi.org/10.1016/j.jbi.2021.103880
