NCBI disease corpus: a resource for disease name recognition and concept normalization.

J Biomed Inform

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Published: February 2014

Information encoded in natural language in biomedical publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information; however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a preliminary step to manual annotation. Fourteen annotators were randomly paired, and differing annotations were discussed to reach a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to ensure corpus-wide consistency. The public release of the NCBI disease corpus contains 6892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. To help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets.
To demonstrate its utility, we conducted a benchmarking experiment in which we compared three different knowledge-based disease normalization methods; the best performance was an F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state of the art in disease name recognition and normalization research by providing a high-quality gold standard, thus enabling the development of machine-learning-based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/.
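The F-measure reported for the benchmarking experiment can be illustrated with a minimal sketch. The abstract does not state how the score was computed, so the following assumes micro-averaging over per-abstract sets of concept identifiers; the function name `micro_prf` and the PMIDs and concept IDs in the usage example are hypothetical placeholders, not values from the corpus.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    gold: Dict[str, Set[str]],
    pred: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-measure.

    Assumption (not taken from the paper): each abstract is represented
    by the set of concept IDs annotated in it, keyed by PMID, and counts
    are pooled over all abstracts before computing the scores.
    """
    tp = fp = fn = 0
    # Visit every abstract that appears in either the gold or predicted set.
    for pmid in gold.keys() | pred.keys():
        g = gold.get(pmid, set())
        p = pred.get(pmid, set())
        tp += len(g & p)   # concepts correctly predicted
        fp += len(p - g)   # predicted but not in the gold standard
        fn += len(g - p)   # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical usage with placeholder identifiers.
gold_example = {"pmid_1": {"C1", "C2"}, "pmid_2": {"C3"}}
pred_example = {"pmid_1": {"C1"}, "pmid_2": {"C3", "C4"}}
p, r, f = micro_prf(gold_example, pred_example)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Comparing sets per abstract means that repeated mentions of the same concept count once, which suits normalization; evaluating mention recognition would instead compare text spans.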

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951655
DOI: http://dx.doi.org/10.1016/j.jbi.2013.12.006


