Gene name identification and normalization using a model organism database.

J Biomed Inform

MITRE Corporation, 202 Burlington Road, Mail Stop K325, Bedford, MA 01730-1420, USA.

Published: December 2004

Biology has now become an information science, and researchers are increasingly dependent on expert-curated biological databases to organize the findings from the published literature. We report here on a series of experiments related to the application of natural language processing to aid in the curation process for FlyBase. We focused on listing the normalized form of genes and gene products discussed in an article. We broke this into two steps: gene mention tagging in text, followed by normalization of gene names. For gene mention tagging, we adopted a statistical approach. To provide training data, we were able to reverse engineer the gene lists from the associated articles and abstracts, to generate text labeled (imperfectly) with gene mentions. We then evaluated the quality of the noisy training data (precision of 78%, recall 88%) and the quality of the HMM tagger output trained on this noisy data (precision 78%, recall 71%). In order to generate normalized gene lists, we explored two approaches. First, we explored simple pattern matching based on synonym lists to obtain a high recall/low precision system (recall 95%, precision 2%). Using a series of filters, we were able to improve precision to 50% with a recall of 72% (balanced F-measure of 0.59). Our second approach combined the HMM gene mention tagger with various filters to remove ambiguous mentions; this approach achieved an F-measure of 0.72 (precision 88%, recall 61%). These experiments indicate that the lexical resources provided by FlyBase are complete enough to achieve high recall on the gene list task, and that normalization requires accurate disambiguation; different strategies for tagging and normalization trade off recall for precision.
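The recall/precision trade-off described above can be illustrated with a minimal sketch of synonym-list matching followed by a filter. This is not the authors' system: the synonym table, the stoplist of ambiguous names, the example sentence, and the gold gene list are all hypothetical, chosen only to show how filtering ambiguous synonyms raises precision at the cost of recall. The `f_measure` helper is the standard balanced F-measure (harmonic mean of precision and recall), which is consistent with the figures reported in the abstract.

```python
import re

# Hypothetical synonym list: lowercased surface form -> normalized gene identifiers.
SYNONYMS = {
    "wingless": {"wg"},
    "wg": {"wg"},
    "white": {"w"},
    "period": {"per"},
}

# Hypothetical filter: synonyms that are also ordinary English words.
STOPLIST = frozenset({"white", "period"})


def match_genes(text, synonyms, stoplist=frozenset()):
    """Return normalized gene ids whose synonyms occur as tokens in the text."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    found = set()
    for tok in tokens:
        if tok in stoplist:
            continue  # skip ambiguous surface forms
        found |= synonyms.get(tok, set())
    return found


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score(predicted, gold):
    """Precision, recall and F-measure of a predicted gene list against a gold list."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Illustrative sentence and gold list (both assumptions, not data from the paper).
TEXT = "Expression of wingless persisted for a long period in white mutant flies."
GOLD = {"wg", "w"}

unfiltered = match_genes(TEXT, SYNONYMS)          # matches wg, w and the spurious per
filtered = match_genes(TEXT, SYNONYMS, STOPLIST)  # keeps wg only

print(score(unfiltered, GOLD))
print(score(filtered, GOLD))
```

In this toy case the unfiltered matcher finds every gold gene but also a spurious one, while the stoplist removes the spurious match together with a genuine mention of an ambiguous gene name, which mirrors in miniature the trade-off the experiments report.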

Source
http://dx.doi.org/10.1016/j.jbi.2004.08.010

Similar Publications

Congenital muscular dystrophies and myopathies: the leading cause of genetic muscular disorders in eleven Chinese families.

BMC Musculoskelet Disord

January 2025

Medical Genetic Diagnosis and Therapy Center, Fujian Maternity and Child Health Hospital, College of Clinical Medicine for Obstetrics and Gynecology and Pediatrics, Fujian Medical University, 18 Daoshan Road, Fuzhou, 350001, China.

Background: Congenital muscular dystrophies (CMDs) and myopathies (CMYOs) are a clinically and genetically heterogeneous group of neuromuscular disorders that share common features, such as muscle weakness, hypotonia, characteristic changes on muscle biopsy and motor retardation. In this study, we recruited eleven families with early-onset neuromuscular disorders in China, aiming to clarify the underlying genetic etiology.

Methods: Essential clinical tests, such as biomedical examination, electromyography and muscle biopsy, were applied to evaluate patient phenotypes.

Citronellol (CT) is a naturally occurring lipophilic monoterpenoid which has shown anticancer effects in numerous cancerous cell lines. This study was, therefore, designed to examine CT's potential as an anticancer agent against glioblastoma (GBM). Network pharmacology analysis was employed to identify potential anticancer targets of CT.

DisGeNet: a disease-centric interaction database among diseases and various associated genes.

Database (Oxford)

January 2025

School of Computer Science and Technology, Xidian University, 266 Xinglong Section of Xifeng Road, Xi'an, Shaanxi 710126, China.

The pathogenesis of complex diseases is intricately linked to various genes, and network medicine has enhanced the understanding of diseases. However, most network-based approaches ignore interactions mediated by noncoding RNAs (ncRNAs), and most databases focus only on the association between genes and diseases. To address these gaps, we have developed DisGeNet, a database that focuses not only on disease-associated genes but also on the interactions among genes.

Light is a vital regulator of photosynthesis, energy production, plant growth, and morphogenesis. Although these key physiological processes are well understood, the effects of light quality on the pigment content, oxidative stress, reactive oxygen species (ROS) production, antioxidant defense systems, and biomass yield of plants remain largely unexplored. In this study, we applied different light-emitting diode (LED) treatments, including white light, red light, blue light, and a red+blue (1:1) light combination, to evaluate the traits mentioned above in alfalfa ( L.

The paulownia tree belongs to the Paulowniaceae family. Paulownia is vigorous, adapts well to harsh environmental conditions, and can be used as a raw building material, for processing drugs, and for other purposes. Research on MYB transcription factors in the paulownia tree has rarely addressed resistance to abiotic stress.
