GPAD: a natural language processing-based application to extract the gene-disease association discovery information from OMIM.

BMC Bioinformatics

Departments of Biochemistry, Molecular Biology and Medical Genetics, Cumming School of Medicine, University of Calgary, Calgary, AB, T2N 4N1, Canada.

Published: February 2024

AI Article Synopsis

  • The study developed a tool called GPAD using Natural Language Processing (NLP) to extract gene-disease association data from the Online Mendelian Inheritance in Man (OMIM) database, which primarily consists of textual information that is hard to analyze automatically.
  • GPAD identifies when genes were linked to specific phenotypes and which validation methods supported those associations. It revealed a significant increase in discoveries after the introduction of exome sequencing, but a decline in the discovery rate over the past five years, apparently linked to the need for larger study cohorts.
  • The results of GPAD can help researchers track and manage gene-disease association discoveries in real time, with the potential for future enhancements to capture additional information from OMIM.

Article Abstract

Background: Thousands of genes have been associated with different Mendelian conditions. One valuable source for tracking these gene-disease associations (GDAs) is the Online Mendelian Inheritance in Man (OMIM) database. However, most of the information in OMIM is textual and heterogeneous (e.g. summarized by different experts), which complicates automated reading and understanding of the data. Here, we used Natural Language Processing (NLP) to build a tool, Gene-Phenotype Association Discovery (GPAD), that can syntactically process OMIM text and extract the data of interest.

Results: GPAD applies a series of language-based techniques to the text obtained from the OMIM API to extract GDA discovery-related information. GPAD can report when a particular gene was associated with a specific phenotype, as well as the type of validation (whether through model organisms or cohort-based patient-matching approaches) for such an association. GPAD-extracted data were validated against published reports and compared with the output of a large language model. Using GPAD's extracted data, we analysed trends in GDA discoveries, noting a significant increase in their rate after the introduction of exome sequencing, rising from an average of about 150 to about 250 discoveries each year. Contrary to hopes of resolving most GDAs for Mendelian disorders by now, our data indicate a substantial decline in discovery rates over the past five years (2017-2022). This decline appears to be linked to the increasing necessity for larger cohorts to substantiate GDAs. We also observed rising use of zebrafish and Drosophila as model organisms providing evidential support for GDAs.
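To make the kind of extraction described above concrete, here is a minimal Python sketch. It is not GPAD's implementation: the abstract does not describe the tool's actual rules, so the cue lists, the citation pattern ("Author et al. (year)"), the function names, and the sample passage are all assumptions invented for illustration. It finds the earliest cited year in a passage, flags model-organism and cohort cues, and tallies discoveries per year.

```python
import re
from collections import Counter

# Assumed cue lists for illustration only; not taken from the paper.
MODEL_ORGANISMS = ("zebrafish", "drosophila", "mouse", "mice")
COHORT_CUES = ("unrelated patients", "unrelated families", "cohort")

# Assumes citations in the text look like "Author et al. (2015)".
CITATION_YEAR = re.compile(r"\(((?:19|20)\d{2})\)")


def extract_discovery(text):
    """Return the earliest cited year and the evidence cues found in a passage."""
    years = [int(y) for y in CITATION_YEAR.findall(text)]
    lowered = text.lower()
    # Whole-word match so that e.g. "mice" does not match inside another word.
    organisms = [o for o in MODEL_ORGANISMS if re.search(rf"\b{o}\b", lowered)]
    cohort = any(cue in lowered for cue in COHORT_CUES)
    return {
        "first_year": min(years) if years else None,
        "model_organisms": organisms,
        "cohort_evidence": cohort,
    }


def discoveries_per_year(records):
    """Count extracted records by the year of their earliest citation."""
    return Counter(r["first_year"] for r in records if r["first_year"] is not None)


# Made-up passage in an OMIM-like style; gene and author names are placeholders.
SAMPLE = (
    "Doe et al. (2015) identified heterozygous variants in the GENE1 gene "
    "in 3 unrelated patients. Roe et al. (2018) found that knockdown of the "
    "ortholog in zebrafish reproduced the phenotype."
)

if __name__ == "__main__":
    record = extract_discovery(SAMPLE)
    print(record)
    print(discoveries_per_year([record]))
```

Keyword matching of this kind is brittle when the text is written by many different curators, which is presumably why the authors describe a syntactic, NLP-based approach instead.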

Conclusions: GPAD's capacity for real-time analysis offers an up-to-date view of GDA discovery and could help in planning and managing research strategies. In the future, this solution could be extended or modified to capture other information in OMIM and the scientific literature.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10898068
DOI: http://dx.doi.org/10.1186/s12859-024-05693-x
