Background: The National Cancer Institute (NCI) Thesaurus provides reference terminology for NCI and other systems. Previously, we proposed a hybrid prototype that uses lexical features and role definitions of concepts in non-lattice subgraphs to identify missing IS-A relations in the NCI Thesaurus. However, that work included no domain expert evaluation. In this paper, we enhance the hybrid approach with a novel lexical feature, the roots of noun chunks within concept names, and formally evaluate the enhanced approach.
Method: We first compute all non-lattice subgraphs in the NCI Thesaurus. We model each concept using its role definitions together with the words and roots of noun chunks within its name and its ancestors' names. We then perform subsumption testing on candidate concept pairs in the non-lattice subgraphs to automatically detect potentially missing IS-A relations. Domain experts evaluated the validity of these relations.
Results: We applied our approach to the 19.08d version of the NCI Thesaurus. It identified 55 potentially missing IS-A relations, all of which were reviewed by domain experts. Of the 55, 29 (52.7%) were confirmed as valid by domain experts and have been incorporated into newer versions of the NCI Thesaurus. Seven of the 55 (12.7%) further revealed incorrect existing IS-A relations in the NCI Thesaurus.
Conclusions: The results show that our hybrid approach, which leverages lexical features and role definitions, is effective in identifying potentially missing IS-A relations in the NCI Thesaurus.
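The subsumption test described under Method can be illustrated with a minimal Python sketch. It is one plausible reading of the idea, not the authors' implementation: the `Concept` fields, the set-containment criterion, the function names, and the toy concepts and role names are all assumptions made for illustration. In the paper, the candidate pairs come from non-lattice subgraphs; here the caller is assumed to pass in the concepts of one such subgraph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """Hypothetical feature model of a concept (field names are assumptions)."""
    name: str
    words: frozenset = frozenset()  # words from its name and its ancestors' names
    roots: frozenset = frozenset()  # roots of noun chunks in those names
    roles: frozenset = frozenset()  # (role, target) pairs from its role definitions


def is_candidate_child(child: Concept, parent: Concept) -> bool:
    """True if `child` carries every feature of `parent` and at least one more.

    Assumed criterion: plain set containment on words, roots, and roles.
    """
    contains_all = (
        parent.words <= child.words
        and parent.roots <= child.roots
        and parent.roles <= child.roles
    )
    strictly_more = (
        parent.words < child.words
        or parent.roots < child.roots
        or parent.roles < child.roles
    )
    return contains_all and strictly_more


def suggest_missing_is_a(concepts, existing_is_a):
    """Return (child, parent) name pairs that pass the test but are not yet linked.

    `existing_is_a` is a set of (child_name, parent_name) pairs, assumed to
    already include inherited (transitive) IS-A relations.
    """
    suggestions = []
    for child in concepts:
        for parent in concepts:
            if child is parent:
                continue
            if (child.name, parent.name) in existing_is_a:
                continue
            if is_candidate_child(child, parent):
                suggestions.append((child.name, parent.name))
    return suggestions


# Toy data, invented for illustration only (not taken from the NCI Thesaurus).
carcinoma = Concept(
    name="Carcinoma",
    words=frozenset({"carcinoma"}),
    roots=frozenset({"carcinoma"}),
)
lung_carcinoma = Concept(
    name="Lung Carcinoma",
    words=frozenset({"lung", "carcinoma"}),
    roots=frozenset({"carcinoma"}),
    roles=frozenset({("has_site", "Lung")}),
)
melanoma = Concept(
    name="Melanoma",
    words=frozenset({"melanoma"}),
    roots=frozenset({"melanoma"}),
)

suggested = suggest_missing_is_a(
    [carcinoma, lung_carcinoma, melanoma], existing_is_a=set()
)
print(suggested)
```

Any pair suggested this way is only a candidate; as in the study, a domain expert would still need to confirm whether the IS-A relation is actually valid.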
Download full-text PDF |
Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7737275 | PMC |
http://dx.doi.org/10.1186/s12911-020-01289-6 | DOI Listing |
Nat Methods
January 2025
Department of Medicine, University of California San Diego, La Jolla, CA, USA.
Gene set enrichment is a mainstay of functional genomics, but it relies on gene function databases that are incomplete. Here we evaluate five large language models (LLMs) for their ability to discover the common functions represented by a gene set, supported by molecular rationale and a self-confidence assessment. For curated gene sets from Gene Ontology, GPT-4 suggests functions similar to the curated name in 73% of cases, with higher self-confidence predicting higher similarity.
Pharmacoepidemiol Drug Saf
October 2024
Department of Health Behavior and Policy, Virginia Commonwealth University, Richmond, Virginia, USA.
Background: The accuracy of administrative codes to capture patients with both primary biliary cholangitis (PBC) and cirrhosis could be challenging because of the potential for incorrect coding due to the old nomenclature "Primary Biliary Cirrhosis." Therefore, the aim of this study was to examine the positive predictive value (PPV) of International Classification of Diseases (ICD) codes for PBC and cirrhosis.
Methods: This was a retrospective cohort study using data from the VA Corporate Data Warehouse.
Background: Laryngeal malignancy, or "voice box" cancer, is uncommon, with 12,620 estimated new cases and 3770 deaths in the United States in 2021,1 and represents only 6.2% of all respiratory system malignancies.
Sci Rep
July 2024
Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada.
Brief Bioinform
July 2024
Department of Computer Science and Software Engineering, Auburn University, AL 36849, USA.
This manuscript describes the development of a resource module that is part of a learning platform named 'NIGMS Sandbox for Cloud-based Learning' (https://github.com/NIGMS/NIGMS-Sandbox). The module delivers learning materials on Cloud-based Consensus Pathway Analysis in an interactive format that uses appropriate cloud resources for data access and analyses.