Assessing the utility of large language models for phenotype-driven gene prioritization in the diagnosis of rare genetic disease.

Am J Hum Genet

Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA.

Published: October 2024

AI Article Synopsis

  • Phenotype-driven gene prioritization is essential for diagnosing rare genetic disorders, and traditional methods use curated knowledge graphs for phenotype-gene links.
  • This study assessed five large language models (LLMs) to analyze their effectiveness in predicting genes based on phenotypic input, with GPT-4 showing the best performance but still underperforming compared to traditional tools.
  • Findings indicated that advanced techniques and complex prompts had mixed effects; overall, LLMs achieved better-than-random accuracy but favored predicting well-known genes, suggesting potential limitations for clinical applications.

Article Abstract

Phenotype-driven gene prioritization is fundamental to diagnosing rare genetic disorders. While traditional approaches rely on curated knowledge graphs with phenotype-gene relations, recent advancements in large language models (LLMs) promise a streamlined text-to-gene solution. In this study, we evaluated five LLMs, including two generative pre-trained transformer (GPT) series and three Llama2 series models, assessing their performance across task completeness, gene prediction accuracy, and adherence to required output structures. We conducted experiments exploring various combinations of models, prompts, phenotypic input types, and task difficulty levels. Our findings revealed that the best-performing LLM, GPT-4, achieved an average accuracy of 17.0% in identifying diagnosed genes within the top 50 predictions, which still falls behind traditional tools. However, accuracy increased with model size. Consistent results were observed over time, as shown in the dataset curated after 2023. Advanced techniques such as retrieval-augmented generation (RAG) and few-shot learning did not improve accuracy. Sophisticated prompts were more likely to enhance task completeness, especially in smaller models. Conversely, complicated prompts tended to decrease the output-structure compliance rate. LLMs also achieved better-than-random prediction accuracy with free-text input, though performance was slightly lower than with standardized concept input. Bias analysis showed that highly cited genes, such as BRCA1, TP53, and PTEN, are more likely to be predicted. Our study provides valuable insights into integrating LLMs with genomic analysis, contributing to the ongoing discussion of their utilization in clinical workflows.
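The abstract's headline metric is top-50 accuracy: the fraction of cases in which the diagnosed gene appears among a model's 50 highest-ranked gene predictions for a given set of phenotype terms. The sketch below is not the authors' code; it is a minimal illustration of that evaluation loop under stated assumptions. In particular, `query_llm` is a hypothetical stand-in for any chat-completion call (GPT-4, Llama2, etc.), and `build_prompt` and `parse_ranked_genes` are illustrative helpers, not the prompts or parsers used in the study.

```python
"""Minimal sketch (not the authors' code) of the evaluation idea described in
the abstract: prompt an LLM with phenotype terms and score whether the
diagnosed gene appears among its top-k ranked gene predictions."""

from typing import Callable, Iterable


def build_prompt(hpo_terms: Iterable[str], n_genes: int = 50) -> str:
    """Compose a phenotype-to-gene prompt from standardized HPO concept names."""
    phenotype_list = "; ".join(hpo_terms)
    return (
        f"A patient presents with the following phenotypes: {phenotype_list}. "
        f"List the {n_genes} most likely causal genes, one gene symbol per line, "
        f"ordered from most to least likely. Output gene symbols only."
    )


def parse_ranked_genes(llm_output: str) -> list[str]:
    """Extract an ordered list of gene symbols from the model's raw text output."""
    return [line.strip().upper() for line in llm_output.splitlines() if line.strip()]


def top_k_accuracy(
    cases: list[tuple[list[str], str]],   # (HPO term names, diagnosed gene) per case
    query_llm: Callable[[str], str],      # hypothetical LLM call returning raw text
    k: int = 50,
) -> float:
    """Fraction of cases whose diagnosed gene appears in the model's top-k list."""
    hits = 0
    for hpo_terms, diagnosed_gene in cases:
        ranked = parse_ranked_genes(query_llm(build_prompt(hpo_terms, k)))
        if diagnosed_gene.upper() in ranked[:k]:
            hits += 1
    return hits / len(cases) if cases else 0.0
```

The same loop works for free-text phenotype descriptions by swapping the HPO term list for a narrative string, which mirrors the abstract's comparison of standardized concept input versus free-text input.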

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11480789
DOI: http://dx.doi.org/10.1016/j.ajhg.2024.08.010

Publication Analysis

Top Keywords

large language (8)
language models (8)
phenotype-driven gene (8)
gene prioritization (8)
rare genetic (8)
task completeness (8)
prediction accuracy (8)
accuracy (5)
assessing utility (4)
utility large (4)
