Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis.

Asia Pac J Ophthalmol (Phila)

Retina Division, Wilmer Eye Institute, Johns Hopkins University, Baltimore, MD 21287, USA.

Published: October 2024

AI Article Synopsis

  • This study evaluated how accurately large language models (LLMs) such as ChatGPT and Bing Chat answer ophthalmology board-style questions.
  • Across the 14 included studies, most tested the models on multiple ophthalmology topics, and ChatGPT-4 was the most accurate model.
  • Overall, the models answered 65% of questions correctly, performing best in "pathology" and worst in the fundamentals and principles of ophthalmology.

Article Abstract

Purpose: To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions.

Design: Meta-analysis.

Methods: A literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. For each question set in each study, data on LLM performance, including the number of questions submitted and correct responses generated, were extracted. Pooled accuracy was calculated using a random-effects model. Subgroup analyses were performed based on the LLMs used and the specific ophthalmology topics assessed.
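The abstract does not specify the software or estimator behind the random-effects pooling. The sketch below is one plausible implementation, assuming a DerSimonian-Laird random-effects meta-analysis of logit-transformed proportions, a common choice for pooling accuracy data; the (correct, total) pairs in the studies list are hypothetical placeholders, not values from the included studies.

import math

# Hypothetical (correct answers, questions submitted) pairs, one per
# question set -- placeholders, not data from the meta-analysis.
studies = [(130, 200), (95, 150), (60, 80), (210, 320)]

Z = 1.96  # two-sided 95% normal critical value

def expit(x):  # inverse of the logit transform
    return 1 / (1 + math.exp(-x))

# Logit-transform each proportion; the approximate variance of a logit
# proportion is 1/successes + 1/failures.
effects, variances = [], []
for correct, total in studies:
    p = correct / total
    effects.append(math.log(p / (1 - p)))
    variances.append(1 / correct + 1 / (total - correct))

# Fixed-effect (inverse-variance) pooled estimate, needed for Cochran's Q.
w = [1 / v for v in variances]
fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

# DerSimonian-Laird estimate of the between-study variance tau^2.
q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(studies) - 1)) / c)

# Random-effects weights fold tau^2 into each study's variance.
w_re = [1 / (v + tau2) for v in variances]
mu = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
se = math.sqrt(1 / sum(w_re))

# Back-transform the pooled logit and its CI to the accuracy scale.
print(f"pooled accuracy: {expit(mu):.2f} "
      f"(95% CI: {expit(mu - Z * se):.2f}-{expit(mu + Z * se):.2f})")

The subgroup estimates reported in the results (per model or per topic) would follow from rerunning the same pooling on the corresponding subset of question sets.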

Results: Among the 14 studies retrieved, 13 (93%) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86%), 11 (79%), 4 (29%), and 4 (29%) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95% CI: 0.61-0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95% CI: 0.73-0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95% CI: 0.51-0.54). LLMs performed best in "pathology" (0.78 [95% CI: 0.70-0.86]) and worst in "fundamentals and principles of ophthalmology" (0.52 [95% CI: 0.48-0.56]).

Conclusions: The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat the top-performing models. Performance varied significantly across the specific ophthalmology topics tested. This inconsistency is a concern and highlights the need for future studies to include image-based ophthalmology board-style questions to examine the competency of LLMs more comprehensively.

Source
http://dx.doi.org/10.1016/j.apjo.2024.100106

Publication Analysis

Top Keywords

Keyword                       Count
ophthalmology board-style     20
answering ophthalmology       16
board-style questions         16
llms answering                12
accuracy llms                 12
pooled accuracy               12
ophthalmology topics          12
llms                          9
accuracy large                8
large language                8
