Assessing the performance of Microsoft Copilot, GPT-4 and Google Gemini in ophthalmology.

Can J Ophthalmol

Faculty of Medicine, University of Montreal, Montreal, QC, Canada; Department of Ophthalmology, Centre Hospitalier de l'Université de Montréal, Montreal, QC, Canada.

Published: January 2025

Objective: To evaluate the performance of large language models (LLMs), specifically Microsoft Copilot, GPT-4 (GPT-4o and GPT-4o mini), and Google Gemini (Gemini and Gemini Advanced), in answering ophthalmological questions and assessing the impact of prompting techniques on their accuracy.

Design: Prospective qualitative study.

Participants: Microsoft Copilot, GPT-4 (GPT-4o and GPT-4o mini), and Google Gemini (Gemini and Gemini Advanced).

Methods: A total of 300 ophthalmological questions from StatPearls were tested, covering a range of subspecialties and image-based tasks. Each question was evaluated using 2 prompting techniques: zero-shot forced prompting (prompt 1) and combined role-based and zero-shot plan-and-solve+ prompting (prompt 2).
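The two prompting techniques named above can be sketched as prompt templates. This is a minimal illustration only: the study's verbatim prompt wording is not given in the abstract, so the phrasing below (and the four-option multiple-choice format) is an assumption based on the technique names "zero-shot forced" and "role-based plan-and-solve+".

```python
def zero_shot_forced(question: str, options: list[str]) -> str:
    """Prompt 1: zero-shot forced prompting -- the model must commit to one option."""
    choices = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"{question}\n{choices}\n"
        "Answer with a single letter (A-D) only. You must choose one option."
    )

def role_based_plan_and_solve(question: str, options: list[str]) -> str:
    """Prompt 2: combined role-based and zero-shot plan-and-solve+ prompting."""
    choices = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        "You are an experienced ophthalmologist.\n"  # role-based component
        f"{question}\n{choices}\n"
        # plan-and-solve+ component: plan first, execute stepwise, verify each step
        "Let's first understand the problem and devise a plan. Then carry out "
        "the plan step by step, checking each step, and give the final answer "
        "as a single letter (A-D)."
    )

print(zero_shot_forced(
    "Which retinal layer contains the photoreceptor cell bodies?",
    ["Ganglion cell layer", "Outer nuclear layer",
     "Inner plexiform layer", "Nerve fiber layer"],
))
```

The same question text is fed to both templates, so any accuracy difference between prompt 1 and prompt 2 is attributable to the prompting strategy rather than the question content.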

Results: With zero-shot forced prompting, GPT-4o demonstrated significantly superior overall performance, correctly answering 72.3% of questions and outperforming all other models, including Copilot (53.7%), GPT-4o mini (62.0%), Gemini (54.3%), and Gemini Advanced (62.0%) (p < 0.0001). Both Copilot and GPT-4o showed notable improvements with prompt 2 over prompt 1, elevating Copilot's accuracy from the lowest (53.7%) to the second highest (72.3%) among the evaluated LLMs.

Conclusions: While newer iterations of LLMs, such as GPT-4o and Gemini Advanced, outperformed their less advanced counterparts (GPT-4o mini and Gemini), this study emphasizes the need for caution in clinical applications of these models. The choice of prompting technique significantly influences performance, highlighting the necessity for further research to refine LLM capabilities, particularly in visual data interpretation, to ensure their safe integration into medical practice.


Source
http://dx.doi.org/10.1016/j.jcjo.2025.01.001


Similar Publications


Introduction The application of natural language processing (NLP) for extracting data from biomedical research has gained momentum with the advent of large language models (LLMs). However, the effect of different LLM parameters, such as temperature settings, on biomedical text mining remains underexplored and a consensus on what settings can be considered "safe" is missing. This study evaluates the impact of temperature settings on LLM performance for a named entity recognition and a classification task in clinical trial publications.
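A temperature-sweep evaluation like the one described above can be sketched as follows. This is an illustrative assumption, not the study's actual code: the model name, endpoint, and prompt wording are placeholders, and only the request payloads are constructed (no API call is made).

```python
import json

def build_requests(text: str, temperatures=(0.0, 0.3, 0.7, 1.0)):
    """Build one chat-style request payload per temperature setting
    for a simple named-entity-recognition prompt over clinical-trial text."""
    prompt = (
        "Extract all drug names mentioned in the following clinical-trial "
        f"sentence and return them as a JSON list:\n{text}"
    )
    return [
        {
            "model": "example-llm",  # placeholder model name
            "temperature": t,        # the parameter under study
            "messages": [{"role": "user", "content": prompt}],
        }
        for t in temperatures
    ]

payloads = build_requests("Patients received 10 mg of atorvastatin daily.")
print(json.dumps(payloads[0], indent=2))
```

Holding the prompt fixed while varying only `temperature` isolates the effect of sampling randomness on extraction accuracy, which is the comparison the study describes.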


Background Large-scale secondary use of clinical databases requires automated tools for retrospective extraction of structured content from free-text radiology reports. Purpose To share data and insights on the application of privacy-preserving open-weights large language models (LLMs) for report content extraction, with comparison to standard rule-based systems and the closed-weights LLMs from OpenAI. Materials and Methods In this retrospective exploratory study conducted between May 2024 and September 2024, zero-shot prompting of 17 open-weights LLMs was performed.

Article Synopsis
  • The study compared the readability of patient education materials from the Turkish Ophthalmological Association (TOA) on retinopathy of prematurity (ROP) with those generated by large language models (LLMs) like GPT-4.0, GPT-4o mini, and Gemini.
  • The TOA materials were found to exceed the recommended 6th-grade reading level, while GPT-4.0 and Gemini provided significantly clearer responses.
  • GPT-4.0 stood out for its superior accuracy and comprehensiveness in generating understandable patient education materials, but caution is needed regarding regional medical differences when applying LLMs in healthcare.
