Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions.

J Surg Educ

Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Medical Oncology, National Center for Tumor Diseases, University Hospital Heidelberg, Heidelberg, Germany; Department of Medicine I, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany.

Published: April 2025

Objective: Recent studies have investigated the potential of large language models (LLMs) for clinical decision making and for answering exam questions from text input. More recent developments have extended these models with vision capabilities; such image-processing LLMs are called vision-language models (VLMs). However, the applicability of VLMs and their ability to answer exam questions containing images have received limited investigation. The aim of this study was therefore to examine the performance of publicly accessible LLMs on 2 different surgical question sets consisting of text and image questions.

Design: Original text and image exam questions from 2 surgical question subsets of the German Medical Licensing Examination (GMLE) and the United States Medical Licensing Examination (USMLE) were collected and answered by publicly available LLMs (GPT-4, Claude-3 Sonnet, Gemini-1.5). LLM outputs were benchmarked for accuracy on text and image questions. Additionally, LLM performance was compared with students' average historical performance (AHP) on these exams. Variations in LLM performance were also analyzed in relation to question difficulty and image type.
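The abstract does not detail the querying pipeline. As a purely illustrative sketch, the snippet below shows how a single image-based multiple-choice question could be sent to a vision-capable model through the OpenAI Python client and reduced to an option letter for accuracy scoring; the model identifier, prompt wording, and question format are assumptions, not the authors' protocol.

```python
# Minimal sketch (not the authors' pipeline): send one image-based
# multiple-choice question to a vision-capable model and score the answer.
# Model name, prompt wording, and file paths are illustrative assumptions.
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask_image_question(image_path: str, stem: str, options: dict[str, str]) -> str:
    """Return the single option letter (e.g. 'A') chosen by the model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    option_text = "\n".join(f"{k}) {v}" for k, v in options.items())
    prompt = (
        f"{stem}\n{option_text}\n"
        "Answer with the single letter of the best option only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model identifier; the study used GPT-4
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()[:1].upper()

# Accuracy over a (hypothetical) list of questions with known answer keys:
# accuracy = sum(ask_image_question(q.img, q.stem, q.opts) == q.key
#                for q in questions) / len(questions)
```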

Results: All LLMs achieved scores equivalent to passing grades (≥60%) on surgical text questions across both datasets. On image-based questions, only GPT-4 exceeded the passing threshold, significantly outperforming Claude-3 and Gemini-1.5 (GPT-4: 78% vs. Claude-3: 58% vs. Gemini-1.5: 57.3%; p < 0.001). Additionally, GPT-4 outperformed students on both text (GPT-4: 83.7% vs. AHP students: 67.8%; p < 0.001) and image questions (GPT-4: 78% vs. AHP students: 67.4%; p < 0.001).
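The abstract reports p-values but does not name the statistical test used. As a hedged illustration, accuracies of two answer sets are commonly compared with a chi-square test on the 2x2 table of correct versus incorrect counts; the sketch below uses SciPy and invented counts, not the study's data.

```python
# Hedged sketch: comparing two answer accuracies with a chi-square test.
# The counts below are illustrative, not the study's actual numbers.
from scipy.stats import chi2_contingency

def compare_accuracies(correct_a: int, total_a: int,
                       correct_b: int, total_b: int) -> float:
    """Return the p-value for H0: both systems have equal accuracy."""
    table = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

# Example with invented counts (e.g. 200 image questions per model):
# p = compare_accuracies(156, 200, 116, 200)  # roughly 78% vs. 58%
```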

Conclusion: GPT-4 demonstrated substantial capabilities in answering surgical text and image exam questions. It therefore holds considerable potential for use in surgical decision making and in the education of students and trainee surgeons.


Source
http://dx.doi.org/10.1016/j.jsurg.2025.103442

Similar Publications

Objective: To gain knowledge about the attitudes of medical students towards people with intellectual disabilities and the impact of psychiatry teaching on changing these attitudes.

Patients and Methods: The study involved 106 students of medical faculties who had not yet taken a course in psychiatry and 104 who had completed the course and passed the exam.


Automated evaluation systems to enhance exam quality and reduce test anxiety.

PeerJ Comput Sci

February 2025

Educational Technology and Computer Department, Faculty of Specific Education, Kafrelshiekh University, Kafrelshiekh, Egypt.

University examination papers play a crucial role in an institution's quality and impact its accreditation status. In this context, ensuring the quality of examination papers is paramount. In practice, however, manual assessment is laborious, time-consuming, and often inconsistent.


Background: Artificial intelligence has been shown to achieve successful outcomes in various orthopedic qualification examinations worldwide. This study aims to assess the performance of ChatGPT in the written section of the Turkish Orthopedics and Traumatology Board Examination, compare its results with those of candidates who took the exam, and determine whether ChatGPT is sufficient to achieve a passing score.

Methods: This retrospective observational study evaluated whether ChatGPT achieved a passing grade on 400 publicly available questions from the Turkish orthopedics qualification exam over the past four years.


Artificial intelligence (AI) models, like Chat Generative Pre-Trained Transformer (OpenAI, San Francisco, CA), have recently gained significant popularity due to their ability to make autonomous decisions and engage in complex interactions. To fully harness the potential of these learning machines, users must understand their strengths and limitations. As AI tools become increasingly prevalent in our daily lives, it is essential to explore how this technology has been used so far in healthcare and medical education, as well as the areas of medicine where it can be applied.


Purpose: This study examined whether peer instruction enhanced the retention of content for learners who participated in a review session utilizing an Audience Response System (ARS).

Methods: Review sessions were conducted for two groups of students taking the same course. Both groups used ARS to answer questions presented in the session, while only one group also used an educational method known as peer instruction; otherwise, the sessions were identical.

