Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions.

J Surg Educ

Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Medical Oncology, National Center for Tumor Diseases, University Hospital Heidelberg, Heidelberg, Germany; Department of Medicine I, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany.

Published: April 2025

Objective: Recent studies have investigated the potential of large language models (LLMs) for clinical decision making and for answering exam questions from text input. More recent developments have extended these models with vision capabilities; such image-processing LLMs are called vision-language models (VLMs). However, the applicability of VLMs and their ability to answer exam questions containing images have received limited investigation. The aim of this study was therefore to examine the performance of publicly accessible LLMs on 2 different surgical question sets consisting of text and image questions.

Design: Original text and image exam questions from 2 surgical question subsets of the German Medical Licensing Examination (GMLE) and the United States Medical Licensing Examination (USMLE) were collected and answered by publicly available LLMs (GPT-4, Claude-3 Sonnet, Gemini-1.5). LLM outputs were benchmarked for accuracy on text and image questions. Additionally, LLM performance was compared with students' average historical performance (AHP) on these exams. Variations in LLM performance were also analyzed in relation to question difficulty and image type.
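The abstract does not detail the querying pipeline. As a purely illustrative sketch, the snippet below shows how a single image-based multiple-choice question could be sent to a vision-capable model through the OpenAI Python client and reduced to an option letter for accuracy scoring; the model identifier, prompt wording, and question format are assumptions, not the authors' protocol.

```python
# Minimal sketch (not the authors' pipeline): send one image-based
# multiple-choice question to a vision-capable model and score the answer.
# Model name, prompt wording, and file paths are illustrative assumptions.
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask_image_question(image_path: str, stem: str, options: dict[str, str]) -> str:
    """Return the single option letter (e.g. 'A') chosen by the model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    option_text = "\n".join(f"{k}) {v}" for k, v in options.items())
    prompt = (
        f"{stem}\n{option_text}\n"
        "Answer with the single letter of the best option only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model identifier; the study used GPT-4
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()[:1].upper()

# Accuracy over a (hypothetical) list of questions with known answer keys:
# accuracy = sum(ask_image_question(q.img, q.stem, q.opts) == q.key
#                for q in questions) / len(questions)
```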

Results: All LLMs achieved scores equivalent to passing grades (≥60%) on surgical text questions across both datasets. On image-based questions, only GPT-4 exceeded the passing threshold, significantly outperforming Claude-3 and Gemini-1.5 (GPT-4: 78% vs. Claude-3: 58% vs. Gemini-1.5: 57.3%; p < 0.001). Additionally, GPT-4 outperformed students on both text (GPT-4: 83.7% vs. AHP students: 67.8%; p < 0.001) and image questions (GPT-4: 78% vs. AHP students: 67.4%; p < 0.001).
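The abstract reports p-values but does not name the statistical test used. As a hedged illustration, accuracies of two answer sets are commonly compared with a chi-square test on the 2x2 table of correct versus incorrect counts; the sketch below uses SciPy and invented counts, not the study's data.

```python
# Hedged sketch: comparing two answer accuracies with a chi-square test.
# The counts below are illustrative, not the study's actual numbers.
from scipy.stats import chi2_contingency

def compare_accuracies(correct_a: int, total_a: int,
                       correct_b: int, total_b: int) -> float:
    """Return the p-value for H0: both systems have equal accuracy."""
    table = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

# Example with invented counts (e.g. 200 image questions per model):
# p = compare_accuracies(156, 200, 116, 200)  # roughly 78% vs. 58%
```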

Conclusion: GPT-4 demonstrated substantial capabilities in answering surgical text and image exam questions. It therefore holds considerable potential for use in surgical decision making and in the education of students and trainee surgeons.


Source
http://dx.doi.org/10.1016/j.jsurg.2025.103442

Similar Publications

Objective: To gain knowledge about the attitudes of medical students towards people with intellectual disabilities and the impact of psychiatry teaching on changing these attitudes.

Patients and Methods: The study involved 106 students of medical faculties who had not yet taken a course in psychiatry and 104 who had completed the course and passed the exam.


Automated evaluation systems to enhance exam quality and reduce test anxiety.

PeerJ Comput Sci

February 2025

Educational Technology and Computer Department, Faculty of Specific Education, Kafrelshiekh University, Kafrelshiekh, Egypt.

University examination papers play a crucial role in an institution's quality and impact its accreditation status. In this context, ensuring the quality of examination papers is paramount. In practice, however, manual assessment is laborious, time-consuming, and often inconsistent.


Background: Artificial intelligence has been shown to achieve successful outcomes in various orthopedic qualification examinations worldwide. This study aims to assess the performance of ChatGPT in the written section of the Turkish Orthopedics and Traumatology Board Examination, compare its results with those of candidates who took the exam, and determine whether ChatGPT is sufficient to achieve a passing score.

Methods: This retrospective observational study evaluated whether ChatGPT achieved a passing grade on 400 publicly available questions from the Turkish orthopedics qualification exam over the past four years.


Artificial intelligence (AI) models, like Chat Generative Pre-Trained Transformer (OpenAI, San Francisco, CA), have recently gained significant popularity due to their ability to make autonomous decisions and engage in complex interactions. To fully harness the potential of these learning machines, users must understand their strengths and limitations. As AI tools become increasingly prevalent in our daily lives, it is essential to explore how this technology has been used so far in healthcare and medical education, as well as the areas of medicine where it can be applied.


Purpose: This study examined whether peer instruction enhanced the retention of content for learners who participated in a review session utilizing an Audience Response System (ARS).

Methods: Review sessions were conducted for two groups of students taking the same course. Both groups used ARS to answer questions presented in the session, while only one group also used an educational method known as peer instruction; otherwise, the sessions were identical.

