Background and Aim: Visual data from images are essential for many medical diagnoses. This study evaluates how well multimodal Large Language Models (LLMs) integrate textual and visual information for diagnostic purposes.
Methods: We tested GPT-4o and Claude Sonnet 3.5 on 120 clinical vignettes with and without accompanying images. Each vignette included patient demographics, a chief concern, and relevant medical history. Vignettes were paired with either clinical or radiological images from two sources: 100 images from the OPENi database and 20 images from recent NEJM challenges, ensuring they were not in the LLMs' training sets. Three primary care physicians served as a human benchmark. We analyzed diagnostic accuracy and the models' explanations for a subset of cases.
Results: LLMs outperformed physicians in text-only scenarios (GPT-4o: 70.8 %, Claude Sonnet 3.5: 59.5 %, Physicians: 39.5 %, p < 0.001, Bonferroni-adjusted). With images, accuracy rose in all three groups, with physicians showing the largest gain (GPT-4o: 84.5 %, p < 0.001; Claude Sonnet 3.5: 67.3 %, p = 0.060; Physicians: 78.8 %, p < 0.001, all Bonferroni-adjusted). LLMs altered their explanatory reasoning in 45-60 % of cases when images were provided.
Conclusion: Multimodal LLMs showed higher diagnostic accuracy than physicians in text-only scenarios, even in cases designed to require visual interpretation, suggesting that while images can enhance diagnostic accuracy, they may not be essential in every instance. Although adding images further improved LLM performance, the magnitude of this improvement was smaller than that observed in physicians. These findings suggest that enhanced visual data processing may be needed for LLMs to achieve the degree of image-related performance gains seen in human examiners.
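The size of each group's image-related gain follows directly from the accuracies reported in the Results. The short Python sketch below recomputes those gains in percentage points. The variable and function names are illustrative assumptions and are not taken from the study, and only the accuracy figures quoted in the abstract are used.

```python
# Accuracies (%) as reported in the abstract: (text-only, text + image).
# The dictionary layout and names are assumptions made for this sketch.
reported_accuracy = {
    "GPT-4o": (70.8, 84.5),
    "Claude Sonnet 3.5": (59.5, 67.3),
    "Physicians": (39.5, 78.8),
}


def image_gain(text_only: float, with_image: float) -> float:
    """Return the accuracy gain in percentage points, rounded to one decimal."""
    return round(with_image - text_only, 1)


# Gain per group, in percentage points.
gains = {
    group: image_gain(text_only, with_image)
    for group, (text_only, with_image) in reported_accuracy.items()
}

# Group with the largest image-related gain.
largest_gain_group = max(gains, key=gains.get)

for group, gain in gains.items():
    print(f"{group}: +{gain} percentage points")
print(f"Largest gain: {largest_gain_group}")
```

Under these figures the physicians' gain is roughly three to five times the gain of either model, which is the contrast the Conclusion draws on.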
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11754970
DOI: http://dx.doi.org/10.1016/j.csbj.2024.12.019
Comput Struct Biotechnol J
December 2024
Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA.
BMJ Health Care Inform
January 2025
Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
Objectives: We aimed to evaluate the performance of multiple large language models (LLMs) in data extraction from unstructured and semi-structured electronic health records.
Methods: 50 synthetic medical notes in English, containing a structured and an unstructured part, were drafted and evaluated by domain experts, and subsequently used for LLM-prompting. 18 LLMs were evaluated against a baseline transformer-based model.
Adv Physiol Educ
January 2025
College of Medicine, Alfaisal University, Kingdom of Saudi Arabia.
Despite extensive studies on large language models and their capability to respond to questions from various licensed exams, there has been limited focus on employing chatbots for specific subjects within the medical curriculum, specifically medical neuroscience. This research compared the performances of Claude 3.5 Sonnet (Anthropic), GPT-3.
Adv Physiol Educ
January 2025
Department of Kinesiology and Outdoor Recreation, Southern Utah University, Cedar City, UT, USA.
Learning Objectives (LOs) are a pillar of course design and execution, and thus a focus of curricular reforms. This study explored the extent to which the creation and usage of LOs might be facilitated by three leading chatbots: ChatGPT-4o, Claude 3.5 Sonnet, and Google Gemini Advanced.
Purpose: We present an updated study evaluating the performance of large language models (LLMs) in answering radiation oncology physics questions, focusing on the recently released models.
Methods: A set of 100 multiple choice radiation oncology physics questions, previously created by a well-experienced physicist, was used for this study. The answer options of the questions were randomly shuffled to create "new" exam sets.