AI Article Synopsis

  • The study explores how well contemporary large language models (LLMs) can analyze radiology board-style questions that include images, testing their multimodal capabilities.
  • Researchers evaluated 280 selected questions using three formats (multimodal, image-only, text-only) with three LLMs: GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet, applying statistical tests to analyze their performance.
  • Results showed that GPT-4V and Gemini 1.5 Pro performed similarly across input types, whereas Claude 3.5 Sonnet did best with text and multimodal inputs but significantly worse with image-only inputs, indicating that current LLMs gain little from images in radiology contexts.

Article Abstract

Rationale And Objectives: The expansion of large language models to process images offers new avenues for application in radiology. This study aims to assess the multimodal capabilities of contemporary large language models, which allow analysis of image inputs in addition to textual data, on radiology board-style examination questions with images.

Materials And Methods: A total of 280 questions were retrospectively selected from the AuntMinnie public test bank. The test questions were converted into three prompt formats: (1) Multimodal, (2) Image-only, and (3) Text-only input. Three models, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet, were evaluated using these prompts. The Cochran Q test and pairwise McNemar test were used to compare performance between prompt formats and models.
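
The statistical comparison described above can be illustrated with a minimal sketch. It assumes each question is scored as 1 (correct) or 0 (incorrect) under each of the three prompt formats; the toy data, the function names `cochran_q` and `mcnemar_exact`, and the choice of the exact (binomial) form of the McNemar test are all assumptions made for illustration, since the abstract does not state which variant or software the authors used.

```python
from math import comb, exp


def cochran_q(rows):
    """Cochran's Q test for binary outcomes under exactly three conditions.

    rows: one (0/1, 0/1, 0/1) tuple per question.
    Returns (Q, p_value). With three conditions Q is referred to a
    chi-square distribution with 2 degrees of freedom, whose survival
    function has the closed form exp(-Q / 2).
    """
    k = 3
    if any(len(r) != k for r in rows):
        raise ValueError("this sketch only handles three conditions")
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    total = sum(row_totals)
    denom = k * total - sum(t * t for t in row_totals)
    if denom == 0:
        # Every question was all-correct or all-wrong: no discordance.
        return 0.0, 1.0
    q = (k - 1) * (k * sum(c * c for c in col_totals) - total * total) / denom
    return q, exp(-q / 2)


def mcnemar_exact(a, b):
    """Two-sided exact McNemar test for paired binary outcomes a and b."""
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = only_a + only_b
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data (NOT the study's results): columns are
# (text-only, image-only, multimodal), one row per question.
rows = [
    (1, 1, 1), (1, 0, 1), (1, 0, 1), (1, 0, 1),
    (0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 0, 1),
]
text = [r[0] for r in rows]
image = [r[1] for r in rows]

q_stat, q_p = cochran_q(rows)
print(f"Cochran Q = {q_stat:.3f}, p = {q_p:.3f}")
print(f"McNemar (text vs image) p = {mcnemar_exact(text, image):.3f}")
```

In this kind of design the Cochran Q test serves as an omnibus check across the three formats for one model, and pairwise McNemar tests then locate which formats (or which models, on the same questions) differ. Both tests depend only on the questions where the paired outcomes disagree.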

Results: No significant difference in the percentage of correct answers was found between the text, image, and multimodal prompt formats for GPT-4V (54%, 52%, and 57%, respectively; p = .31) or Gemini 1.5 Pro (53%, 54%, and 57%, respectively; p = .53). For Claude 3.5 Sonnet, the image input (48%) significantly underperformed compared to the text input (63%, p < .001) and the multimodal input (66%, p < .001), but no significant difference was found between the text and multimodal inputs (p = .29). Claude significantly outperformed GPT and Gemini in the text and multimodal formats (p < .01).

Conclusion: Vision-capable large language models cannot effectively use images to increase performance on radiology board-style examination questions. When using textual data alone, Claude 3.5 Sonnet outperforms GPT-4V and Gemini 1.5 Pro, highlighting the advancements in the field and its potential for use in further research.


Source
http://dx.doi.org/10.1016/j.acra.2024.11.028

Publication Analysis

Top Keywords

large language: 12
language models: 12
gemini pro: 8
claude sonnet: 8
prompt formats: 8
models: 4
models vision: 4
vision diagnostic: 4
diagnostic radiology: 4
radiology board: 4

