ChatGPT vs Gemini: Comparative Accuracy and Efficiency in CAD-RADS Score Assignment from Radiology Reports.

J Imaging Inform Med

Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Clinical Science Building, Medical University of South Carolina, 96 Jonathan Lucas Street, Suite 210, MSC 323, Charleston, SC, 29425, USA.

Published: November 2024

AI Article Synopsis

  • The study evaluated the ability of four language models (ChatGPT-3.5, ChatGPT-4o, Google Gemini, and Google Gemini Advanced) to accurately generate CAD-RADS scores from coronary CT angiography reports without any fine-tuning.
  • ChatGPT-4o had the highest accuracy at 87%; ChatGPT-3.5 was the fastest but achieved only 50.5% accuracy, and Gemini had the highest failure rate at 12%.
  • Overall, the findings indicate that while model performance showed promise, further refinement of these AI tools is needed before they can be reliably used in clinical decision-making for CAD-RADS scoring.

Article Abstract

This study aimed to evaluate the accuracy and efficiency of ChatGPT-3.5, ChatGPT-4o, Google Gemini, and Google Gemini Advanced in generating CAD-RADS scores from radiology reports. This retrospective study analyzed 100 consecutive coronary computed tomography angiography reports from examinations performed between March 15, 2024, and April 1, 2024, at a single tertiary center. Each report containing a radiologist-assigned CAD-RADS score was processed by four large language models (LLMs) without fine-tuning. The findings section of each report was input into the LLMs, which were tasked with generating CAD-RADS scores. The accuracy of the LLM-generated scores was assessed against the radiologist's score, and the time taken by each model to complete the task was recorded. Statistical analyses included the Mann-Whitney U test, with interobserver agreement assessed using unweighted Cohen's kappa and Krippendorff's alpha. ChatGPT-4o demonstrated the highest accuracy, correctly assigning CAD-RADS scores in 87% of cases (κ = 0.838, α = 0.886), followed by Gemini Advanced with 82.6% accuracy (κ = 0.784, α = 0.897). ChatGPT-3.5, although the fastest (median time = 5 s), was the least accurate (50.5% accuracy, κ = 0.401, α = 0.787). Gemini exhibited a higher failure rate (12%) than the other models, with Gemini Advanced improving only slightly on its predecessor. ChatGPT-4o outperformed the other LLMs in both accuracy and agreement with radiologist-assigned CAD-RADS scores, though ChatGPT-3.5 was significantly faster. Despite their potential, current publicly available LLMs require further refinement before they can be deployed for clinical decision-making in CAD-RADS scoring.
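The abstract reports accuracy alongside unweighted Cohen's kappa as the measure of agreement between each model and the radiologist. As a minimal sketch of how those two statistics are computed (this is not the authors' code, and the score vectors below are hypothetical), kappa corrects observed agreement for the agreement expected by chance from each rater's label frequencies:

```python
from collections import Counter


def accuracy(rater_a, rater_b):
    """Fraction of cases where the two raters assign the same category."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two raters' categorical labels."""
    n = len(rater_a)
    # Observed agreement: proportion of exact matches.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)


# Hypothetical CAD-RADS scores: radiologist vs. an LLM on six reports.
radiologist = ["0", "1", "2", "3", "2", "1"]
model = ["0", "1", "2", "3", "1", "1"]

acc = accuracy(radiologist, model)        # 5/6 ≈ 0.833
kappa = cohens_kappa(radiologist, model)  # ≈ 0.769
```

Because chance agreement rises when a few CAD-RADS categories dominate the sample, kappa is typically lower than raw accuracy, which is consistent with the paired values reported in the abstract (e.g., 87% accuracy vs. κ = 0.838 for ChatGPT-4o).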

Source
http://dx.doi.org/10.1007/s10278-024-01328-y

Publication Analysis

Top Keywords

cad-rads scores (16)
gemini advanced (12)
accuracy efficiency (8)
cad-rads score (8)
radiology reports (8)
google gemini (8)
generating cad-rads (8)
radiologist-assigned cad-rads (8)
accuracy (7)
cad-rads (7)
