Artificial intelligence chatbots have demonstrated potential to enhance clinical decision-making and streamline health care workflows, potentially alleviating administrative burdens. However, the contribution of AI chatbots to radiologic decision-making for clinical scenarios remains insufficiently explored. This study evaluates the accuracy and reliability of four prominent large language models (LLMs), Microsoft Bing, Claude, ChatGPT 3.5, and Perplexity, in offering clinical decision support for initial imaging for suspected pulmonary embolism (PE). Open-ended (OE) and select-all-that-apply (SATA) questions were crafted, covering four variants of case scenarios of PE in line with the American College of Radiology Appropriateness Criteria. These questions were presented to the LLMs by three radiologists from different geographic regions and practice settings. The responses were evaluated against established scoring criteria, with a maximum achievable score of 2 points for OE responses and 1 point for each correct answer in SATA questions. To enable comparative analysis, scores were normalized (score divided by the maximum achievable score). In OE questions, Perplexity achieved the highest accuracy (0.83), while Claude had the lowest (0.58), with Bing and ChatGPT each scoring 0.75. For SATA questions, Bing led with an accuracy of 0.96, Perplexity was the lowest at 0.56, and both Claude and ChatGPT scored 0.6. Overall, OE questions yielded a higher mean score (0.73) than SATA questions (0.68). Agreement among the radiologists' scores was poor for OE questions (intraclass correlation coefficient [ICC] = -0.067, p = 0.54) but strong for SATA questions (ICC = 0.875, p < 0.001). The study revealed variations in accuracy across LLMs for both OE and SATA questions. Perplexity showed superior performance in OE questions, while Bing excelled in SATA questions. OE queries yielded better overall results.
The current inconsistencies in LLM accuracy highlight the importance of further refinement before these tools can be reliably integrated into clinical practice, with a need for additional LLM fine-tuning and judicious selection by radiologists to achieve consistent and reliable support for decision-making.
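The normalization described above (raw score divided by the maximum achievable score) can be sketched as follows; the raw scores and rater structure in this example are hypothetical, not the study's actual data:

```python
def normalize(score: float, max_score: float) -> float:
    """Normalized accuracy: raw score divided by the maximum achievable score."""
    return score / max_score

OE_MAX = 2.0  # open-ended (OE) responses are scored out of 2 points

# Hypothetical raw OE scores for one model, one score per radiologist rater
oe_scores = [1.5, 2.0, 1.0]
mean_accuracy = sum(normalize(s, OE_MAX) for s in oe_scores) / len(oe_scores)
print(round(mean_accuracy, 2))
```

Averaging the normalized scores across raters and scenarios gives the per-model accuracies (e.g., 0.83 for Perplexity on OE questions) that the abstract compares.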
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11419749
DOI: http://dx.doi.org/10.1055/s-0044-1787974
Background: Biomedical research requires sophisticated understanding and reasoning across multiple specializations. While large language models (LLMs) show promise in scientific applications, their capability to safely and accurately support complex biomedical research remains uncertain.
Methods: We present a novel question-and-answer benchmark for evaluating LLMs in biomedical research.
Nutrients
January 2025
Research Unit for Dietary Studies at The Parker Institute, Bispebjerg and Frederiksberg Hospital, 2000 Frederiksberg, Denmark.
Background: Diet significantly impacts the onset and progression of inflammatory bowel disease (IBD) and offers unique opportunities for treatment and prevention. However, despite growing interest, no diet has been conclusively associated with improved long-term clinical and endoscopic outcomes in IBD, and evidence-based dietary guidelines for IBD remain scarce. This narrative review critically examines dietary assessment methods tailored to the unique needs of IBD, highlighting opportunities for precision and inclusivity.
J Clin Med
January 2025
Department of Radiology, Kastamonu University, Kastamonu 37150, Turkey.
Acute ischemic stroke (AIS) is a leading cause of mortality and disability worldwide, with early and accurate diagnosis being critical for timely intervention and improved patient outcomes. This retrospective study aimed to assess the diagnostic performance of two advanced artificial intelligence (AI) models, Chat Generative Pre-trained Transformer (ChatGPT-4o) and Claude 3.5 Sonnet, in identifying AIS from diffusion-weighted imaging (DWI).
Eur J Investig Health Psychol Educ
January 2025
Faculty of Education, Tel-Hai Academic College, Upper Galilee 2208, Israel.
Large language models (LLMs) offer promising possibilities in mental health, yet their ability to assess disorders and recommend treatments remains underexplored. This quantitative cross-sectional study evaluated four LLMs (Gemini (Gemini 2.0 Flash Experimental), Claude (Claude 3.
Front Artif Intell
January 2025
Department of Clinical and Administrative Pharmacy, University of Georgia College of Pharmacy, Augusta, GA, United States.
Background: Large language models (LLMs) have demonstrated impressive performance on medical licensing and diagnosis-related exams. However, comparative evaluations to optimize LLM performance and ability in the domain of comprehensive medication management (CMM) are lacking. The purpose of this evaluation was to test various LLM performance-optimization strategies and LLM performance on critical care pharmacotherapy questions used in the assessment of Doctor of Pharmacy students.