Artificial intelligence (AI) chatbots have demonstrated potential to enhance clinical decision-making and streamline health care workflows, potentially alleviating administrative burdens. However, the contribution of AI chatbots to radiologic decision-making in clinical scenarios remains insufficiently explored. This study evaluates the accuracy and reliability of four prominent large language models (LLMs), Microsoft Bing, Claude, ChatGPT 3.5, and Perplexity, in offering clinical decision support for initial imaging in suspected pulmonary embolism (PE).

Open-ended (OE) and select-all-that-apply (SATA) questions were crafted covering four variants of PE case scenarios, in line with the American College of Radiology Appropriateness Criteria. These questions were presented to the LLMs by three radiologists from different geographic regions and practice settings. Responses were evaluated against established scoring criteria, with a maximum achievable score of 2 points for OE responses and 1 point for each correct answer in SATA questions. To enable comparative analysis, scores were normalized (score divided by the maximum achievable score).

In OE questions, Perplexity achieved the highest accuracy (0.83) and Claude the lowest (0.58), with Bing and ChatGPT each scoring 0.75. For SATA questions, Bing led with an accuracy of 0.96, Perplexity was lowest at 0.56, and Claude and ChatGPT each scored 0.60. Overall, OE questions received higher scores (0.73) than SATA questions (0.68). Agreement among the radiologists' scores was poor for OE questions (intraclass correlation coefficient [ICC] = -0.067, P = 0.54) but strong for SATA questions (ICC = 0.875, P < 0.001).

The study revealed variations in accuracy across LLMs for both OE and SATA questions. Perplexity showed superior performance in OE questions, while Bing excelled in SATA questions, and OE queries yielded better results overall. These inconsistencies in LLM accuracy highlight the need for further refinement before such tools can be reliably integrated into clinical practice, including additional LLM fine-tuning and judicious model selection by radiologists to achieve consistent and reliable decision support.
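
For readers who want to reproduce the scoring arithmetic, the sketch below shows the normalization step described in the abstract (raw score divided by the maximum achievable score) and one way to compute inter-rater agreement. The paper does not state which ICC variant was used, so the function assumes ICC(2,1) (two-way random effects, absolute agreement, single rater); the score matrix is invented purely for illustration.

```python
import numpy as np

def normalize(score, max_score):
    # Normalization used in the study: raw score / maximum achievable score.
    return score / max_score

def icc2_1(ratings):
    # ICC(2,1): two-way random effects, absolute agreement, single rater
    # (Shrout & Fleiss). `ratings` is an (n_questions x k_raters) array.
    # NOTE: the paper does not specify the ICC variant; this is an assumption.
    r = np.asarray(ratings, dtype=float)
    n, k = r.shape
    grand = r.mean()
    ss_rows = k * ((r.mean(axis=1) - grand) ** 2).sum()    # between questions
    ss_cols = n * ((r.mean(axis=0) - grand) ** 2).sum()    # between raters
    ss_err = ((r - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical raw OE scores (max 2 points) assigned by three radiologists
# to the same four case-scenario responses; the values are made up.
raw = np.array([[2, 1, 2],
                [1, 1, 0],
                [2, 2, 1],
                [1, 0, 1]])
norm = normalize(raw, 2.0)     # normalized scores in [0, 1]
print(norm.mean(axis=0))       # per-rater mean normalized score
print(round(icc2_1(norm), 3))  # inter-rater agreement (ICC)
```

SATA items would be handled the same way, with the maximum achievable score set to the number of correct options for each question.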

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11419749 (PMC)
http://dx.doi.org/10.1055/s-0044-1787974 (DOI)

Publication Analysis

Top Keywords

claude chatgpt: 12
sata questions: 12
radiologic decision-making: 8
pulmonary embolism: 8
accuracy reliability: 8
large language: 8
maximum achievable: 8
achievable score: 8
questions: 6
sata: 5

Similar Publications

Background: Biomedical research requires sophisticated understanding and reasoning across multiple specializations. While large language models (LLMs) show promise in scientific applications, their capability to safely and accurately support complex biomedical research remains uncertain.

Methods: We present , a novel question-and-answer benchmark for evaluating LLMs in biomedical research.

Background: Diet significantly impacts the onset and progression of inflammatory bowel disease (IBD) and offers unique opportunities for treatment and prevention. However, despite growing interest, no diet has been conclusively associated with improved long-term clinical and endoscopic outcomes in IBD, and evidence-based dietary guidelines for IBD remain scarce. This narrative review critically examines dietary assessment methods tailored to the unique needs of IBD, highlighting opportunities for precision and inclusivity.

Acute ischemic stroke (AIS) is a leading cause of mortality and disability worldwide, with early and accurate diagnosis being critical for timely intervention and improved patient outcomes. This retrospective study aimed to assess the diagnostic performance of two advanced artificial intelligence (AI) models, Chat Generative Pre-trained Transformer (ChatGPT-4o) and Claude 3.5 Sonnet, in identifying AIS from diffusion-weighted imaging (DWI).

Large language models (LLMs) offer promising possibilities in mental health, yet their ability to assess disorders and recommend treatments remains underexplored. This quantitative cross-sectional study evaluated four LLMs (Gemini (Gemini 2.0 Flash Experimental), Claude (Claude 3.

Background: Large language models (LLMs) have demonstrated impressive performance on medical licensing and diagnosis-related exams. However, comparative evaluations aimed at optimizing LLM performance in the domain of comprehensive medication management (CMM) are lacking. The purpose of this evaluation was to test various LLM performance optimization strategies and LLM performance on critical care pharmacotherapy questions used in the assessment of Doctor of Pharmacy students.
