Artificial intelligence chatbots have demonstrated potential to enhance clinical decision-making and streamline health care workflows, potentially alleviating administrative burdens. However, the contribution of AI chatbots to radiologic decision-making for clinical scenarios remains insufficiently explored. This study evaluates the accuracy and reliability of four prominent large language models (LLMs), Microsoft Bing, Claude, ChatGPT 3.5, and Perplexity, in offering clinical decision support for initial imaging for suspected pulmonary embolism (PE). Open-ended (OE) and select-all-that-apply (SATA) questions were crafted, covering four variants of case scenarios of PE in line with the American College of Radiology Appropriateness Criteria. These questions were presented to the LLMs by three radiologists from different geographic regions and practice settings. The responses were evaluated against established scoring criteria, with a maximum achievable score of 2 points for OE responses and 1 point for each correct answer in SATA questions. To enable comparative analysis, scores were normalized (score divided by the maximum achievable score). In OE questions, Perplexity achieved the highest accuracy (0.83), while Claude had the lowest (0.58), with Bing and ChatGPT each scoring 0.75. For SATA questions, Bing led with an accuracy of 0.96, Perplexity was the lowest at 0.56, and both Claude and ChatGPT scored 0.6. Overall, OE questions yielded a higher mean score (0.73) than SATA questions (0.68). Agreement among the radiologists' scores was poor for OE questions (intraclass correlation coefficient [ICC] = -0.067, p = 0.54) but strong for SATA questions (ICC = 0.875, p < 0.001). The study revealed variations in accuracy across LLMs for both OE and SATA questions. Perplexity showed superior performance in OE questions, while Bing excelled in SATA questions. OE queries yielded better overall results.
The current inconsistencies in LLM accuracy highlight the importance of further refinement before these tools can be reliably integrated into clinical practice, with a need for additional LLM fine-tuning and judicious selection by radiologists to achieve consistent and reliable support for decision-making.
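The normalization described above (raw score divided by the maximum achievable score) can be sketched as follows; the raw scores and rater structure in this example are hypothetical, not the study's actual data:

```python
def normalize(score: float, max_score: float) -> float:
    """Normalized accuracy: raw score divided by the maximum achievable score."""
    return score / max_score

OE_MAX = 2.0  # open-ended (OE) responses are scored out of 2 points

# Hypothetical raw OE scores for one model, one score per radiologist rater
oe_scores = [1.5, 2.0, 1.0]
mean_accuracy = sum(normalize(s, OE_MAX) for s in oe_scores) / len(oe_scores)
print(round(mean_accuracy, 2))
```

Averaging the normalized scores across raters and scenarios gives the per-model accuracies (e.g., 0.83 for Perplexity on OE questions) that the abstract compares.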
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11419749
DOI: http://dx.doi.org/10.1055/s-0044-1787974
Background: Biomedical research requires sophisticated understanding and reasoning across multiple specializations. While large language models (LLMs) show promise in scientific applications, their capability to safely and accurately support complex biomedical research remains uncertain.
Methods: We present a novel question-and-answer benchmark for evaluating LLMs in biomedical research.
Nutrients
January 2025
Research Unit for Dietary Studies at The Parker Institute, Bispebjerg and Frederiksberg Hospital, 2000 Frederiksberg, Denmark.
Background: Diet significantly impacts the onset and progression of inflammatory bowel disease (IBD) and offers unique opportunities for treatment and prevention. However, despite growing interest, no diet has been conclusively associated with improved long-term clinical and endoscopic outcomes in IBD, and evidence-based dietary guidelines for IBD remain scarce. This narrative review critically examines dietary assessment methods tailored to the unique needs of IBD, highlighting opportunities for precision and inclusivity.
J Clin Med
January 2025
Department of Radiology, Kastamonu University, Kastamonu 37150, Turkey.
Acute ischemic stroke (AIS) is a leading cause of mortality and disability worldwide, with early and accurate diagnosis being critical for timely intervention and improved patient outcomes. This retrospective study aimed to assess the diagnostic performance of two advanced artificial intelligence (AI) models, Chat Generative Pre-trained Transformer (ChatGPT-4o) and Claude 3.5 Sonnet, in identifying AIS from diffusion-weighted imaging (DWI).
Eur J Investig Health Psychol Educ
January 2025
Faculty of Education, Tel-Hai Academic College, Upper Galilee 2208, Israel.
Large language models (LLMs) offer promising possibilities in mental health, yet their ability to assess disorders and recommend treatments remains underexplored. This quantitative cross-sectional study evaluated four LLMs (Gemini (Gemini 2.0 Flash Experimental), Claude (Claude 3.
Front Artif Intell
January 2025
Department of Clinical and Administrative Pharmacy, University of Georgia College of Pharmacy, Augusta, GA, United States.
Background: Large language models (LLMs) have demonstrated impressive performance on medical licensing and diagnosis-related exams. However, comparative evaluations to optimize LLM performance and ability in the domain of comprehensive medication management (CMM) are lacking. The purpose of this evaluation was to test various LLM performance-optimization strategies and LLM performance on critical care pharmacotherapy questions used in the assessment of Doctor of Pharmacy students.