Purpose: This study aimed to evaluate the performance of large language models (LLMs) and multimodal LLMs in interpreting Breast Imaging Reporting and Data System (BI-RADS) categories and providing clinical management recommendations in breast radiology, using both text-based and visual questions.
Methods: This cross-sectional observational study involved two steps. In the first step, we compared ten LLMs (ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K), general radiologists, and a breast radiologist using 100 text-based multiple-choice questions (MCQs) related to the BI-RADS Atlas, 5th edition. In the second step, we assessed the performance of five multimodal LLMs (ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro) in assigning BI-RADS categories and providing clinical management recommendations for 100 breast ultrasound images. Correct answers and accuracy by question type were compared using McNemar's and chi-squared tests; management scores were analyzed using the Kruskal-Wallis and Wilcoxon tests.
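A minimal sketch of these statistical comparisons in Python, assuming per-question binary scores (correct/incorrect) and per-case 3-point Likert ratings; all variable names, counts, and data below are illustrative, not taken from the study:

```python
# Hedged sketch of the tests named above (McNemar's, chi-squared,
# Kruskal-Wallis, Wilcoxon) on synthetic data; nothing here reproduces
# the study's actual data or results.
import numpy as np
from scipy.stats import chi2_contingency, kruskal, wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
llm = rng.integers(0, 2, 100)     # 1 = correct, 0 = incorrect, per MCQ
reader = rng.integers(0, 2, 100)  # paired scores for a comparison reader

# McNemar's test on the paired 2x2 agreement table (same 100 questions)
table = [
    [np.sum((llm == 1) & (reader == 1)), np.sum((llm == 1) & (reader == 0))],
    [np.sum((llm == 0) & (reader == 1)), np.sum((llm == 0) & (reader == 0))],
]
print("McNemar P =", mcnemar(table, exact=True).pvalue)

# Chi-squared test of accuracy across question types
counts = [[30, 10], [25, 15], [20, 20]]  # hypothetical correct/incorrect per type
chi2, p, dof, _ = chi2_contingency(counts)
print("Chi-squared P =", p)

# Kruskal-Wallis across models' 3-point Likert management scores;
# Wilcoxon signed-rank for a paired two-model comparison
a, b, c = (rng.integers(1, 4, 100) for _ in range(3))
print("Kruskal-Wallis P =", kruskal(a, b, c).pvalue)
print("Wilcoxon P =", wilcoxon(a, b).pvalue)
```

McNemar's test fits this design because every reader answered the same question set, making the correct/incorrect outcomes paired rather than independent.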
Results: Claude 3.5 Sonnet achieved the highest accuracy on text-based MCQs (90%), followed by ChatGPT 4o (89%); both outperformed all other LLMs and the general radiologists (78% and 76%) (P < 0.05), except for the Claude 3 Opus models and the breast radiologist (82%) (P > 0.05). Lower-performing LLMs included Google Gemini 1.0 (61%) and ChatGPT 3.5 (60%). Performance across different question categories showed no significant variation among LLMs or radiologists (P > 0.05). On breast ultrasound images, Claude 3.5 Sonnet achieved 59% accuracy, significantly higher than the other multimodal LLMs (P < 0.05). Management recommendations were evaluated on a 3-point Likert scale, with Claude 3.5 Sonnet scoring highest (mean: 2.12 ± 0.97) (P < 0.05). Accuracy varied significantly across BI-RADS categories for all models except Claude 3 Opus (P < 0.05). Gemini 1.5 Pro failed to answer any BI-RADS 5 questions correctly, and ChatGPT 4V failed to answer any BI-RADS 1 questions correctly, making them the least accurate in those categories (P < 0.05).
Conclusion: Although LLMs such as Claude 3.5 Sonnet and ChatGPT 4o show promise in text-based BI-RADS assessments, their limitations in visual diagnostics suggest they should be used cautiously and under radiologists' supervision to avoid misdiagnoses.
Clinical Significance: This study demonstrates that while LLMs exhibit strong capabilities in text-based BI-RADS assessments, their visual diagnostic abilities are currently limited, necessitating further development and cautious application in clinical practice.
DOI: http://dx.doi.org/10.4274/dir.2024.242876
Purpose: We present an updated study evaluating the performance of large language models (LLMs) in answering radiation oncology physics questions, focusing on the recently released models.
Methods: A set of 100 multiple-choice radiation oncology physics questions, previously created by an experienced physicist, was used for this study. The answer options of the questions were randomly shuffled to create "new" exam sets.
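A minimal sketch of that shuffling step, assuming each question is stored as a dict with "stem", "options", and "answer" fields (hypothetical names; the study's actual data format is not described):

```python
# Shuffle a question's answer options while tracking where the key lands.
# The field names and example question are illustrative only.
import random

def shuffle_options(question, seed=None):
    rng = random.Random(seed)
    options = question["options"][:]  # copy so the original is untouched
    rng.shuffle(options)
    return {"stem": question["stem"],
            "options": options,
            "answer_index": options.index(question["answer"])}

q = {"stem": "Which unit expresses absorbed dose?",
     "options": ["Gy", "Sv", "Bq", "C/kg"],
     "answer": "Gy"}
print(shuffle_options(q, seed=42))
```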
Sci Rep
January 2025
Department of Engineering, iHealth Labs, Sunnyvale, CA, 94085, United States.
Large language models (LLMs) are fundamentally transforming human-facing applications in the health and well-being domains: boosting patient engagement, accelerating clinical decision-making, and facilitating medical education. Although state-of-the-art LLMs have shown superior performance in several conversational applications, evaluations within nutrition and diet applications are still insufficient. In this paper, we propose to employ the Registered Dietitian (RD) exam to conduct a standard and comprehensive evaluation of state-of-the-art LLMs, GPT-4o, Claude 3.
Sci Rep
January 2025
Eye Center, Medical Center, Faculty of Medicine, University of Freiburg, Kilianstraße 5, 79106, Freiburg, Germany.
Fuchs Endothelial Corneal Dystrophy (FECD) is the most frequent indication for corneal transplantation, with Descemet membrane endothelial keratoplasty (DMEK), Descemet stripping automated endothelial keratoplasty (DSAEK), and penetrating keratoplasty (PK) being viable options. This retrospective study compared 10-year outcomes of these techniques in a large cohort of 2956 first-time keratoplasty eyes treated for FECD at a high-volume corneal transplant center in Germany. While DMEK and DSAEK provided faster visual recovery (median time to BSCVA ≥ 6/12 Snellen: DMEK 7.
iScience
December 2024
Department of Stomatology, the First Affiliated Hospital, Fujian Medical University, Fuzhou 350005, China.
Large language models (LLMs) offer potential in primary dental care. We conducted an evaluation of LLMs' diagnostic capabilities across various oral diseases and contexts. All LLMs showed diagnostic capabilities for temporomandibular joint disorders, periodontal disease, dental caries, and malocclusion.
BMJ
December 2024
QuantumBlack Analytics, London, UK.
Objective: To evaluate the cognitive abilities of the leading large language models and identify their susceptibility to cognitive impairment, using the Montreal Cognitive Assessment (MoCA) and additional tests.
Design: Cross-sectional analysis.
Setting: Online interaction with large language models via text-based prompts.