Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis.

JMIR Med Educ

Department of Ultrasound, Peking University First Hospital, 8 Xishiku Rd, Xicheng District, Beijing, 100034, China, 86 13132150190, 86 314521.

Published: January 2025

Background: Artificial intelligence advancements have enabled large language models to significantly impact radiology education and diagnostic accuracy.

Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams.

Methods: A comparative analysis was conducted using 150 multiple-choice, text-only questions (without images) from radiology board exams. Questions were categorized by cognitive level and medical specialty, and model accuracy was compared using χ² tests and ANOVA.

Results: GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150; P<.001), Bard 54.7% (82/150; P<.001), Tongyi Qianwen 70.7% (106/150; P=.009), and Gemini Pro 55.3% (83/150; P<.001). The odds ratios compared to GPT-4 were 0.33 (95% CI 0.18-0.60) for Claude, 0.24 (95% CI 0.13-0.44) for Bard, and 0.25 (95% CI 0.14-0.45) for Gemini Pro. Tongyi Qianwen performed relatively well with an accuracy of 70.7% (106/150; P=.02) and had an odds ratio of 0.48 (95% CI 0.27-0.87) compared to GPT-4. Performance varied across question types and specialties, with GPT-4 excelling in both lower-order and higher-order questions, while Claude and Bard struggled with complex diagnostic questions.
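The odds ratios above can be checked from the reported counts. The following is a minimal sketch, not the authors' analysis code: the function names are hypothetical, and the Wald interval and the uncorrected Pearson χ² test are assumptions, since the abstract does not state which interval or correction was used. Under these assumptions the point estimates match the reported odds ratios, while the confidence limits may differ slightly from the published ones.

```python
import math

# Correct-answer counts out of 150 questions, as reported in the Results.
N_QUESTIONS = 150
CORRECT = {
    "GPT-4": 125,
    "Claude": 93,
    "Bard": 82,
    "Tongyi Qianwen": 106,
    "Gemini Pro": 83,
}


def odds_ratio_vs_reference(correct, ref_correct, n=N_QUESTIONS):
    """Return (odds ratio, lower, upper) against the reference model.

    The 95% interval is a Wald interval on the log odds ratio
    (an assumption; the abstract does not name the method).
    """
    a, b = correct, n - correct            # comparison model: right, wrong
    c, d = ref_correct, n - ref_correct    # reference model: right, wrong
    odds_ratio = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(odds ratio)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se)
    return odds_ratio, lower, upper


def chi2_test(correct, ref_correct, n=N_QUESTIONS):
    """Pearson chi-square for a 2x2 table (1 df, no continuity correction)."""
    a, b = correct, n - correct
    c, d = ref_correct, n - ref_correct
    total = a + b + c + d
    chi2 = total * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    ref = CORRECT["GPT-4"]
    for model, k in CORRECT.items():
        if model == "GPT-4":
            continue
        odds_ratio, lower, upper = odds_ratio_vs_reference(k, ref)
        chi2, p_value = chi2_test(k, ref)
        print(f"{model}: OR={odds_ratio:.2f} "
              f"(Wald 95% CI {lower:.2f}-{upper:.2f}), "
              f"chi2={chi2:.2f}, P={p_value:.4f}")
```

Each odds ratio is the odds of a correct answer for the comparison model divided by the same odds for GPT-4, so values below 1 indicate lower odds of answering correctly than GPT-4.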

Conclusions: GPT-4 and Tongyi Qianwen show promise in medical education and training. The study emphasizes the need for domain-specific training datasets to enhance large language models' effectiveness in specialized fields like radiology.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11756834
DOI: http://dx.doi.org/10.2196/64284

Similar Publications

Assessing the Adherence of ChatGPT Chatbots to Public Health Guidelines for Smoking Cessation: Content Analysis.

J Med Internet Res

January 2025

Department of Engineering Management and Systems Engineering, George Washington University, Washington, DC, United States.

Lorien C Abroms, Artin Yousefi, Christina N Wysota, Tien-Chin Wu, David A Broniatowski

Background: Large language model (LLM) artificial intelligence chatbots using generative language can offer smoking cessation information and advice. However, little is known about the reliability of the information provided to users.

Objective: This study aims to examine whether 3 ChatGPT chatbots (the World Health Organization's Sarah, BeFreeGPT, and BasicGPT) provide reliable information on how to quit smoking.

Assessing Familiarity, Usage Patterns, and Attitudes of Medical Students Toward ChatGPT and Other Chat-Based AI Apps in Medical Education: Cross-Sectional Questionnaire Study.

JMIR Med Educ

January 2025

College of Medicine, Alfaisal University, Takhasussi street, Riyadh, 11533, Saudi Arabia, 966 559441589.

Safia Elwaleed Elhassan, Muhammad Raihan Sajid, Amina Mariam Syed, Sidrah Afreen Fathima, Bushra Shehroz Khan

Background: There has been a rise in the popularity of ChatGPT and other chat-based artificial intelligence (AI) apps in medical education. Despite data being available from other parts of the world, there is a significant lack of information on this topic in medical education and research, particularly in Saudi Arabia.

Objective: The primary objective of the study was to examine the familiarity, usage patterns, and attitudes of Alfaisal University medical students toward ChatGPT and other chat-based AI apps in medical education.

Learning the language of life with AI.

Science

January 2025

Eric J. Topol is the founder and director of the Scripps Research Translational Institute; executive vice president of Scripps Research; chair of the Department of Translational Medicine at Scripps Research; and the Gary and Mary West Endowed Chair of Innovative Medicine at Scripps Research, La Jolla, CA, USA.

Eric J Topol

In 2021, a year before ChatGPT took the world by storm amid the excitement about generative artificial intelligence (AI), AlphaFold 2 cracked the 50-year-old protein-folding problem, predicting three-dimensional (3D) structures for more than 200 million proteins from their amino acid sequences. This accomplishment was a precursor to an unprecedented burgeoning of large language models (LLMs) in the life sciences. That was just the beginning.

Signals of propaganda-Detecting and estimating political influences in information spread in social networks.

PLoS One

January 2025

Department of Information Technologies, Faculty of Economics and Management, Czech University of Life Sciences Prague, Prague, Czech Republic.

Alon Sela, Omer Neter, Václav Lohr, Petr Cihelka, Fan Wang

Social networks are a battlefield for political propaganda. Protected by the anonymity of the internet, political actors use computational propaganda to influence the masses. Their methods include the use of synchronized or individual bots, multiple accounts operated by one social media management tool, or different manipulations of search engines and social network algorithms, all aiming to promote their ideology.

Entity-enhanced BERT for medical specialty prediction based on clinical questionnaire data.

PLoS One

January 2025

School of Industrial and Management Engineering, Korea University, Seongbuk-gu, Seoul, Republic of Korea.

Soyeon Lee, Ye Ji Han, Hyun Joon Park, Byung Hoon Lee, DaHee Son

A medical specialty prediction system for remote diagnosis can reduce the unexpected costs incurred by first-visit patients who visit the wrong hospital department for their symptoms. To develop medical specialty prediction systems, several researchers have explored clinical predictive models using real medical text data. Medical text data include large amounts of information regarding patients, which increases the sequence length.
