Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4.

Adi Lahat, Kassem Sharif, Narmin Zoabi, Yonatan Shneor Patt, Yousra Sharif, Lior Fisher, Uria Shani, Mohamad Arow, Roni Levin, Eyal Klang

J Med Internet Res

Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, United States.

Published: June 2024

Background: Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement.

Objective: This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas and to illustrate their potential role in health care decision-making, comparing ratings given by senior physicians and residents and ratings across question types.

Methods: A total of 4 specialized physicians formulated 176 real-world clinical questions. A total of 8 senior physicians and residents assessed responses from GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. Evaluations were conducted within internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across classifications.

Results: Both GPT models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, with seniors consistently rating responses higher than residents for both models. Specifically, seniors rated GPT-4 as more beneficial and complete than residents did (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), and rated GPT-3.5 similarly (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across accuracy and completeness criteria. Differences among question types were significant, particularly for GPT-4's mean completeness scores across emergency, internal medicine, and ethics questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001), and for GPT-3.5's accuracy, benefit, and completeness scores.

Conclusions: ChatGPT's potential to assist physicians with medical issues is promising, with prospects to enhance diagnostics, treatments, and ethics. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments.
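The rating scheme described in the Methods (1-5 scores per category, grouped by model and by rater seniority) can be summarized with a few lines of Python. This is only a minimal sketch: the records below are hypothetical toy values, not data from the study, and the `summarize` helper and the record layout are assumptions made for illustration. The abstract does not state which statistical test produced the reported P values, so none is shown here.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy ratings, NOT data from the study.
# Each record: (model, rater_group, category, score on the 1-5 scale).
ratings = [
    ("GPT-4", "senior", "accuracy", 5),
    ("GPT-4", "resident", "accuracy", 4),
    ("GPT-4", "senior", "utility", 5),
    ("GPT-4", "resident", "utility", 4),
    ("GPT-3.5", "senior", "accuracy", 4),
    ("GPT-3.5", "resident", "accuracy", 3),
    ("GPT-3.5", "senior", "utility", 4),
    ("GPT-3.5", "resident", "utility", 3),
]


def summarize(records, key_index):
    """Group scores by the chosen record fields and return (mean, SD) per group."""
    groups = defaultdict(list)
    for rec in records:
        score = rec[3]
        if not 1 <= score <= 5:
            raise ValueError(f"score outside the 1-5 scale: {score}")
        key = tuple(rec[i] for i in key_index)
        groups[key].append(score)
    # Sample SD needs at least two values; report 0.0 for a single rating.
    return {
        key: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for key, scores in groups.items()
    }


by_model = summarize(ratings, (0,))            # overall mean and SD per model
by_model_group = summarize(ratings, (0, 1))    # split by senior vs resident

for key, (m, sd) in sorted(by_model_group.items()):
    print(key, f"mean={m:.1f}", f"SD={sd:.1f}")
```

Passing `(0, 2)` as `key_index` would group by model and category instead, which mirrors the per-dimension comparisons reported in the Results.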

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11240076
DOI: http://dx.doi.org/10.2196/54571


Similar Publications

Large language models can accurately populate Vascular Quality Initiative procedural databases using narrative operative reports.

J Vasc Surg

December 2024

Division of Vascular and Endovascular Surgery, Department of Surgery, University of California San Francisco, San Francisco, CA.

Colleen P Flanagan, Karen Trang, Joyce Nacario, Peter A Schneider, Warren J Gasper

Objective: Participation in the Vascular Quality Initiative (VQI) provides important resources to surgeons, but the ability to do so is often limited by time and data entry personnel. Large language models (LLMs) such as ChatGPT (OpenAI) are examples of generative artificial intelligence products that may help bridge this gap. Trained on large volumes of data, the models are used for natural language processing and text generation.
