The future of AI clinicians: assessing the modern standard of chatbots and their approach to diagnostic uncertainty.

Ryan S Huang, Ali Benour, Joel Kemppainen, Fok-Han Leung

BMC Med Educ

Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada.

Published: October 2024

The study assesses how well AI chatbots GPT-4o and Claude-3 perform in handling medical questions with diagnostic uncertainty, compared to Family Medicine residents.
Researchers used questions from Progress Tests taken by Family Medicine residents, focusing on scenarios lacking clear diagnoses that require complex reasoning to solve.
Results showed that both chatbots scored lower than the residents on these challenging questions, with GPT-4o scoring 53.3% and Claude-3 scoring 57.7%, while the residents scored around 61-63%.

Background: Artificial intelligence (AI) chatbots have demonstrated proficiency in structured knowledge assessments; however, there is limited research on their performance in scenarios involving diagnostic uncertainty, which requires careful interpretation and complex decision-making. This study aims to evaluate the efficacy of AI chatbots, GPT-4o and Claude-3, in addressing medical scenarios characterized by diagnostic uncertainty relative to Family Medicine residents.

Methods: Questions with diagnostic uncertainty were extracted from the Progress Tests administered by the Department of Family and Community Medicine at the University of Toronto between 2022 and 2023. Diagnostic uncertainty questions were defined as those presenting clinical scenarios where symptoms, clinical findings, and patient histories do not converge on a definitive diagnosis, necessitating nuanced diagnostic reasoning and differential diagnosis. These questions were administered to a cohort of 320 Family Medicine residents in their first (PGY-1) and second (PGY-2) postgraduate years and inputted into GPT-4o and Claude-3. Errors were categorized into statistical, information, and logical errors. Statistical analyses were conducted using a binomial generalized estimating equation model, paired t-tests, and chi-squared tests.

Results: Compared to the residents, both chatbots scored lower on diagnostic uncertainty questions (p < 0.01). PGY-1 residents achieved a correctness rate of 61.1% (95% CI: 58.4-63.7), and PGY-2 residents achieved 63.3% (95% CI: 60.7-66.1). In contrast, Claude-3 correctly answered 57.7% (n = 52/90) of questions, and GPT-4o correctly answered 53.3% (n = 48/90). Claude-3 had a longer mean response time (24.0 s, 95% CI: 21.0-32.5 vs. 12.4 s, 95% CI: 9.3-15.3; p < 0.01) and produced longer answers (2001 characters, 95% CI: 1845-2212 vs. 1596 characters, 95% CI: 1395-1705; p < 0.01) compared to GPT-4o. Most errors by GPT-4o were logical errors (62.5%).
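The abstract gives confidence intervals for the residents' scores but only raw counts for the chatbots (52/90 and 48/90). As an illustrative sketch only, the snippet below puts a 95% Wilson score interval around each chatbot proportion. This is not the authors' analysis, which used a binomial generalized estimating equation model; the helper name `wilson_interval` and the choice of the Wilson method are assumptions made for this example. Note that 52/90 is 57.78%, which the abstract reports as 57.7%.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return the Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z=1.96 gives roughly 95% coverage.
    """
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Counts taken from the Results paragraph of the abstract.
scores = {"Claude-3": (52, 90), "GPT-4o": (48, 90)}

for name, (correct, total) in scores.items():
    low, high = wilson_interval(correct, total)
    print(f"{name}: {correct}/{total} = {correct / total:.1%} "
          f"(95% Wilson CI {low:.1%} to {high:.1%})")
```

With only 90 questions, each interval spans roughly ten percentage points on either side of the point estimate, so these intervals alone would not separate the chatbots from the residents; the reported p < 0.01 comes from the study's own model, not from this calculation.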

Conclusions: While AI chatbots like GPT-4o and Claude-3 demonstrate potential in handling structured medical knowledge, their performance in scenarios involving diagnostic uncertainty remains suboptimal compared to human residents.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11470580	PMC
http://dx.doi.org/10.1186/s12909-024-06115-5	DOI Listing

Similar Publications

Small bowel lymphangioma: a multidisciplinary approach to diagnostic uncertainty.

BMJ Case Rep

January 2025

General Surgery, Whipps Cross University Hospital NHS Trust, London, UK.

Molly Mead Nichols Ruqayya Lockhart Michael Rathbone Nicholas Reading Vinayata Sheshappanavar

Intra-abdominal lymphangioma, a rare benign lymphatic malformation resulting from an obstruction to lymphatic channels, often has non-specific clinical manifestations. Low incidence rates of this condition, paired with its unusual presentation and ambiguous radiological appearance, commonly lead to diagnostic uncertainty. This pathology can result in significant morbidity and mortality, emphasising the need to achieve early diagnosis and management despite these challenges.

Getting back on track after treatment of cancer: A qualitative interview study of cancer survivors' experiences.

PLoS One

January 2025

Tranzo, Scientific Center for Care and Wellbeing, Tilburg School of Social and Behavioral Sciences, Tilburg University, Tilburg, The Netherlands.

Doris van der Smissen Marjolein Lugtenberg Manon Enting Laurens Beerepoot Floortje Mols

Objective: An increasing number of people resume life after cancer treatment. Although the (long-term) side effects of cancer and its treatment can be significant, less is known about their impact on cancer survivors' participation in daily life. The aim of this study was to explore the common experiences of cancer survivors in resuming life after treatment.

Attractive and repulsive visual aftereffects depend on stimulus contrast.

J Vis

January 2025

Laboratoire des Systèmes Perceptifs, Département d'études cognitives, École normale supérieure, PSL University, France.

Nikos Gekas Pascal Mamassian

Visual perception has been described as a dynamic process where incoming visual information is combined with what has been seen before to form the current percept. Such a process can result in multiple visual aftereffects that can be attractive toward or repulsive away from past visual stimulation. A lot of research has been conducted on what functional role the mechanisms that produce these aftereffects may play.

Biomarkers.

Alzheimers Dement

December 2024

Neurology Department Infanta Leonor Hospital, Madrid, Spain.

Shenda Orrego Mihaela Sava Maria Pilar Garcia María Sagrario Manzano

Background: Biomarkers are essential for making a highly accurate diagnosis in patients with cognitive and behavioral complaints. However, molecular imaging biomarkers do not always provide an answer in daily clinical practice.

Methods: Retrospective, descriptive study of patients who underwent amyloid PET (APscans), performed according to the rational use of this technique, between January 2019 and November 2023 in the Neurology Department, Infanta Leonor Hospital, Madrid, Spain.

Biomarkers.

Alzheimers Dement

December 2024

GE HealthCare, Amersham, UK.

Ariane Bollack Lyduine E Collij Mahnaz Shekari Santiago Bullich Núria Roé-Vellvé

Background: The Centiloid method (CL) was introduced as a tracer-independent measure of cortical amyloid load and is now commonly used in Alzheimer's disease (AD) clinical trials. To facilitate its implementation in clinical settings, the AMYPAD consortium set out to integrate existing literature and recent work from the consortium to provide clinical context-of-use recommendations for the Centiloid scale, which has been submitted to the European Medicines Agency for endorsement as a Biomarker Qualification Opinion.

Method: The literature was screened on PubMed on 7/11/23 to identify articles mentioning "Centiloid".
