The integration of large language models (LLMs) into clinical diagnostics has the potential to transform doctor-patient interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD) approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical examinations, CRAFT-MD focuses on natural dialogues, using simulated artificial intelligence agents to interact with LLMs in a controlled environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4, GPT-3.5, Mistral and LLaMA-2-7b across 12 medical specialties. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history-taking and diagnostic accuracy. These limitations also persisted when analyzing the multimodal conversational and visual assessment capabilities of GPT-4V. We propose a comprehensive set of recommendations for future evaluations of clinical LLMs based on our empirical findings. These recommendations emphasize realistic doctor-patient conversations, comprehensive history-taking, open-ended questioning and using a combination of automated and expert evaluations. The introduction of CRAFT-MD marks an advancement in the testing of clinical LLMs, aiming to ensure that these models augment medical practice effectively and ethically.


Source: http://dx.doi.org/10.1038/s41591-024-03328-5


Similar Publications

Generative artificial intelligence (AI) technologies have the potential to revolutionise healthcare delivery but require classification and monitoring of patient safety risks. To address this need, we developed and evaluated a preliminary classification system for categorising generative AI patient safety errors. Our classification system is organised around two AI system stages (input and output) with specific error types by stage.


Ethical engagement with artificial intelligence in medical education. Adv Physiol Educ, January 2025. Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand 814152, India.

The integration of large language models (LLMs) in medical education offers both opportunities and challenges. While these AI-driven tools can enhance access to information and support critical thinking, they also pose risks such as overreliance and ethical concerns. To ensure ethical use, students and instructors must recognize the limitations of LLMs, maintain academic integrity, and handle data cautiously; instructors, in particular, should prioritize content quality over AI-detection methods.



Background: In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of large language models in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records. We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators.


Purpose: Information retrieval (IR) and risk assessment (RA) from multi-modality imaging and pathology reports are critical to prostate cancer (PC) treatment. This study aims to evaluate the performance of four general-purpose large language models (LLMs) in IR and RA tasks.

Materials And Methods: We conducted a study using simulated text reports from computed tomography, magnetic resonance imaging, bone scans, and biopsy pathology on stage IV PC patients.

