The integration of large language models (LLMs) into clinical diagnostics has the potential to transform doctor-patient interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD) approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical examinations, CRAFT-MD focuses on natural dialogues, using simulated artificial intelligence agents to interact with LLMs in a controlled environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4, GPT-3.5, Mistral and LLaMA-2-7b across 12 medical specialties. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history-taking and diagnostic accuracy. These limitations also persisted when analyzing the multimodal conversational and visual assessment capabilities of GPT-4V. We propose a comprehensive set of recommendations for future evaluations of clinical LLMs based on our empirical findings. These recommendations emphasize realistic doctor-patient conversations, comprehensive history-taking, open-ended questioning and using a combination of automated and expert evaluations. The introduction of CRAFT-MD marks an advancement in the testing of clinical LLMs, aiming to ensure that these models augment medical practice effectively and ethically.


Source: http://dx.doi.org/10.1038/s41591-024-03328-5


Similar Publications

Generative artificial intelligence (AI) technologies have the potential to revolutionise healthcare delivery but require classification and monitoring of patient safety risks. To address this need, we developed and evaluated a preliminary classification system for categorising generative AI patient safety errors. Our classification system is organised around two AI system stages (input and output) with specific error types by stage.


Ethical engagement with artificial intelligence in medical education. Adv Physiol Educ, January 2025. Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand 814152, India.

The integration of large language models (LLMs) in medical education offers both opportunities and challenges. While these AI-driven tools can enhance access to information and support critical thinking, they also pose risks such as overreliance and ethical concerns. To ensure ethical use, students and instructors must recognize the limitations of LLMs, maintain academic integrity, and handle data cautiously; instructors, in particular, should prioritize content quality over AI-detection methods.



Background: In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of large language models in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records. We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators.


Purpose: Information retrieval (IR) and risk assessment (RA) from multi-modality imaging and pathology reports are critical to prostate cancer (PC) treatment. This study aims to evaluate the performance of four general-purpose large language models (LLMs) in IR and RA tasks.

Materials And Methods: We conducted a study using simulated text reports from computed tomography, magnetic resonance imaging, bone scans, and biopsy pathology on stage IV PC patients.

