The integration of large language models (LLMs) into clinical diagnostics has the potential to transform doctor-patient interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD) approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical examinations, CRAFT-MD focuses on natural dialogues, using simulated artificial intelligence agents to interact with LLMs in a controlled environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4, GPT-3.5, Mistral and LLaMA-2-7b across 12 medical specialties. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history-taking and diagnostic accuracy. These limitations also persisted when analyzing multimodal conversational and visual assessment capabilities of GPT-4V. We propose a comprehensive set of recommendations for future evaluations of clinical LLMs based on our empirical findings. These recommendations emphasize realistic doctor-patient conversations, comprehensive history-taking, open-ended questioning and using a combination of automated and expert evaluations. The introduction of CRAFT-MD marks an advancement in testing of clinical LLMs, aiming to ensure that these models augment medical practice effectively and ethically.
DOI: http://dx.doi.org/10.1038/s41591-024-03328-5
BMJ Qual Saf
January 2025
National Center for Human Factors in Healthcare, MedStar Health Research Institute, Washington, District of Columbia, USA.
Generative artificial intelligence (AI) technologies have the potential to revolutionise healthcare delivery but require classification and monitoring of patient safety risks. To address this need, we developed and evaluated a preliminary classification system for categorising generative AI patient safety errors. Our classification system is organised around two AI system stages (input and output) with specific error types by stage.
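The abstract describes a taxonomy organised by the AI system stage (input or output) at which an error occurs, with specific error types nested under each stage. A minimal sketch of that two-level structure follows; the individual error types shown are hypothetical placeholders, not the study's actual categories.

```python
# Two-stage error taxonomy: errors grouped by the AI system stage
# (input vs. output) at which they occur, as the abstract describes.
# The specific error types below are hypothetical placeholders.
TAXONOMY = {
    "input": ["incomplete patient context", "ambiguous clinical query"],
    "output": ["fabricated finding", "omitted safety-relevant detail"],
}

def is_valid_category(stage: str, error_type: str) -> bool:
    """Return True if (stage, error_type) is a recognised category pair."""
    return error_type in TAXONOMY.get(stage, [])
```

A monitoring pipeline could log each observed error as a (stage, error_type) pair and reject pairs that fail this check.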
Adv Physiol Educ
January 2025
Assistant Professor, Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand - 814152, India.
The integration of large language models (LLMs) in medical education offers both opportunities and challenges. While these AI-driven tools can enhance access to information and support critical thinking, they also pose risks such as overreliance and ethical concerns. To ensure ethical use, students and instructors must recognize the limitations of LLMs, maintain academic integrity, and handle data cautiously, while instructors should prioritize content quality over AI detection methods.
JMIR Med Inform
January 2025
Servicio Oncologia Radioterápica, Hospital Universitario Virgen Macarena, Andalusian Health Service, Seville, Spain.
Background: In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of large language models in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records. We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators.
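The extraction-and-structuring task this study evaluates can be sketched as a prompt requesting machine-readable output plus a parser for the model's reply. The prompt wording, JSON schema, and sample reply below are illustrative assumptions, not the study's actual protocol or the models' real outputs.

```python
import json

def build_extraction_prompt(report_text: str) -> str:
    """Compose a prompt asking a model to list comorbidities as JSON."""
    return (
        "Extract all patient comorbidities from the clinical report below. "
        "Respond only with a JSON list of objects, each with 'condition' "
        "and 'icd10' (null if unknown) keys.\n\n"
        f"Report:\n{report_text}"
    )

def parse_comorbidities(model_reply: str) -> list:
    """Parse the model's JSON reply into structured records."""
    return json.loads(model_reply)

# Hand-written reply standing in for an actual model response:
reply = '[{"condition": "type 2 diabetes", "icd10": "E11"}]'
records = parse_comorbidities(reply)
```

Comparing models (as the study does for gpt-3.5-turbo-1106 and gpt-4-1106-preview) then reduces to scoring each model's parsed records against the human evaluators' annotations.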
World J Mens Health
December 2024
Division of Urology, Department of Surgery, Far Eastern Memorial Hospital, New Taipei, Taiwan.
Purpose: Information retrieval (IR) and risk assessment (RA) from multi-modality imaging and pathology reports are critical to prostate cancer (PC) treatment. This study aims to evaluate the performance of four general-purpose large language models (LLMs) on IR and RA tasks.
Materials And Methods: We conducted a study using simulated text reports from computed tomography, magnetic resonance imaging, bone scans, and biopsy pathology on stage IV PC patients.