Clinical decision-making is one of the most impactful parts of a physician's responsibilities and stands to benefit greatly from artificial intelligence solutions and large language models (LLMs) in particular. However, while LLMs have achieved excellent performance on medical licensing exams, these tests fail to assess many skills necessary for deployment in a realistic clinical decision-making environment, including gathering information, adhering to guidelines, and integrating into clinical workflows. Here we have created a curated dataset based on the Medical Information Mart for Intensive Care database spanning 2,400 real patient cases and four common abdominal pathologies as well as a framework to simulate a realistic clinical setting. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients. Furthermore, we move beyond diagnostic accuracy and demonstrate that they cannot be easily integrated into existing workflows because they often fail to follow instructions and are sensitive to both the quantity and order of information. Overall, our analysis reveals that LLMs are currently not ready for autonomous clinical decision-making while providing a dataset and framework to guide future studies.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11405275
DOI: http://dx.doi.org/10.1038/s41591-024-03097-1

Publication Analysis

Top Keywords

clinical decision-making: 16
large language: 8
language models: 8
realistic clinical: 8
clinical: 6
evaluation mitigation: 4
mitigation limitations: 4
limitations large: 4
models clinical: 4
decision-making: 4

Similar Publications

In modern knee arthroplasty, surgeons increasingly aim for individualised implant selection based on data-driven decisions to improve patient satisfaction rates. Identifying an implant design that optimally fits a patient's native kinematic patterns and functional requirements could provide a basis for subject-specific phenotyping. The goal of this study was to take a first step towards identifying easily accessible and intuitive features that allow for discrimination between implant designs based on kinematic data.

Background: Immune checkpoint inhibitors (ICIs) in combination with antiangiogenic drugs have shown promising outcomes in the third-line and subsequent treatments of patients with microsatellite stable metastatic colorectal cancer (MSS-mCRC). Radiotherapy (RT) may enhance the antitumor effect of immunotherapy. However, the effect of RT exposure on patients receiving ICIs and targeted therapy remains unclear.

Machine learning to predict the decision to perform surgery in hepatic echinococcosis.

HPB (Oxford)

December 2024

Fondazione IRCCS Policlinico San Matteo, SC Chirurgia Generale 1, Pavia, Italy.

Background: Cystic echinococcosis (CE) is a significant public health issue, primarily affecting the liver. While several management strategies exist, there is a lack of predictive tools to guide surgical decisions for hepatic CE. This study aimed to develop predictive models to support surgical decision-making in hepatic CE, enhancing the precision of patient allocation to surgical or non-surgical management pathways.

Reliability and reproducibility of systematic reviews informing the 2020-2025 Dietary Guidelines for Americans: a pilot study.

Am J Clin Nutr

January 2025

School of Epidemiology and Public Health, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada; Clinical Epidemiology Program, Ottawa Hospital Research Institute, Ottawa, ON, Canada.

Background: Although high-quality nutrition systematic reviews (SRs) are important for clinical decision making, there remains debate on their methodological quality and reporting transparency.

Objectives: The objective of this study was to assess the reliability and reproducibility of a sample of SRs produced by the Nutrition Evidence Systematic Review (NESR) team to inform the 2020-2025 Dietary Guidelines for Americans (DGAs).

Methods: We evaluated a sample of 8 SRs from the DGA dietary patterns subcommittee for methodological quality using the Assessment of Multiple Systematic Reviews 2 (AMSTAR 2) tool and for reporting transparency using the PRISMA 2020 and PRISMA literature search extension (PRISMA-S) checklists.

Defining the optimal radiation-induced lymphopenia metric to discern its survival impact in esophageal cancer.

Int J Radiat Oncol Biol Phys

January 2025

Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston TX, United States of America; Department of Radiation Oncology, Amsterdam UMC, Amsterdam, The Netherlands.

Background: A detrimental association between radiation-induced lymphopenia (RIL) and oncologic outcomes in esophageal cancer patients has been established. However, an optimal metric for RIL remains undefined, although one is important for applying this knowledge in clinical decision-making and trial design. The aim of this study was to find the optimal RIL metric for discerning survival.
