Evaluating the Adherence of Large Language Models to Surgical Guidelines: A Comparative Analysis of Chatbot Recommendations and North American Spine Society (NASS) Coverage Criteria.

Advith Sarikonda Emily Isch Mitchell Self Abhijeet Sambangi Angeleah Carreras Ahilan Sivaganesan Jim Harrop Jack Jallo

Cureus

Department of Neurosurgery, Thomas Jefferson Medical College, Philadelphia, USA.

Published: September 2024

There has been a growing number of cervical fusion surgeries in the U.S., but there's a lack of research on how well surgeons follow evidence-based medicine (EBM) guidelines, particularly as patients turn to large language models (LLMs) for decision-making assistance.
An observational study tested four LLMs (Bard, BingAI, ChatGPT-3.5, and ChatGPT-4) against the 2023 North American Spine Society (NASS) cervical fusion guidelines, and found that none fully adhered, with only ChatGPT-4 and Bing Chat achieving 60% compliance.
The findings suggest a need for better training of LLMs on clinical guidelines and highlight the necessity of

Background: There has been a significant increase in cervical fusion procedures, both anterior and posterior, across the United States. Despite this upward trend, limited research exists on adherence to evidence-based medicine (EBM) guidelines for cervical fusion, highlighting a gap between recommended practices and surgeon preferences. Additionally, patients are increasingly utilizing large language models (LLMs) to aid in decision-making.

Methodology: This observational study evaluated the capacity of four LLMs, namely, Bard, BingAI, ChatGPT-3.5, and ChatGPT-4, to adhere to EBM guidelines, specifically the 2023 North American Spine Society (NASS) cervical fusion guidelines. Ten clinical vignettes were created based on NASS recommendations to determine when fusion was indicated. This novel approach assessed LLM performance in a clinical decision-making context without requiring institutional review board approval, as no human subjects were involved.

Results: No LLM achieved complete concordance with NASS guidelines, though ChatGPT-4 and Bing Chat exhibited the highest adherence at 60%. Discrepancies were notably observed in scenarios involving head-drop syndrome and pseudoarthrosis, where all LLMs failed to align with NASS recommendations. Additionally, only 25% of LLMs agreed with NASS guidelines for fusion in cases of cervical radiculopathy and as an adjunct to facet cyst resection.

Conclusions: The study underscores the need for improved LLM training on clinical guidelines and emphasizes the importance of considering the nuances of individual patient cases. While LLMs hold promise for enhancing guideline adherence in cervical fusion decision-making, their current performance indicates a need for further refinement and integration with clinical expertise to ensure optimal patient care. This study contributes to understanding the role of AI in healthcare, advocating for a balanced approach that leverages technological advancements while acknowledging the complexities of surgical decision-making.
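
The two kinds of percentage reported in the abstract (per-model adherence across the ten vignettes, and per-vignette agreement across the four LLMs) can be illustrated with a minimal sketch. The function names and all answer data below are hypothetical placeholders chosen only to reproduce the arithmetic; they are not the study's vignettes, the models' actual responses, or the authors' scoring method.

```python
from typing import Dict, List


def adherence_rate(answers: List[bool], guideline: List[bool]) -> float:
    """Fraction of vignettes where one model's fusion recommendation
    matches the guideline's recommendation."""
    if len(answers) != len(guideline):
        raise ValueError("answers and guideline must cover the same vignettes")
    matches = sum(a == g for a, g in zip(answers, guideline))
    return matches / len(guideline)


def per_vignette_agreement(
    all_answers: Dict[str, List[bool]], guideline: List[bool]
) -> List[float]:
    """For each vignette, the fraction of models that match the guideline."""
    n_models = len(all_answers)
    return [
        sum(ans[i] == g for ans in all_answers.values()) / n_models
        for i, g in enumerate(guideline)
    ]


# Hypothetical data: True means "fusion indicated". Ten vignettes, four models.
guideline = [True] * 10
answers = {
    "model_a": [True] * 6 + [False] * 4,   # matches 6 of 10 vignettes
    "model_b": [False] * 10,
    "model_c": [False] * 10,
    "model_d": [False] * 10,
}

rate_a = adherence_rate(answers["model_a"], guideline)
agreement = per_vignette_agreement(answers, guideline)
print(f"model_a adherence: {rate_a:.0%}")
print(f"vignette 1 agreement: {agreement[0]:.0%}")
```

With ten vignettes, a model matching the guideline on six of them scores 60%; with four models, a single concordant model on a given vignette corresponds to 25% agreement.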

Download full-text PDF	Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11448007	PMC
http://dx.doi.org/10.7759/cureus.68521	DOI Listing
