Background Context: Clinical guidelines, developed in concordance with the literature, are often used to guide surgeons' clinical decision-making. Recent advancements in large language models (LLMs) and artificial intelligence (AI) in the medical field come with exciting potential. OpenAI's generative AI model, ChatGPT, can quickly synthesize information and generate responses grounded in the medical literature, which may prove to be a useful tool in clinical decision-making for spine care. The current literature has yet to investigate the ability of ChatGPT to assist clinical decision-making with regard to degenerative spondylolisthesis.

Purpose: This study aimed to assess ChatGPT's concordance with the recommendations set forth by the North American Spine Society (NASS) Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis and to evaluate ChatGPT's accuracy within the context of the most recent literature.

Methods: ChatGPT-3.5 and ChatGPT-4.0 were prompted with questions from the NASS Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis, and their responses were graded as "concordant" or "nonconcordant" relative to the recommendations put forth by NASS. A response was considered "concordant" when ChatGPT generated a recommendation that accurately reproduced all major points of the NASS recommendation. Responses graded "nonconcordant" were further stratified into two subcategories, "insufficient" or "over-conclusive," to provide further insight into the grading rationale. Responses from GPT-3.5 and GPT-4.0 were compared using chi-squared tests.
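To make the workflow concrete, below is a minimal sketch of the prompting procedure described above. This is not the authors' actual script: it assumes the current openai-python v1 client, the question list is a hypothetical placeholder for the 28 NASS guideline questions, and grading against the NASS recommendations was performed manually by the study team.

```python
# Minimal sketch (not the study's actual code) of prompting both models with
# the NASS guideline questions. Assumes the openai-python v1 client; the
# question list below is a hypothetical placeholder for the 28 NASS questions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

nass_questions = [
    "What is the natural history of degenerative lumbar spondylolisthesis?",
    # ...remaining NASS guideline questions...
]

responses = {}
for model in ("gpt-3.5-turbo", "gpt-4"):
    responses[model] = [
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content
        for question in nass_questions
    ]

# Each stored response is then graded manually against the corresponding NASS
# recommendation as "concordant" or "nonconcordant" ("insufficient" vs.
# "over-conclusive").
```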

Results: ChatGPT-3.5 answered 13 of NASS's 28 total clinical questions in concordance with NASS's guidelines (46.4%). The categorical breakdown is as follows: Definitions and Natural History (1/1, 100%), Diagnosis and Imaging (1/4, 25%), Outcome Measures for Medical Intervention and Surgical Treatment (0/1, 0%), Medical and Interventional Treatment (4/6, 66.7%), Surgical Treatment (7/14, 50%), and Value of Spine Care (0/2, 0%). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-3.5 generated a concordant response 66.7% of the time (6/9). However, ChatGPT-3.5's concordance dropped to 36.8% for clinical questions on which NASS did not provide a clear recommendation (7/19). A further breakdown of ChatGPT-3.5's nonconcordance revealed that the vast majority of its inaccurate recommendations were "over-conclusive" (12/15, 80%) rather than "insufficient" (3/15, 20%). ChatGPT-4.0 answered 19 (67.9%) of the 28 total questions in concordance with NASS guidelines (P = 0.177). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-4.0 also generated a concordant response 66.7% of the time (6/9). ChatGPT-4.0's concordance held at 68.4% for clinical questions on which NASS did not provide a clear recommendation (13/19, P = 0.104).
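As a plausibility check on the reported statistics (this is not the authors' code), the sketch below reproduces both P values from the counts given above. SciPy's chi-squared test applies the Yates continuity correction to 2×2 tables by default, which matches the reported P = 0.177 and P = 0.104.

```python
# Reproducing the reported chi-squared comparisons from the abstract's counts.
# scipy.stats.chi2_contingency applies the Yates continuity correction to
# 2x2 tables by default, which matches the reported P values.
from scipy.stats import chi2_contingency

# All 28 questions: concordant vs. nonconcordant for each model.
overall = [[13, 15],   # GPT-3.5: 13 concordant, 15 nonconcordant
           [19, 9]]    # GPT-4.0: 19 concordant,  9 nonconcordant
chi2, p, _, _ = chi2_contingency(overall)
print(f"overall: chi2={chi2:.3f}, p={p:.3f}")  # p ≈ 0.177

# The 19 questions without a clear NASS recommendation.
no_recommendation = [[7, 12],   # GPT-3.5
                     [13, 6]]   # GPT-4.0
chi2, p, _, _ = chi2_contingency(no_recommendation)
print(f"no clear recommendation: chi2={chi2:.3f}, p={p:.3f}")  # p ≈ 0.104
```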

Conclusions: This study sheds light on the duality of LLM applications within clinical settings: accuracy and utility in some contexts versus inaccuracy and risk in others. ChatGPT was concordant for most clinical questions for which NASS offered recommendations. However, for questions on which NASS did not offer best practices, ChatGPT generated answers that were either too general or inconsistent with the literature, and it even fabricated data and citations. Thus, clinicians should exercise extreme caution when consulting ChatGPT for clinical recommendations, taking care to verify its reliability against the recent literature.


Source
http://dx.doi.org/10.1007/s00586-024-08198-6


Similar Publications

Objectives: Evidence has shown that lesbian, gay, bisexual, and queer (LGBQ) and transgender (together, LGBTQ) patients experience disparities in health care delivery and clinical outcomes. As the predominant U.S. …

Article Synopsis
  • ChatGPT-3.5 and ChatGPT-4.0 were tested on their ability to answer clinical questions related to lumbar disc herniation, based on established NASS guidelines, with a focus on response accuracy and completeness.
  • ChatGPT-4.0 outperformed ChatGPT-3.5, achieving 67% accuracy compared to 47%, and provided significantly more supplementary information, while both showed the same level of incompleteness (40%).
  • Diagnostic testing questions were answered perfectly by ChatGPT-4.0, while ChatGPT-3.5 scored 0%, highlighting a notable improvement with the newer version of the AI.

Features of Importance in Nasal Endoscopy: Deriving a Meaningful Framework.

Otolaryngol Head Neck Surg

October 2024

Department of Otorhinolaryngology, Ochsner Health, New Orleans, Louisiana, USA.

Objective: Critical components of the nasal endoscopic examination have not been definitively established for either the normal examination or for clinical disorders. This study aimed to identify concordance among rhinologists regarding the importance of examination findings for various nasal pathologies.

Study Design: A consortium of 19 expert rhinologists across the United States was asked to rank the importance of findings on nasal endoscopy for 5 different sinonasal symptom presentations.

Article Synopsis
  • The study aimed to evaluate ChatGPT's safety and accuracy in diagnosing and treating cervical radiculopathy compared to established guidelines from the North American Spine Society (NASS).
  • ChatGPT-4 showed a mean completeness of responses at 46%, outperforming ChatGPT-3.5, which had a completeness of 34%, but both versions were found to be difficult to read.
  • Despite the complexity, both ChatGPT versions received a 100% safety rating from a senior spine surgeon, indicating they are safe to use in a clinical context.
Article Synopsis
  • Secondary prevention with penicillin is crucial to avoiding repeat acute rheumatic fever and reducing the risk of rheumatic heart disease (RHD), though penicillin allergy, reported by 10% of the population, complicates this effort.
  • A comprehensive review of the literature revealed no studies specifically addressing penicillin allergy testing in this initial context, but findings from other populations indicated low confirmed allergy rates and very few severe reactions (less than 1-3 per 1000 treated).
  • Research on penicillin allergy delabeling showed that direct oral drug challenges resulted in fewer minor allergic reactions compared to skin testing, with no cases of anaphylaxis or fatalities; confirming or clearing penicillin allergies appears safe and has a low risk of adverse …
