The Double-Edged Sword of Generative AI: Surpassing an Expert or a Deceptive "False Friend"?

Franziska C S Altorfer Michael J Kelly Fedan Avrumova Varun Rohatgi Jiaqi Zhu Christopher M Bono Darren R Lebl

Spine J

Department of Spine Surgery, Hospital for Special Surgery, 523 East 72nd Street, New York, NY, USA.

Published: March 2025

Background Context: Generative artificial intelligence (AI), of which ChatGPT is the most popular example, has been extensively assessed for its ability to answer medical questions, such as queries about spine treatment approaches or technological advances. However, its output often lacks a scientific foundation or cites fabricated references, a phenomenon known as AI hallucination.

Purpose: To characterize the scientific basis of generative AI tools by assessing the authenticity and reliability of the references they cite, alongside the alignment of their responses with evidence-based guidelines.

Study Design: Comparative study.

Methods: Thirty-three previously published North American Spine Society (NASS) guideline questions were posed as prompts to two freely available generative AI tools (Tools I and II). The responses were scored for correctness compared with the published NASS guideline responses using a five-point "alignment score." Furthermore, all cited references were evaluated for authenticity, source type, year of publication, and inclusion in the scientific guidelines.

Results: Across both tools, responses to the guideline questions achieved an overall alignment score of 3.5±1.1, which was considered acceptably equivalent to the guideline. Together, the tools generated 254 references to support their responses, of which 76.0% (n = 193) were authentic and 24.0% (n = 61) were fabricated. The authentic references comprised peer-reviewed scientific research papers (147, 76.2%), guidelines (16, 8.3%), educational websites (9, 4.7%), books (9, 4.7%), a government website (1, 0.5%), insurance websites (6, 3.1%), and newspaper websites (5, 2.6%). Claude referenced significantly more authentic peer-reviewed scientific papers than Gemini (Claude: n = 111, 91.0%; Gemini: n = 36, 50.7%; p < 0.001). The year of publication across all references ranged from 1988 to 2023, with Claude providing significantly older references (Claude: 2008±6; Gemini: 2014±6; p < 0.001). Lastly, significantly more of the references provided by Claude were also cited in the published NASS guidelines (Claude: n = 27, 24.3%; Gemini: n = 1, 2.8%; p = 0.04).
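The reported proportions can be re-derived from the counts given in the Results. The short Python sketch below is only an arithmetic check, not part of the study's analysis. The per-tool totals of authentic references (122 and 71) are not stated in the abstract; they are an assumption, inferred here from the reported counts and percentages, and the helper name `pct` is illustrative.

```python
# Counts as reported in the Results.
total_refs = 254
authentic = 193
fabricated = 61

# Breakdown of the authentic references by source type (reported counts).
source_types = {
    "peer-reviewed papers": 147,
    "guidelines": 16,
    "educational websites": 9,
    "books": 9,
    "government website": 1,
    "insurance websites": 6,
    "newspaper websites": 5,
}


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# ASSUMPTION: per-tool authentic totals are not given in the abstract.
# They are inferred from the reported counts and percentages
# (111 is 91.0% and 36 is 50.7% of each tool's authentic references).
claude_authentic = round(111 / 0.910)
gemini_authentic = round(36 / 0.507)

overall = {"authentic": pct(authentic, total_refs),
           "fabricated": pct(fabricated, total_refs)}
by_type = {name: pct(n, authentic) for name, n in source_types.items()}

# Share of each tool's authentic references that were peer-reviewed papers.
claude_peer_reviewed = pct(111, claude_authentic)
gemini_peer_reviewed = pct(36, gemini_authentic)

# Share of each tool's peer-reviewed papers also cited in the NASS guidelines.
claude_in_guideline = pct(27, 111)
gemini_in_guideline = pct(1, 36)

if __name__ == "__main__":
    print(overall)
    print(by_type)
    print(claude_peer_reviewed, gemini_peer_reviewed)
    print(claude_in_guideline, gemini_in_guideline)
```

Under that assumption, the inferred per-tool totals sum to the 193 authentic references, and the source-type counts also sum to 193, so the reported figures are internally consistent.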

Conclusions: Both generative AI tools provided responses that had acceptable alignment with NASS evidence-based guideline recommendations and offered references, though nearly a quarter of the references were inauthentic or non-scientific sources. This deficiency of legitimate scientific references does not meet standards for clinical implementation. Considering this limitation, caution should be exercised when applying the output of generative AI tools to clinical applications.

DOI: http://dx.doi.org/10.1016/j.spinee.2025.02.010

Similar Publications

Digital Assessment of Cognitive Health in Outpatient Primary Care: Usability Study.

JMIR Form Res

March 2025

Program in Digital Medicine, Department of Medicine, University of Massachusetts Chan Medical School, Worcester, MA, United States.

Adam J Doerr Taylor A Orwig Matthew McNulty Stephanie Denise M Sison David R Paquette

Background: Screening for cognitive impairment in primary care is important, yet primary care physicians (PCPs) report conducting routine cognitive assessments for less than half of patients older than 60 years of age. Linus Health's Core Cognitive Evaluation (CCE), a tablet-based digital cognitive assessment, has been used for the detection of cognitive impairment, but its application in primary care has not yet been studied.

Objective: This study aimed to explore the integration of CCE implementation in a primary care setting.

The Perceptions of Potential Prerequisites for Artificial Intelligence in Danish General Practice: Vignette-Based Interview Study Among General Practitioners.

JMIR Med Inform

March 2025

Center for General Practice at Aalborg University, Department of Clinical Medicine, Aalborg University, Selma Lagerløfs vej 249, Aalborg, 9260 Gistrup, Denmark, 45 29807944.

Natasha Lee Jørgensen Camilla Hoffmann Merrild Martin Bach Jensen Thomas B Moeslund Kristian Kidholm

Background: Artificial intelligence (AI) has been deemed revolutionary in medicine; however, no AI tools have been implemented or validated in Danish general practice. General practice in Denmark has an excellent digitization system for developing and using AI. Nevertheless, there is a lack of involvement of general practitioners (GPs) in developing AI.

Using Large Language Models in the Diagnosis of Acute Cholecystitis: Assessing Accuracy and Guidelines Compliance.

Am Surg

March 2025

Department of Surgery, Sapienza University of Rome, Rome, Italy.

Marta Goglia Arianna Cicolani Francesco Maria Carrano Niccolò Petrucciani Francesco D'Angelo

Background: Large language models (LLMs) are advanced tools capable of understanding and generating human-like text. This study evaluated the accuracy of several commercial LLMs in addressing clinical questions related to diagnosis and management of acute cholecystitis, as outlined in the Tokyo Guidelines 2018 (TG18). We assessed their congruence with the expert panel discussions presented in the guidelines.

Differences between children and young adults in the effects of difficulty and value of learning items on cognitive offloading strategies.

Psychol Res

March 2025

School of Education, Guangzhou University, Guangzhou, 510006, People's Republic of China.

Xiaoxiao Dong Jiawei Wang Qiang Xing Jianjun Sun

Cognitive offloading refers to the use of external tools to assist in memory processes. This study investigates the effects of item difficulty and value on cognitive offloading during a word-pair learning task, comparing children and young adults in a context where both cues coexist. In Experiment 1, we examined the impact of difficulty and value cues on cognitive offloading using a 2 (difficulty: easy vs.

Short- and Medium-Term effects of major Ozone therapy on disease parameters in fibromyalgia syndrome: A retrospective study.

Rheumatol Int

March 2025

Adana City Research and Training Hospital, Adana, Türkiye.

Ahmet Üşen Didem Sezgin Özcan Mehmet Ağirman Hilal Güner Burhan Fatih Kocyigit

Background: Fibromyalgia syndrome (FMS) is a chronic condition causing widespread pain, fatigue, and sleep disturbances. Conventional treatments often provide limited relief, leading to growing interest in complementary therapies like ozone therapy.

Objective: This study aims to retrospectively evaluate the short- and medium-term efficacy of ozone therapy in patients with FMS, focusing on changes in pain, functional status, sleep quality, fatigue, anxiety, and depression.
