Background Context: Generative artificial intelligence (AI), of which ChatGPT is the most popular example, has been extensively assessed for its ability to answer medical questions, such as queries about spine treatment approaches or technological advances. However, its output often lacks a scientific foundation or cites fabricated references, a phenomenon known as AI hallucination.

Purpose: To understand the scientific basis of generative AI tools by assessing the authenticity and reliability of their references and the alignment of their responses with evidence-based guidelines.

Study Design: Comparative study.

Methods: Thirty-three previously published North American Spine Society (NASS) guideline questions were posed as prompts to two freely available generative AI tools (Tools I and II). The responses were scored for correctness compared with the published NASS guideline responses using a five-point "alignment score." Furthermore, all cited references were evaluated for authenticity, source type, year of publication, and inclusion in the scientific guidelines.

Results: Both tools' responses to the guideline questions achieved an overall alignment score of 3.5±1.1, which is considered acceptably equivalent to the guideline. Together, the tools generated 254 references to support their responses, of which 76.0% (n = 193) were authentic and 24.0% (n = 61) were fabricated. The authentic references comprised peer-reviewed scientific research papers (147, 76.2%), guidelines (16, 8.3%), educational websites (9, 4.7%), books (9, 4.7%), a government website (1, 0.5%), insurance websites (6, 3.1%) and newspaper websites (5, 2.6%). Claude referenced significantly more authentic peer-reviewed scientific papers than Gemini (Claude: n = 111, 91.0%; Gemini: n = 36, 50.7%; p < 0.001). The year of publication across all references ranged from 1988 to 2023, with Claude providing significantly older references (Claude: 2008±6; Gemini: 2014±6; p < 0.001). Lastly, significantly more of the references provided by Claude were also cited in the published NASS guidelines (Claude: n = 27, 24.3%; Gemini: n = 1, 2.8%; p = 0.04).
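The proportions reported above can be rechecked from the stated counts. The following is a minimal sketch, not the authors' analysis: it uses only the counts given in the Results, and the variable names and the `pct` helper are hypothetical, introduced here for illustration.

```python
# Counts copied from the Results paragraph; names are illustrative only.
total_refs = 254
authentic = 193
fabricated = 61

# Breakdown of the authentic references by source type, as reported.
source_types = {
    "peer-reviewed papers": 147,
    "guidelines": 16,
    "educational websites": 9,
    "books": 9,
    "government website": 1,
    "insurance websites": 6,
    "newspaper websites": 5,
}


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Internal consistency: the parts should add up to the stated totals.
counts_consistent = (
    authentic + fabricated == total_refs
    and sum(source_types.values()) == authentic
)

# Shares of all references (denominator: 254).
authentic_pct = pct(authentic, total_refs)
fabricated_pct = pct(fabricated, total_refs)

# Shares of authentic references by source type (denominator: 193).
type_pcts = {name: pct(n, authentic) for name, n in source_types.items()}

print("counts consistent:", counts_consistent)
print("authentic:", authentic_pct, "% fabricated:", fabricated_pct, "%")
for name, share in type_pcts.items():
    print(f"{name}: {share}%")
```

Under these inputs, the source-type percentages are shares of the 193 authentic references rather than of all 254, which is why they sum to roughly 100% on their own.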

Conclusions: Both generative AI tools provided responses that had acceptable alignment with NASS evidence-based guideline recommendations and offered references, though nearly a quarter of the references were inauthentic or non-scientific sources. This deficiency of legitimate scientific references does not meet standards for clinical implementation. Considering this limitation, caution should be exercised when applying the output of generative AI tools to clinical applications.


Source
http://dx.doi.org/10.1016/j.spinee.2025.02.010


