Background Context: Generative artificial intelligence (AI), with ChatGPT as the most popular example, has been extensively assessed for its ability to answer medical questions, such as queries about spine treatment approaches or technological advances. However, its responses often lack scientific foundation or cite fabricated references, a phenomenon known as AI hallucination.
Purpose: To assess the scientific basis of generative AI tools by evaluating the authenticity of the references they cite and the alignment of their responses with evidence-based guidelines.
Study Design: Comparative study.
Methods: Thirty-three previously published North American Spine Society (NASS) guideline questions were posed as prompts to two freely available generative AI tools (Claude and Gemini). Responses were scored for correctness against the published NASS guideline recommendations using a five-point "alignment score." Furthermore, all cited references were evaluated for authenticity, source type, year of publication, and inclusion in the published NASS guidelines.
Results: Both tools' responses to the guideline questions achieved an overall score of 3.5±1.1, indicating acceptable alignment with the guidelines. Together, the tools generated 254 references to support their responses, of which 76.0% (n = 193) were authentic and 24.0% (n = 61) were fabricated. The authentic references comprised peer-reviewed scientific research papers (n = 147, 76.2%), guidelines (n = 16, 8.3%), educational websites (n = 9, 4.7%), books (n = 9, 4.7%), a government website (n = 1, 0.5%), insurance websites (n = 6, 3.1%), and newspaper websites (n = 5, 2.6%). Claude cited significantly more authentic peer-reviewed scientific papers (Claude: n = 111, 91.0%; Gemini: n = 36, 50.7%; p < 0.001). Publication years across all references ranged from 1988 to 2023, with Claude providing significantly older references (Claude: 2008±6; Gemini: 2014±6; p < 0.001). Lastly, significantly more of Claude's references were also cited in the published NASS guidelines (Claude: n = 27, 24.3%; Gemini: n = 1, 2.8%; p = 0.04).
Conclusions: Both generative AI tools provided responses with acceptable alignment with NASS evidence-based guideline recommendations and offered supporting references, though nearly a quarter of those references were inauthentic or drawn from non-scientific sources. This deficiency of legitimate scientific references does not meet the standards required for clinical implementation. Given this limitation, caution should be exercised when applying the output of generative AI tools in clinical practice.
DOI: http://dx.doi.org/10.1016/j.spinee.2025.02.010
JMIR Form Res
March 2025
Program in Digital Medicine, Department of Medicine, University of Massachusetts Chan Medical School, Worcester, MA, United States.
Background: Screening for cognitive impairment in primary care is important, yet primary care physicians (PCPs) report conducting routine cognitive assessments for fewer than half of patients older than 60 years of age. Linus Health's Core Cognitive Evaluation (CCE), a tablet-based digital cognitive assessment, has been used for the detection of cognitive impairment, but its application in primary care has not yet been studied.
Objective: This study aimed to explore the integration of CCE implementation in a primary care setting.
JMIR Med Inform
March 2025
Center for General Practice at Aalborg University, Department of Clinical Medicine, Aalborg University, Selma Lagerløfs vej 249, Aalborg, 9260 Gistrup, Denmark, 45 29807944.
Background: Artificial intelligence (AI) has been deemed revolutionary in medicine; however, no AI tools have been implemented or validated in Danish general practice. General practice in Denmark is highly digitized, offering an excellent foundation for developing and using AI. Nevertheless, general practitioners (GPs) are rarely involved in developing AI.
Am Surg
March 2025
Department of Surgery, Sapienza University of Rome, Rome, Italy.
Background: Large language models (LLMs) are advanced tools capable of understanding and generating human-like text. This study evaluated the accuracy of several commercial LLMs in addressing clinical questions related to the diagnosis and management of acute cholecystitis, as outlined in the Tokyo Guidelines 2018 (TG18). We assessed their congruence with the expert panel discussions presented in the guidelines.
Psychol Res
March 2025
School of Education, Guangzhou University, Guangzhou, 510006, People's Republic of China.
Cognitive offloading refers to the use of external tools to assist in memory processes. This study investigates the effects of item difficulty and value on cognitive offloading during a word-pair learning task, comparing children and young adults in a context where both cues coexist. In Experiment 1, we examined the impact of difficulty and value cues on cognitive offloading using a 2 (difficulty: easy vs.
Background: Fibromyalgia syndrome (FMS) is a chronic condition causing widespread pain, fatigue, and sleep disturbances. Conventional treatments often provide limited relief, leading to growing interest in complementary therapies like ozone therapy.
Objective: This study aims to retrospectively evaluate the short- and medium-term efficacy of ozone therapy in patients with FMS, focusing on changes in pain, functional status, sleep quality, fatigue, anxiety, and depression.