Background: Artificial intelligence chatbot tools might discern patterns and correlations that elude human observation, leading to more accurate and timely interventions. However, their reliability in answering healthcare-related questions is still debated. This study aimed to assess the performance of three GPT-based chatbots on questions about prosthetic joint infections (PJI).

Methods: Thirty questions concerning the diagnosis and treatment of hip and knee PJIs, stratified by a priori established difficulty, were generated by a team of experts, and administered to ChatGPT 3.5, BingChat, and ChatGPT 4.0. Responses were rated by three orthopedic surgeons and two infectious diseases physicians using a five-point Likert-like scale with numerical values to quantify the quality of responses. Inter-rater reliability was assessed by interclass correlation statistics.
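
The abstract does not say which correlation statistic was computed. As an illustration only, the sketch below assumes a two-way random-effects, single-rater intraclass correlation, ICC(2,1), which is one common way to quantify agreement when the same raters score every response. The function name `icc_2_1` and the ratings matrix are hypothetical and are not taken from the study.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """Two-way random-effects, single-rater ICC (Shrout-Fleiss ICC(2,1)).

    ratings[i][j] is the score that rater j gave to response i.
    This is an assumed statistic, not necessarily the one used in the study.
    """
    n = len(ratings)       # number of rated responses (rows)
    k = len(ratings[0])    # number of raters (columns)
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete n x k matrix with n >= 2 and k >= 2")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: responses, raters, and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denom == 0:
        raise ValueError("ratings have no variance; ICC is undefined")
    return (ms_rows - ms_err) / denom


# Hypothetical 1-5 Likert ratings: 4 responses (rows) x 5 raters (columns).
example = [
    [5, 4, 5, 4, 5],
    [3, 4, 2, 4, 3],
    [4, 4, 5, 3, 4],
    [2, 3, 2, 3, 1],
]
print(f"ICC(2,1) = {icc_2_1(example):.2f}")
```

Values near 1 indicate that raters rank and score responses consistently; values near 0 indicate that most of the variance comes from rater disagreement rather than from differences between responses.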

Results: Responses averaged "good-to-very good" for all chatbots examined, both in diagnosis and treatment, with no significant differences according to the difficulty of the questions. However, BingChat ratings were significantly lower in the treatment setting (p = 0.025), particularly in terms of accuracy (p = 0.02) and completeness (p = 0.004). Agreement in ratings among examiners appeared to be very poor.

Conclusions: On average, experts rated the quality of responses positively, but individual ratings frequently varied widely. This suggests that AI chatbot tools are currently still unreliable in the management of PJI.

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11582126
http://dx.doi.org/10.1007/s12306-024-00846-w

Similar Publications

Generative large language models (LLMs) like ChatGPT can quickly produce informative essays on various topics. However, the information generated cannot be fully trusted as artificial intelligence (AI) can make factual mistakes. This poses challenges for using such tools in college classrooms.

Despite extensive studies on large language models and their capability to respond to questions from various licensed exams, there has been limited focus on employing chatbots for specific subjects within the medical curriculum, specifically medical neuroscience. This research compared the performances of Claude 3.5 Sonnet (Anthropic), GPT-3.

AI contextual information shapes moral and aesthetic judgments of AI-generated visual art.

Cognition

January 2025

Social Brain Sciences Group, Department of Humanities, Social and Political Sciences, ETH Zurich, Zurich, Switzerland.

Throughout history, art creation has been regarded as a uniquely human means to express original ideas, emotions, and experiences. However, as Generative Artificial Intelligence reshapes visual, aesthetic, legal, and economic culture, critical questions arise about the moral and aesthetic implications of AI-generated art. Despite the growing use of AI tools in art, the moral impact of AI involvement in the art creation process remains underexplored.

Assessing the Current Limitations of Large Language Models in Advancing Health Care Education.

JMIR Form Res

January 2025

Department of Physician Assistant Studies, Massachusetts College of Pharmacy and Health Sciences, 179 Longwood Avenue, Boston, MA, 02115, United States, 1 6177322961.

The integration of large language models (LLMs), such as the generative pretrained transformer series, into health care education and clinical management holds transformative potential. The practical use of current LLMs in health care sparks great anticipation for new avenues, yet their adoption also raises considerable concerns that call for careful deliberation. This study aims to evaluate the application of state-of-the-art LLMs in health care education, highlighting the following shortcomings as areas requiring significant and urgent improvement: (1) threats to academic integrity, (2) dissemination of misinformation and risks of automation bias, (3) challenges with information completeness and consistency, (4) inequity of access, (5) risks of algorithmic bias, (6) exhibition of moral instability, (7) technological limitations in plugin tools, and (8) lack of regulatory oversight in addressing legal and ethical challenges.

It is increasingly important for patients to have easy access to information about their medical conditions, so that they can better understand and participate in health care decisions. Artificial Intelligence (AI) has proven to be a fast, efficient, and effective tool for educating patients about their health conditions. The aim of the study is to compare the responses provided by two AI tools, ChatGPT and Google Gemini, and to assess the conciseness and understandability of the information they provide on deep vein thrombosis, decubitus ulcers, and hemorrhoids.
