Introduction

With the potential for artificial intelligence (AI) chatbots to serve as the primary source of glaucoma information for patients, it is essential to characterize the information that chatbots provide so that providers can tailor discussions, anticipate patient concerns, and identify misleading information. Therefore, the purpose of this study was to evaluate glaucoma information from AI chatbots, including ChatGPT-4, Bard, and Bing, by analyzing response accuracy, comprehensiveness, readability, word count, and character count in comparison to each other and to glaucoma-related American Academy of Ophthalmology (AAO) patient materials.

Methods

Section headers from AAO glaucoma-related patient education brochures were adapted into question form and posed five times to each AI chatbot (ChatGPT-4, Bard, and Bing). Two sets of responses from each chatbot were used to evaluate the accuracy of the AI chatbot responses and the AAO brochure information, as well as the comprehensiveness of the AI chatbot responses relative to the AAO brochure information; both were scored 1-5 by three independent glaucoma-trained ophthalmologists. Readability (assessed with the Flesch-Kincaid Grade Level (FKGL), which corresponds to United States school grade levels), word count, and character count were determined for all chatbot responses and AAO brochure sections.

Results

Accuracy scores for AAO, ChatGPT, Bing, and Bard were 4.84, 4.26, 4.53, and 3.53, respectively. On direct comparison, AAO was more accurate than ChatGPT (p=0.002), and Bard was the least accurate (Bard versus AAO, p<0.001; Bard versus ChatGPT, p<0.002; Bard versus Bing, p=0.001). ChatGPT had the most comprehensive responses (ChatGPT versus Bing, p<0.001; ChatGPT versus Bard, p=0.008), with comprehensiveness scores for ChatGPT, Bing, and Bard of 3.32, 2.16, and 2.79, respectively. AAO information and Bard responses were at the most accessible readability levels (AAO versus ChatGPT, AAO versus Bing, Bard versus ChatGPT, Bard versus Bing, all p<0.0001), with readability levels for AAO, ChatGPT, Bing, and Bard at 8.11, 13.01, 11.73, and 7.90, respectively. Bing responses had the lowest word and character counts.

Conclusion

AI chatbot responses varied in accuracy, comprehensiveness, and readability. With accuracy and comprehensiveness scores below those of the AAO brochures and elevated readability levels, AI chatbots require improvements before they can serve as a useful supplementary source of glaucoma information for patients. Physicians must be aware of these limitations so that they can ask patients about their existing knowledge and questions and then provide clarifying, comprehensive information.
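The FKGL used in the Methods is a fixed linear formula over average sentence length and average syllables per word: FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. The Python sketch below illustrates how such a score can be computed; the tokenization and syllable-counting heuristics are deliberate simplifications (a real analysis would typically rely on an established library such as textstat), and the sample text is illustrative only.

```python
import re

def count_syllables(word: str) -> int:
    # Rough estimate: count groups of consecutive vowels.
    # Dictionary-based syllabification is more accurate; this
    # heuristic is for illustration only.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

sample = "Glaucoma damages the optic nerve. Early treatment preserves vision."
print(f"FKGL estimate: {fkgl(sample):.1f}")  # approximate U.S. grade level
```

On this scale, the 8.11 reported for the AAO brochures corresponds roughly to an eighth-grade reading level, while the 13.01 reported for ChatGPT corresponds to college-level text.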

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11034394
DOI: http://dx.doi.org/10.7759/cureus.56766

Similar Publications

Objective: This study aims to evaluate the impact of prompt construction on the quality of artificial intelligence (AI) chatbot responses in the context of head and neck surgery.

Study Design: Observational and evaluative study.

Setting: An international collaboration involving 16 researchers from 11 European centers specializing in head and neck surgery.

This study evaluated the capability of three AI chatbots (ChatGPT 4.0, Claude 3.0, and Gemini Pro), as well as Google, in responding to common post-kidney transplantation inquiries.

Introduction: The use of generative artificial intelligence (AI) has begun to permeate most industries, including medicine, and patients will inevitably start using these large language model (LLM) chatbots as a modality for education. As healthcare information technology evolves, it is imperative to evaluate chatbots and the accuracy of the information they provide to patients and to determine if there is variability between them.

Objective: This study aimed to evaluate the accuracy and comprehensiveness of three chatbots in addressing questions related to endometriosis and determine the level of variability between them.

Background: Artificial intelligence (AI) social chatbots represent a major advancement in merging technology with mental health, offering benefits through natural and emotional communication. Unlike task-oriented chatbots, social chatbots build relationships and provide social support, which can positively impact mental health outcomes like loneliness and social anxiety. However, the specific effects and mechanisms through which these chatbots influence mental health remain underexplored.

The potential of large language models (LLMs) in medical applications is significant, and retrieval-augmented generation (RAG) can address the weaknesses of these models in terms of data transparency and scientific accuracy by incorporating current scientific knowledge into responses. In this study, RAG and GPT-4 by OpenAI were applied to develop GuideGPT, a context-aware chatbot integrated with a knowledge database of 449 scientific publications designed to provide answers on the prevention, diagnosis, and treatment of medication-related osteonecrosis of the jaw (MRONJ). A comparison was made with a generic LLM ("PureGPT") across 30 MRONJ-related questions.
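The excerpt above describes the general RAG pattern: rank a document collection for relevance to the question, then prepend the top passages to the prompt so the model answers from supplied sources. The sketch below is a minimal illustration of that pattern under stated assumptions, not the authors' GuideGPT implementation: the three-document knowledge base is a toy stand-in for their 449 publications, TF-IDF substitutes for the embedding-based retrieval such systems typically use, and call_llm is a hypothetical placeholder for a real chat-completion client.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge base standing in for a curated publication corpus.
docs = [
    "Antiresorptive agents such as bisphosphonates are associated with MRONJ.",
    "Conservative management of MRONJ includes antimicrobial rinses.",
    "Dental screening before antiresorptive therapy lowers MRONJ risk.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    # Rank documents by TF-IDF cosine similarity and return the top k.
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(docs + [question])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return [docs[i] for i in scores.argsort()[::-1][:k]]

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real chat-completion API call.
    return "(model response would appear here)"

def answer(question: str) -> str:
    # Ground the prompt in retrieved passages before calling the model.
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
    return call_llm(prompt)

print(answer("How can MRONJ risk be reduced before starting therapy?"))
```

Grounding answers in retrieved passages is what allows such a system to draw on current literature rather than parametric memory alone, which is the transparency and accuracy benefit the excerpt attributes to RAG.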
