Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics.

Hossein Mohammad-Rahimi, Seyed AmirHossein Ourang, Mohamad Amin Pourhoseingholi, Omid Dianat, Paul Michael Howell Dummer, Ali Nosrat

Int Endod J

Division of Endodontics, Department of Advanced Oral Sciences and Therapeutics, School of Dentistry, University of Maryland, Baltimore, Maryland, USA.

Published: March 2024

Aim: This study aimed to evaluate and compare the validity and reliability of responses provided by GPT-3.5, Google Bard, and Bing to frequently asked questions (FAQs) in the field of endodontics.

Methodology: FAQs were formulated by expert endodontists (n = 10) and collected through GPT-3.5 queries (n = 10), and every question was posed to each chatbot three times. The resulting responses (N = 180; 20 questions × 3 chatbots × 3 repetitions) were independently evaluated by two board-certified endodontists using a modified Global Quality Score (GQS) on a 5-point Likert scale (5: strongly agree; 4: agree; 3: neutral; 2: disagree; 1: strongly disagree). Disagreements on scoring were resolved through evidence-based discussions. The validity of responses was analysed by categorizing scores as valid or invalid at two thresholds: the low threshold was set at a score ≥4 for all three responses, whilst the high threshold was set at a score of 5 for all three responses. Fisher's exact test was used to compare the validity of responses between chatbots. Cronbach's alpha was calculated to assess reliability, measured as the consistency of repeated responses for each chatbot.
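The scoring scheme described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the score matrix is invented purely for illustration, the function names are hypothetical, and the sketch assumes that Cronbach's alpha treats the three repeated responses as "items" and the questions as cases, which the abstract does not spell out.

```python
from statistics import variance

# HYPOTHETICAL scores for one chatbot: one row per question,
# one column per repeated response (1-5 modified GQS scale).
# These numbers are made up for illustration, not study data.
example_scores = [
    [5, 5, 5],
    [4, 5, 4],
    [3, 3, 4],
    [5, 4, 5],
    [2, 3, 2],
]

def is_valid(repeats, threshold):
    """A question counts as valid only if ALL repeats reach the threshold."""
    return all(score >= threshold for score in repeats)

def validity_rate(score_rows, threshold):
    """Fraction of questions whose repeated responses are all valid."""
    valid = sum(is_valid(row, threshold) for row in score_rows)
    return valid / len(score_rows)

def cronbach_alpha(score_rows):
    """Cronbach's alpha, with repeats as items and questions as cases.

    Assumes at least two repeats and non-zero variance in the row totals.
    """
    k = len(score_rows[0])
    item_variances = [variance(column) for column in zip(*score_rows)]
    total_variance = variance([sum(row) for row in score_rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

low_rate = validity_rate(example_scores, threshold=4)   # low threshold: all scores >= 4
high_rate = validity_rate(example_scores, threshold=5)  # high threshold: all scores == 5
alpha = cronbach_alpha(example_scores)

print(f"low-threshold validity:  {low_rate:.0%}")
print(f"high-threshold validity: {high_rate:.0%}")
print(f"Cronbach's alpha:        {alpha:.3f}")
```

Because a single low-scoring repeat invalidates the whole question, the high threshold is necessarily at least as strict as the low one, which is why the high-threshold rate can never exceed the low-threshold rate.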

Results: All three chatbots provided answers to all questions. Under the low-threshold validity test (GPT-3.5: 95%; Google Bard: 85%; Bing: 75%), there was no significant difference between the platforms (p > .05). Under the high-threshold validity test, the proportion of valid responses was substantially lower (GPT-3.5: 60%; Google Bard: 15%; Bing: 15%), and the validity of GPT-3.5 responses was significantly higher than that of Google Bard and Bing (p = .008). All three chatbots achieved an acceptable level of reliability (Cronbach's alpha >0.7).
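To make the high-threshold comparison concrete, the sketch below runs a two-sided Fisher's exact test in plain Python. It rests on assumptions not stated in the abstract: that each chatbot was judged on 20 questions (10 expert plus 10 GPT-3.5-generated), so that 60% and 15% correspond to 12 and 3 valid questions, and that the comparison was made pairwise. The function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table (with the same
    margins) that is no more likely than the observed one. Integer
    weights are compared so that ties are handled exactly.
    """
    row1 = a + b          # size of the first group
    col1 = a + c          # total number of "valid" outcomes
    n = a + b + c + d     # grand total

    def weight(x):
        # Unnormalised hypergeometric probability of x valid in group 1.
        return comb(col1, x) * comb(n - col1, row1 - x)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    observed = weight(a)
    extreme = sum(
        weight(x)
        for x in range(lowest, highest + 1)
        if weight(x) <= observed
    )
    return extreme / comb(n, row1)

# ASSUMED counts out of 20 questions per chatbot (high threshold):
# GPT-3.5: 12 valid, 8 invalid; Google Bard: 3 valid, 17 invalid.
p_gpt_vs_bard = fisher_exact_two_sided(12, 8, 3, 17)
print(f"GPT-3.5 vs Google Bard: p = {p_gpt_vs_bard:.4f}")
```

Under these assumed counts the pairwise p-value comes out at roughly 0.008, which is in line with the reported figure; since Bing has the same assumed counts as Google Bard, the GPT-3.5 versus Bing comparison gives the same value. Whether the authors used pairwise tests or a single test across all three chatbots is not stated in the abstract.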

Conclusions: GPT-3.5 provided more credible information on topics related to endodontics compared to Google Bard and Bing.

DOI: http://dx.doi.org/10.1111/iej.14014
