Introduction: Assessments in medical education play a central role in evaluating trainees' progress and eventual competence. Generative artificial intelligence is finding an increasing role in clinical care and medical education. The objective of this study was to evaluate the ability of the large language model ChatGPT to generate examination questions that are discriminating in the evaluation of graduating urology residents.

Methods: Graduating urology residents representing all Canadian training programs gather yearly for a mock examination that simulates their upcoming board certification examination. The examination consists of a written multiple-choice question (MCQ) component and an oral objective structured clinical examination. In 2023, ChatGPT Version 4 was used to generate 20 MCQs that were added to the written component. ChatGPT was asked to use Campbell-Walsh Urology, AUA guidelines, and Canadian Urological Association guidelines as resources. Psychometric analysis of the ChatGPT MCQs was conducted. The MCQs were also reviewed by 3 faculty members for face validity and to determine whether they were drawn from a valid source.
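The abstract does not report the exact prompt or interface the authors used; as a hedged illustration only, the sketch below shows how comparable board-style MCQs could be requested programmatically from a chat-completions LLM API. The prompt wording, model name, and output handling are assumptions, not the study's protocol.

```python
# Illustrative only: a hedged sketch of requesting board-style urology MCQs
# from an LLM chat API. The prompt text, model name, and output format are
# assumptions; the study's actual prompting procedure is not reported here.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

prompt = (
    "Write 20 single-best-answer multiple-choice questions suitable for a "
    "graduating urology resident examination. Base each question on "
    "Campbell-Walsh Urology and on AUA or Canadian Urological Association "
    "guidelines, and give one correct answer with four distractors."
)

response = client.chat.completions.create(
    model="gpt-4",  # the study describes using ChatGPT Version 4
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```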

Results: The mean score of the 35 examination takers on the ChatGPT MCQs was 60.7% vs 61.1% for the overall examination. Twenty-five percent of ChatGPT MCQs showed a discriminating index >0.3, the threshold for questions that properly discriminate between high and low examination performers. Twenty-five percent of ChatGPT MCQs showed a point biserial >0.2, which is considered a high correlation with overall performance on the examination. The faculty assessment found that ChatGPT MCQs often provided incomplete information in the stem, offered multiple potentially correct answers, and were sometimes not rooted in the literature. Thirty-five percent of the MCQs generated by ChatGPT provided a wrong answer to the stem.
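For readers unfamiliar with the item statistics cited above, the following is a minimal sketch (not the authors' analysis code) of how a discrimination index and a point-biserial correlation can be computed from a 0/1 item-response matrix; the variable names and the 27% upper/lower grouping rule are illustrative assumptions.

```python
# Hypothetical item-analysis sketch: rows = examinees, columns = MCQ items,
# entries are 1 (correct) or 0 (incorrect).
import numpy as np

def discrimination_index(responses: np.ndarray, item: int, frac: float = 0.27) -> float:
    """Upper-group minus lower-group proportion correct for one item.

    Examinees are ranked by total score; the top and bottom `frac` form the
    comparison groups. Values above ~0.3 are commonly read as discriminating.
    """
    totals = responses.sum(axis=1)
    order = np.argsort(totals)
    n = max(1, int(round(frac * len(totals))))
    lower, upper = order[:n], order[-n:]
    return float(responses[upper, item].mean() - responses[lower, item].mean())

def point_biserial(responses: np.ndarray, item: int) -> float:
    """Correlation between one item (0/1) and the total score on the
    remaining items (a corrected item-total correlation)."""
    item_scores = responses[:, item]
    rest = responses.sum(axis=1) - item_scores  # exclude the item itself
    return float(np.corrcoef(item_scores, rest)[0, 1])

# Example with simulated data: 35 examinees x 20 items, mirroring the study size.
rng = np.random.default_rng(0)
sim = (rng.random((35, 20)) < 0.6).astype(int)
print(discrimination_index(sim, item=0), point_biserial(sim, item=0))
```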

Conclusions: Despite apparently similar performance on the ChatGPT MCQs and the overall examination, ChatGPT MCQs tend not to be highly discriminating. Poorly phrased questions and the potential for artificial intelligence hallucinations remain persistent concerns. ChatGPT-generated questions should be carefully vetted for quality before they are used on assessments in urology training examinations.


Source
http://dx.doi.org/10.1097/JU.0000000000004357

Publication Analysis

Top Keywords

chatgpt mcqs (28); artificial intelligence (12); chatgpt (12); examination (11); mcqs (10); medical education (8); graduating urology (8); intelligence discriminator (4); discriminator competence (4); competence urological (4)

Similar Publications

Artificial intelligence (AI) is becoming increasingly influential in ophthalmology, particularly through advancements in machine learning, deep learning, robotics, neural networks, and natural language processing (NLP). Among these, NLP-based chatbots are the most readily accessible and are driven by AI-based large language models (LLMs). These chatbots have facilitated new research avenues and have gained traction in both clinical and surgical applications in ophthalmology.

Article Synopsis
  • The study evaluates the effectiveness of AI chatbots ChatGPT and Bard in answering multiple choice questions (MCQs) related to Intermediate Life Support and managing cardiac arrest.
  • Both chatbots had similar performances, with Bard slightly outperforming ChatGPT, although the difference wasn't statistically significant.
  • The explanations given by both chatbots, while not always correct, still contained useful information, highlighting their potential value in medical education.

The article "ChatGPT Efficacy for Answering Musculoskeletal Anatomy Questions: A Study Evaluating Quality and Consistency between Raters and Timepoints" assesses the performance of ChatGPT 3.5 in answering musculoskeletal anatomy questions, highlighting variability in response quality and reproducibility. We raise several points that may add further insights into the study's findings.


Background And Aim: Access to quality health care is essential, particularly in remote areas where the availability of healthcare professionals may be limited. The advancement of artificial intelligence (AI) and natural language processing (NLP) has led to the development of large language models (LLMs) that exhibit capabilities in understanding and generating human-like text. This study aimed to evaluate the performance of an LLM, ChatGPT, in addressing primary healthcare issues.


Background: Large language models (LLMs) are increasingly explored in healthcare and education. In medical education, they hold the potential to enhance learning by supporting personalized teaching, resource development, and student engagement. However, LLM use also raises concerns about ethics, accuracy, and reliance.
