Objectives: ChatGPT is an artificial intelligence model that can interpret free-text prompts and return detailed, human-like responses across a wide domain of subjects. This study evaluated the extent of the threat posed by ChatGPT to the validity of short-answer assessment problems used to examine pre-clerkship medical students in our undergraduate medical education program.

Methods: Forty problems used in prior student assessments were retrieved and stratified by level of Bloom's Taxonomy. Thirty of these problems were submitted to ChatGPT-3.5. For the remaining 10 problems, we retrieved past minimally passing student responses. Six tutors graded each of the 40 responses. Performance of student-generated and ChatGPT-generated answers, aggregated overall and grouped by Bloom's level of cognitive reasoning, was compared using t-tests, ANOVA, Cronbach's alpha, and Cohen's d. Scores for ChatGPT-generated responses were also compared to historical class average performance.
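The group comparison described above can be sketched as follows. This is a minimal illustration of a two-sample (Welch's) t statistic and Cohen's d, using the Python standard library; the score lists are hypothetical placeholders, not the study's data, and the study's exact analysis pipeline is not specified in the abstract.

```python
# Illustrative sketch of the Methods' score comparison: Welch's t statistic
# and Cohen's d. Score lists are hypothetical, not actual study data.
import math
import statistics

chatgpt_scores = [3.5, 3.0, 4.0, 2.5, 3.5, 3.0]  # hypothetical tutor scores out of 5
student_scores = [2.5, 2.0, 3.0, 2.0, 2.5, 2.5]  # hypothetical minimally passing responses

def welch_t(a, b):
    """Welch's t statistic (does not assume equal variances)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

def cohens_d(a, b):
    """Cohen's d effect size using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = math.sqrt(((na - 1) * statistics.variance(a) +
                           (nb - 1) * statistics.variance(b)) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

t = welch_t(chatgpt_scores, student_scores)
d = cohens_d(chatgpt_scores, student_scores)
```

A positive t and d here would indicate the first group (ChatGPT) scored higher, matching the direction of the reported result; significance testing would additionally require degrees of freedom and a p-value (e.g. via `scipy.stats.ttest_ind` with `equal_var=False`).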

Results: ChatGPT-generated responses received a mean score of 3.29 out of 5 (n = 30, 95% CI 2.93-3.65), compared with 2.38 for a group of students meeting minimum passing marks (n = 10, 95% CI 1.94-2.82), representing higher performance (P = .008, η² = 0.169). However, ChatGPT was outperformed by historical class average scores on the same 30 problems (mean 3.67, P = .018) when all past responses were included regardless of student performance level. There was no statistically significant trend in performance across domains of Bloom's Taxonomy.

Conclusion: While ChatGPT was able to pass short-answer assessment problems spanning the pre-clerkship curriculum, it outperformed only underperforming students. We note that in several cases tutors were convinced that ChatGPT-generated responses had been written by students. Risks to assessment validity include uncertainty in identifying struggling students and an inability to intervene in a timely manner. ChatGPT's performance on problems with increasing demands on cognitive reasoning warrants further research.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10540597
DOI: http://dx.doi.org/10.1177/23821205231204178

