Purpose: In late 2022 and early 2023, reports that ChatGPT could pass the United States Medical Licensing Examination (USMLE) generated considerable excitement, and media coverage suggested that ChatGPT has credible medical knowledge. This report analyzes the extent to which an artificial intelligence (AI) agent's performance on these sample items can generalize to performance on an actual USMLE examination, using ChatGPT as an illustration.
Method: As with earlier investigations, analyses were based on publicly available USMLE sample items.
Recent advances in automated scoring technology have made it practical to replace multiple-choice questions (MCQs) with short-answer questions (SAQs) in large-scale, high-stakes assessments. However, most previous research comparing these formats has used small examinee samples testing under low-stakes conditions. Additionally, previous studies have not reported on the time required to respond to the two item types.
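As a purely illustrative aside, a minimal sketch of rule-based short-answer scoring of the kind such systems automate is shown below; the normalization rules, answer key, and responses are hypothetical assumptions, not the scoring technology described in the study.

```python
# Minimal sketch of rule-based short-answer scoring (hypothetical keys and
# responses; real automated scoring systems are far more sophisticated).
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", " ", text.lower())).strip()

def score_saq(response: str, acceptable_answers: list[str]) -> int:
    """Return 1 if the normalized response matches any acceptable answer."""
    resp = normalize(response)
    return int(any(normalize(key) == resp for key in acceptable_answers))

# Usage: a hypothetical item with two acceptable phrasings.
key = ["myocardial infarction", "heart attack"]
print(score_saq("Myocardial infarction.", key))  # 1
print(score_saq("pulmonary embolism", key))      # 0
```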
Understanding the response process used by test takers when responding to multiple-choice questions (MCQs) is particularly important in evaluating the validity of score interpretations. Previous authors have recommended eye-tracking technology as a useful approach for collecting data on the processes test takers use to respond to test questions. This study proposes a new method for evaluating alternative score interpretations by using eye-tracking data and machine learning.
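To make the general idea concrete, here is a minimal sketch of classifying hypothesized response processes from eye-tracking features; the data are synthetic and the feature names are illustrative assumptions, not the study's actual method.

```python
# Sketch: classifying hypothesized response processes from eye-tracking
# features (synthetic data; features and labels are illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
# Illustrative features: dwell time on the stem, dwell time on the options,
# and number of gaze transitions between stem and options.
X = rng.normal(size=(n, 3))
# Hypothetical labels: 1 = "reasoned from the stem", 0 = "cued by options".
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = LogisticRegression()
print(cross_val_score(clf, X, y, cv=5).mean())  # accuracy well above chance
```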
One of the most challenging aspects of writing multiple-choice test questions is identifying plausible incorrect response options, that is, distractors.
Background: Examinees often believe that changing answers will lower their scores; however, empirical studies suggest that allowing examinees to change responses may improve their performance in classroom assessments. To date, no studies have been able to examine answer changes during large-scale professional credentialing or licensing examinations.
Methods: In this study, we expand the research on answer changes by analyzing responses from 27,830 examinees who completed the Step 2 Clinical Knowledge (CK) examination between August 2015 and August 2016.
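The core tally in answer-change research is a wrong-to-right versus right-to-wrong count; a minimal sketch follows, assuming a simplified response-log format that is an illustrative invention, not the examination's actual data structure.

```python
# Sketch of a wrong-to-right vs. right-to-wrong tally from response logs
# (the log format here is a simplifying assumption).
from collections import Counter

# Each record: (initial_correct, final_correct) for one changed answer.
changes = [(0, 1), (0, 1), (1, 0), (0, 0), (0, 1)]

tally = Counter(changes)
print("wrong-to-right:", tally[(0, 1)])  # 3
print("right-to-wrong:", tally[(1, 0)])  # 1
print("wrong-to-wrong:", tally[(0, 0)])  # 1
```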
Background: Increased recognition of the importance of competency-based education and assessment has led to the need for practical and reliable methods to assess relevant skills in the workplace.
Methods: A novel milestone-based workplace assessment system was implemented in 15 pediatrics residency programs. The system provided: (1) web-based multisource feedback (MSF) and structured clinical observation (SCO) instruments that could be completed on any computer or mobile device; and (2) monthly feedback reports that included competency-level scores and recommendations for improvement.
Purpose: Physicians must pass the United States Medical Licensing Examination (USMLE) to obtain an unrestricted license to practice allopathic medicine in the United States. Little is known, however, about how well USMLE performance relates to physician behavior in practice, particularly conduct inconsistent with safe, effective patient care. The authors examined the extent to which USMLE scores relate to the odds of receiving a disciplinary action from a U.S. state medical board.
Purpose: To add to the small body of validity research addressing whether scores from performance assessments of clinical skills are related to performance in supervised patient settings, the authors examined relationships between United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS) data gathering and data interpretation scores and subsequent performance in history taking and physical examination in internal medicine residency training.
Method: The sample included 6,306 examinees from 238 internal medicine residency programs who completed Step 2 CS for the first time in 2005 and whose performance ratings from their first year of residency training were available. Hierarchical linear modeling techniques were used to examine the relationships among Step 2 CS data gathering and data interpretation scores and history-taking and physical examination ratings.
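For readers unfamiliar with hierarchical linear modeling, a minimal two-level sketch (examinees nested in residency programs) is given below using synthetic data; the variable names and effect sizes are illustrative assumptions, not the study's actual model.

```python
# Sketch of a two-level model (examinees nested in residency programs)
# using statsmodels MixedLM; data and variable names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_programs, n_per = 40, 25
program = np.repeat(np.arange(n_programs), n_per)
program_effect = rng.normal(scale=0.3, size=n_programs)[program]
cs_score = rng.normal(size=n_programs * n_per)   # e.g., data gathering score
rating = 0.25 * cs_score + program_effect + rng.normal(size=n_programs * n_per)

df = pd.DataFrame({"rating": rating, "cs": cs_score, "program": program})
result = smf.mixedlm("rating ~ cs", df, groups=df["program"]).fit()
print(result.summary())  # the fixed effect for cs recovers the 0.25 slope
```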
Purpose: This study extends available evidence about the relationship between scores on the Step 2 Clinical Skills (CS) component of the United States Medical Licensing Examination and subsequent performance in residency. It focuses on the relationship between Step 2 CS communication and interpersonal skills scores and communication skills ratings that residency directors assign to residents in their first postgraduate year of internal medicine training. It represents the first large-scale evaluation of the extent to which Step 2 CS communication and interpersonal skills scores can be extrapolated to examinee performance in supervised practice.
Purpose: This research examined the credibility of the cut scores used to make pass/fail decisions on United States Medical Licensing Examination (USMLE) Step 1, Step 2 Clinical Knowledge, and Step 3.
Method: Approximately 15,000 members of nine constituency groups were asked their opinions about (1) current initial and ultimate fail rates and (2) the highest acceptable, lowest acceptable, and optimal initial and ultimate fail rates.
Results: Initial fail rates were generally viewed as appropriate; more variability was associated with ultimate fail rates.
In recent years, demand for performance assessments has continued to grow. However, performance assessments are notorious for low reliability, in particular low reliability resulting from task specificity. Because reliability analyses typically treat the performance tasks as randomly sampled from an infinite universe of tasks, these estimates of reliability may not be accurate.
The use of standardized patients to assess communication skills is now an essential part of assessing a physician's readiness for practice. To improve the reliability of communication scores, it has become increasingly common in recent years to use statistical models to adjust ratings provided by standardized patients. This study employed ordinary least squares regression to adjust ratings, and then used generalizability theory to evaluate the impact of these adjustments on score reliability and the overall standard error of measurement.
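A minimal sketch of an OLS rater-severity adjustment of the general kind described is given below; the data, the fixed-effects design, and the adjustment rule are illustrative assumptions rather than the study's operational procedure.

```python
# Sketch of an OLS rater-severity adjustment (synthetic, illustrative data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_raters, n_enc = 10, 80
rater = np.repeat(np.arange(n_raters), n_enc)
severity = rng.normal(scale=0.4, size=n_raters)    # rater leniency/severity
skill = rng.normal(size=n_raters * n_enc)          # examinee communication skill
rating = skill + severity[rater] + rng.normal(scale=0.3, size=n_raters * n_enc)

df = pd.DataFrame({"rating": rating, "rater": rater})
fit = smf.ols("rating ~ C(rater)", df).fit()
# Adjusted rating: remove each rater's estimated severity (residual plus
# grand mean), putting all ratings on a common scale.
df["adjusted"] = fit.resid + df["rating"].mean()
print(df.groupby("rater")["adjusted"].mean().round(2))  # near-constant means
```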
Background: Documentation is a subcomponent of the Step 2 Clinical Skills Examination Integrated Clinical Encounter (ICE) component, wherein licensed physicians rate examinees on their ability to communicate the findings of the patient encounter, diagnostic impression, and initial patient work-up. The main purpose of this research was to examine the impact of modifications to the scoring rubric and rater training protocol on the psychometric characteristics of the documentation scores.
Method: Following the modifications, the variance structure of the ICE components was modeled using multivariate generalizability theory.
Background: In clinical skills, closely related skills are often combined to form a composite score. For example, history-taking and physical examination scores are typically combined. Interestingly, there is relatively little research to support this practice.
Background: Previous research has shown that ratings of English proficiency on the United States Medical Licensing Examination Clinical Skills Examination are highly reliable. However, the score distributions for native and nonnative speakers of English are sufficiently different to suggest that reliability should be investigated separately for each group.
Method: Generalizability theory was used to obtain reliability indices separately for native and nonnative speakers of English (N = 29,084).
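To illustrate the generalizability-theory machinery, the sketch below computes a relative G coefficient for a fully crossed persons-by-raters design, separately by group, on synthetic data; the design and numbers are assumptions, not the study's analysis.

```python
# Sketch of a persons-by-raters generalizability analysis run separately
# by language group (synthetic data; the crossed design is illustrative).
import numpy as np

def g_coefficient(scores: np.ndarray) -> float:
    """Relative G coefficient for a crossed persons x raters design."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    ss_p = n_r * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_r = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_e = ((scores - grand) ** 2).sum() - ss_p - ss_r
    ms_p = ss_p / (n_p - 1)
    ms_e = ss_e / ((n_p - 1) * (n_r - 1))
    var_p = max((ms_p - ms_e) / n_r, 0.0)   # person variance component
    return var_p / (var_p + ms_e / n_r)

rng = np.random.default_rng(3)
for group in ("native", "nonnative"):
    person = rng.normal(size=(300, 1))                 # person effects
    scores = person + rng.normal(scale=0.5, size=(300, 4))
    print(group, round(g_coefficient(scores), 2))      # roughly 0.9 each
```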
Background: The 2000 Institute of Medicine report on patient safety brought renewed attention to the issue of preventable medical errors, and subsequently specialty boards and the National Board of Medical Examiners were encouraged to play a role in setting expectations around safety education. This paper examines potentially dangerous actions taken by examinees during the portion of the United States Medical Licensing Examination Step 3 that is particularly well suited to evaluating lapses in physician decision making, the Computer-based Case Simulation (CCS).
Method: Descriptive statistics and a general linear modeling approach were used to analyze dangerous actions ordered by 25,283 examinees who completed CCS for the first time between November 2006 and January 2008.
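As one hedged illustration of modeling counts of dangerous actions, the sketch below uses a Poisson regression on synthetic data; this is a stand-in for, not a reproduction of, the paper's general linear modeling approach, and the predictor is an invented assumption.

```python
# Sketch: regressing counts of dangerous actions on a prior score
# (synthetic data; Poisson regression substituted for illustration).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
prior_score = rng.normal(size=n)            # standardized prior exam score
rate = np.exp(-1.0 - 0.4 * prior_score)     # fewer lapses at higher scores
dangerous_actions = rng.poisson(rate)

X = sm.add_constant(prior_score)
fit = sm.GLM(dangerous_actions, X, family=sm.families.Poisson()).fit()
print(fit.params.round(2))  # slope near -0.4
```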
During the last decade, interest in assessing professionalism in medical education has increased exponentially and has led to the development of many new assessment tools. Efforts to validate the scores produced by tools designed to assess professionalism have lagged well behind the development of these tools. This paper provides a structured framework for collecting evidence to support the validity of assessments of professionalism.
Medical professionalism is increasingly recognized as a core competence of medical trainees and practitioners. Although the general and specific domains of professionalism are thoroughly characterized, procedures for assessing them are not well-developed. This article outlines an approach to designing and implementing an assessment program for medical professionalism that begins and ends with asking and answering a series of critical questions about the purpose and nature of the program.
To obtain a full and unrestricted license to practice medicine in the United States, students and graduates of MD-granting US medical schools and of medical schools located outside the United States must take and pass the United States Medical Licensing Examination (USMLE). The USMLE began as a series of paper-and-pencil examinations in the early 1990s and converted to computer delivery in 1999. With this change to the computerized format came the opportunity to introduce computer-simulated patients, which had been under development at the National Board of Medical Examiners for a number of years.
Background: This study investigated whether participants' subjective reports of how they assigned ratings on a multisource feedback instrument provide evidence to support interpreting the resulting scores as objective, accurate measures of professional behavior.
Method: Twenty-six participants completed think-aloud interviews while rating students, residents, or faculty members they had worked with previously. The items rated included 15 behavioral items and one global item.
Background: Checklist scores used to produce the data gathering score on the Step 2 CS examination are currently weighted using an algorithm based on expert judgment about the importance of each item. The present research was designed to compare this approach with alternative weighting strategies.
Method: Scores from 21,140 examinees who took the United States Medical Licensing Examination Step 2 between May 2006 and February 2007 were subjected to five weighting models: (1) a regression weights model, (2) a factor loading weights model, (3) a standardized response model, (4) an equal weights model, and (5) the operational expert-judgment weights model.
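A minimal sketch comparing weighting schemes follows; it illustrates only three of the five models on synthetic data, approximates factor-loading weights with item-total correlations, and should be read as an assumption-laden illustration rather than the study's procedure.

```python
# Sketch comparing checklist weighting schemes (synthetic data; only three
# of the five models are illustrated, and the loading weights are
# approximated by item-total correlations).
import numpy as np

rng = np.random.default_rng(5)
n, k = 500, 10
ability = rng.normal(size=n)
difficulty = rng.normal(size=k)
p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))
items = (rng.random((n, k)) < p_correct).astype(float)
criterion = ability + rng.normal(scale=0.5, size=n)   # external criterion

equal = items.mean(axis=1)                            # equal weights
reg_w = np.linalg.lstsq(items, criterion, rcond=None)[0]
regression = items @ reg_w                            # regression weights
load_w = np.array([np.corrcoef(items[:, j], items.sum(axis=1))[0, 1]
                   for j in range(k)])
loading = items @ load_w                              # loading-style weights

for name, comp in [("equal", equal), ("regression", regression),
                   ("loading", loading)]:
    print(name, round(np.corrcoef(comp, criterion)[0, 1], 2))
```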
Background: This study examined the effects of (1) examinee gender on United States Medical Licensing Examination (USMLE) Step 1 performance, (2) examinee gender on the relationships between prematriculation measures and Step 1 performance, and (3) medical school characteristics on the relationships between examinee characteristics and Step 1 performance.
Method: A series of hierarchical linear models (examinees-nested-in-schools) was used to predict Step 1 scores. The sample included 66,412 examinees from 133 U.S. medical schools.
Background: As with any examination using human raters, it is possible that human subjectivity may introduce measurement error. An examinee's performance might be scored differently on the basis of the quality of the preceding performance(s) (contrast effects). This research investigated the presence of contrast effects, within and across test sessions, for the communication and interpersonal skills component of the United States Medical Licensing Examination Step 2 Clinical Skills (CS) examination.
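One simple way to probe for a contrast effect is to regress each rating on the quality of the preceding performance; the sketch below does this on synthetic data, and the lagged design and effect size are illustrative assumptions, not the study's analysis.

```python
# Sketch of a contrast-effect check: regress each rating on the quality of
# the preceding performance (synthetic data, illustrative design).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 2000
quality = rng.normal(size=n)                  # true performance quality
prev_quality = np.roll(quality, 1)
prev_quality[0] = 0.0
# A negative coefficient on prev_quality would indicate a contrast effect.
rating = quality - 0.15 * prev_quality + rng.normal(scale=0.5, size=n)

X = sm.add_constant(np.column_stack([quality, prev_quality]))
print(sm.OLS(rating, X).fit().params.round(2))  # approx [0, 1, -0.15]
```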
Background: This research examined various sources of measurement error in the documentation score component of the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills examination.
Method: A generalizability theory framework was employed to examine the documentation ratings for 847 examinees who completed the USMLE Step 2 Clinical Skills examination during an eight-day period in 2006. Each patient note was scored by two different raters, allowing for a persons-crossed-with-raters-nested-in-cases design.