Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study.

JMIR Med Educ

Unidad de Investigación Para la Generación y Síntesis de Evidencias en Salud, Vicerrectorado de Investigación, Universidad San Ignacio de Loyola, Lima, Peru.

Published: September 2023

Background: ChatGPT has shown impressive performance in national medical licensing examinations, such as the United States Medical Licensing Examination (USMLE), even passing it with expert-level performance. However, there is a lack of research on its performance in low-income countries' national licensing medical examinations. In Peru, where almost one out of three examinees fails the national licensing medical examination, ChatGPT has the potential to enhance medical education.

Objective: We aimed to assess the accuracy of ChatGPT using GPT-3.5 and GPT-4 on the Peruvian National Licensing Medical Examination (Examen Nacional de Medicina [ENAM]). Additionally, we sought to identify factors associated with incorrect answers provided by ChatGPT.

Methods: We used the ENAM 2022 data set, which consisted of 180 multiple-choice questions, to evaluate the performance of ChatGPT. Various prompts were used, and accuracy was evaluated. The performance of ChatGPT was compared to that of a sample of 1025 examinees. Factors such as question type, Peruvian-specific knowledge, discrimination, difficulty, quality of questions, and subject were analyzed to determine their influence on incorrect answers. Questions that received incorrect answers underwent a three-step process involving different prompts to explore the potential impact of adding roles and context on ChatGPT's accuracy.
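To make the evaluation protocol concrete, the sketch below shows one way such an accuracy run could be scripted. It is a minimal illustration, not the authors' code: the CSV file name, its column names, the prompt wording, and the use of the OpenAI chat-completions API are all assumptions.

```python
# Minimal sketch of an accuracy run over a multiple-choice exam, as described
# above. Hypothetical inputs: "enam_2022.csv" with columns "question",
# "options" (lettered choices as one string), and "answer" (keyed letter).
import csv

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(model: str, question: str, options: str) -> str:
    """Send one multiple-choice question and return the model's raw reply."""
    prompt = (
        "Answer the following multiple-choice question with the letter of "
        "the single best option.\n\n"
        f"{question}\n{options}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic replies make grading reproducible
    )
    return resp.choices[0].message.content.strip()


def accuracy(model: str, rows: list[dict]) -> float:
    """Fraction of questions whose reply starts with the keyed letter."""
    correct = sum(
        ask(model, row["question"], row["options"])
        .upper()
        .startswith(row["answer"].strip().upper())
        for row in rows
    )
    return correct / len(rows)


if __name__ == "__main__":
    with open("enam_2022.csv", newline="", encoding="utf-8") as f:  # hypothetical file
        rows = list(csv.DictReader(f))
    for model in ("gpt-3.5-turbo", "gpt-4"):
        print(model, f"{accuracy(model, rows):.0%}")
```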

Results: GPT-4 achieved an accuracy of 86% on the ENAM, followed by GPT-3.5 with 77%. The accuracy obtained by the 1025 examinees was 55%. There was fair agreement (κ=0.38) between GPT-3.5 and GPT-4. Moderate-to-high-difficulty questions were associated with incorrect answers in both the crude and adjusted models for GPT-3.5 (odds ratio [OR] 6.6, 95% CI 2.73-15.95) and GPT-4 (OR 33.23, 95% CI 4.3-257.12). After reinputting questions that received incorrect answers, GPT-3.5 went from 41 (100%) to 12 (29%) incorrect answers, and GPT-4 from 25 (100%) to 4 (16%).
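As a point of reference for the statistics reported here, the short sketch below shows how agreement (Cohen's κ) and a crude odds ratio can be computed from per-question correctness vectors. The 0/1 vectors are illustrative placeholders, not the study data, and the scikit-learn/NumPy tooling is an assumption.

```python
# Toy illustration of the agreement and association measures reported above.
# The vectors below are placeholders, NOT the study data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Per-question correctness (1 = correct, 0 = incorrect) for each model.
gpt35 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # placeholder
gpt4 = np.array([1, 1, 1, 1, 0, 1, 0, 1, 1, 1])   # placeholder

# Cohen's kappa: agreement beyond chance between the two models
# (the paper reports kappa = 0.38, i.e. "fair" agreement).
print("kappa:", round(cohen_kappa_score(gpt35, gpt4), 2))

# Crude odds ratio of an incorrect answer for moderate-to-high-difficulty
# questions, from a 2x2 table of difficulty x correctness.
hard = np.array([0, 1, 1, 0, 1, 0, 0, 0, 1, 0])  # 1 = harder question (placeholder)
incorrect = 1 - gpt35
a = np.sum((hard == 1) & (incorrect == 1))  # hard and incorrect
b = np.sum((hard == 1) & (incorrect == 0))  # hard and correct
c = np.sum((hard == 0) & (incorrect == 1))  # easy and incorrect
d = np.sum((hard == 0) & (incorrect == 0))  # easy and correct
print("crude OR:", (a * d) / (b * c))
```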

Conclusions: Our study found that ChatGPT (GPT-3.5 and GPT-4) can achieve expert-level performance on the ENAM, outperforming most of our examinees. We found fair agreement between GPT-3.5 and GPT-4. Incorrect answers were associated with question difficulty, which may resemble human performance. Furthermore, by reinputting questions that initially received incorrect answers with different prompts containing additional roles and context, ChatGPT achieved improved accuracy.
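The "additional roles and context" reprompting idea lends itself to a very short illustration. The message structure below is only a plausible shape for such a retry prompt, assuming a chat-style API; the exact wording used in the study is not reproduced here.

```python
# Illustrative shape of a role-and-context retry prompt for a question that
# was initially answered incorrectly. Wording is an assumption, not the
# paper's exact prompt; placeholders stand in for the real question text.
retry_messages = [
    {
        "role": "system",
        "content": "You are a Peruvian physician preparing for the ENAM.",
    },
    {
        "role": "user",
        "content": (
            "Context: this question refers to clinical practice within the "
            "Peruvian health system.\n"
            "Question: <question text>\n"
            "Options: <lettered options>\n"
            "Reply with the letter of the single best option."
        ),
    },
]

print(retry_messages[0]["content"])  # sanity check of the assembled prompt
```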

Download full-text PDF

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10570896
DOI: http://dx.doi.org/10.2196/48039

Publication Analysis

Top Keywords

incorrect answers (32)
national licensing (16)
licensing medical (16)
gpt-3.5 gpt-4 (16)
performance chatgpt (12)
medical examination (12)
received incorrect (12)
performance (8)
peruvian national (8)
medical licensing (8)

Similar Publications

Current applications and challenges in large language models for patient care: a systematic review.

Commun Med (Lond)

January 2025

School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany.

Background: The introduction of large language models (LLMs) into clinical practice promises to improve patient education and empowerment, thereby personalizing medical care and broadening access to medical knowledge. Despite the popularity of LLMs, there is a significant gap in systematized information on their use in patient care. Therefore, this systematic review aims to synthesize current applications and limitations of LLMs in patient care.

View Article and Find Full Text PDF

Causal Inference With Observational Data and Unobserved Confounding Variables.

Ecol Lett

January 2025

Department of Ecology and Evolutionary Biology, University of Colorado Boulder, Boulder, Colorado, USA.

Experiments have long been the gold standard for causal inference in Ecology. As Ecology tackles progressively larger problems, however, we are moving beyond the scales at which randomised controlled experiments are feasible. To answer causal questions at scale, we also need to use observational data, something Ecologists tend to view with great scepticism.

View Article and Find Full Text PDF

ChatGPT and oral cancer: a study on informational reliability.

BMC Oral Health

January 2025

Faculty of Dentistry, Department of Dentomaxillofacial Radiology, Tokat Gaziosmanpasa University, Tokat, Turkey.

Background: Artificial intelligence (AI) and large language models (LLMs) like ChatGPT have transformed information retrieval, including in healthcare. ChatGPT, trained on diverse datasets, can provide medical advice but faces ethical and accuracy concerns. This study evaluates the accuracy of ChatGPT-3.

View Article and Find Full Text PDF

Assessing the possibility of using large language models in ocular surface diseases.

Int J Ophthalmol

January 2025

Department of Ophthalmology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, National Clinical Research Center for Eye Diseases, Shanghai 200080, China.

Aim: To assess the possibility of using different large language models (LLMs) in ocular surface diseases by selecting five LLMs (ChatGPT-4, ChatGPT-3.5, Claude 2, PaLM2, and SenseNova) and testing their accuracy in answering specialized questions related to ocular surface diseases.

Methods: A group of experienced ophthalmology professors was asked to develop a 100-item single-choice examination on ocular surface diseases, designed to assess the performance of LLMs and human participants in answering ophthalmology specialty exam questions.

View Article and Find Full Text PDF

Introduction: Physicians are expected to be competent in the management of cardiovascular emergencies. Despite the demand, there is a lack of research regarding how to better provide training for medical students to address cardiovascular emergencies. The authors of this project hypothesize that medical students participating in the Advanced Cardiac Life Support Instructors (ACLS-I) program will improve their emergency management and clinical teaching competencies and confidence.

View Article and Find Full Text PDF
