Evaluating the Efficacy of ChatGPT in Navigating the Spanish Medical Residency Entrance Examination (MIR): Promising Horizons for AI in Clinical Medicine.

Francisco Guillen-Grima, Sara Guillen-Aguinaga, Laura Guillen-Aguinaga, Rosa Alas-Brun, Luc Onambele, Wilfrido Ortega, Rocio Montejo, Enrique Aguinaga-Ontoso, Paul Barach, Ines Aguinaga-Ontoso

Clin Pract

Department of Health Sciences, Public University of Navarra, 31008 Pamplona, Spain.

Published: November 2023

The rapid progress in artificial intelligence, machine learning, and natural language processing has led to increasingly sophisticated large language models (LLMs) for use in healthcare. This study assesses the performance of two LLMs, GPT-3.5 and GPT-4, on the MIR examination, which governs access to medical specialist training in Spain. Our objectives were to gauge the models' overall performance, analyze discrepancies across medical specialties, distinguish between theoretical and practical questions, estimate error proportions, and assess how severe each error would hypothetically be had it been committed by a physician.

Materials and Methods: We studied the 2022 Spanish MIR examination after excluding questions that required image evaluation or had acknowledged errors. The remaining 182 questions were presented to GPT-4 and GPT-3.5 in Spanish and in English. Logistic regression models were used to analyze the relationships between question length, question sequence, and performance. We also analyzed the 23 questions with images using GPT-4's new image analysis capability.
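As a rough illustration of the kind of logistic regression described in the methods, the sketch below regresses a correct/incorrect outcome on question length and question sequence. Everything in it is an assumption made for illustration: the data are synthetic (not the study's), the variable names and the gradient-ascent fitting routine are hypothetical, and the abstract does not say which software or model specification the authors used.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def standardize(values):
    # Rescale to mean 0 and standard deviation 1 so one step size suits all columns.
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def fit_logistic(rows, labels, lr=0.5, steps=2000):
    # Batch gradient ascent on the mean log-likelihood.
    # Returns [intercept, coefficient for column 0, coefficient for column 1, ...].
    n, k = len(rows), len(rows[0])
    beta = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for x, y in zip(rows, labels):
            p = sigmoid(beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)))
            err = y - p
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# SYNTHETIC data, invented for this sketch only (not taken from the study).
random.seed(0)
N = 182                                   # same number of questions as in the abstract
length = [random.randint(20, 200) for _ in range(N)]   # hypothetical word counts
sequence = list(range(1, N + 1))                        # position in the exam
z_len = standardize(length)
z_seq = standardize(sequence)

# Assumed data-generating rule: longer questions are answered correctly less often,
# and position in the exam has no effect.
correct = [1 if random.random() < sigmoid(2.0 - 1.0 * zl) else 0 for zl in z_len]

beta = fit_logistic(list(zip(z_len, z_seq)), correct)
odds_ratios = [math.exp(b) for b in beta[1:]]   # per one-SD increase in each predictor

print("intercept:", round(beta[0], 2))
print("odds ratio, length:", round(odds_ratios[0], 2))
print("odds ratio, sequence:", round(odds_ratios[1], 2))
```

An odds ratio below 1 for length would indicate that longer questions are less likely to be answered correctly; here that pattern is built into the synthetic data, so the output says nothing about the actual MIR results.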

Results: GPT-4 outperformed GPT-3.5, scoring 86.81% in Spanish (p < 0.001). Performance was slightly better on the English translations. On the questions with images, GPT-4 answered 26.1% correctly in English and 13.0% in Spanish, a difference that was not statistically significant (p = 0.250). Among medical specialties, GPT-4 achieved a 100% correct response rate in several areas, whereas Pharmacology, Critical Care, and Infectious Diseases showed lower performance. The error analysis showed an overall error rate of 13.2%, but the gravest categories, "error requiring intervention to sustain life" and "error resulting in death", had a 0% rate.
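The reported percentages can be converted back into question counts as a quick consistency check. The sketch below does that arithmetic and, as a purely hypothetical aside, shows that an exact McNemar test with three discordant questions all favoring English would give p = 0.25. The abstract does not state which test produced p = 0.250 or how the paired answers were distributed, so the discordant counts and the choice of test are assumptions, and the function names are illustrative.

```python
from math import comb


def count_from_percent(percent, total):
    # Nearest whole number of questions matching a reported percentage.
    return round(percent / 100 * total)


def exact_mcnemar_p(b, c):
    # Two-sided exact McNemar p-value from the discordant pair counts b and c,
    # using a binomial(b + c, 0.5) reference distribution.
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Text-only questions: 182 questions, 86.81% correct for GPT-4 in Spanish.
text_total = 182
gpt4_correct = count_from_percent(86.81, text_total)
gpt4_errors = text_total - gpt4_correct
error_rate = 100 * gpt4_errors / text_total

# Image questions: 23 questions, 26.1% correct in English and 13.0% in Spanish.
image_total = 23
english_correct = count_from_percent(26.1, image_total)
spanish_correct = count_from_percent(13.0, image_total)

print(gpt4_correct, gpt4_errors, round(error_rate, 1))   # 158 24 13.2
print(english_correct, spanish_correct)                   # 6 3

# ASSUMPTION: 3 questions correct only in English, 0 correct only in Spanish.
print(exact_mcnemar_p(3, 0))                              # 0.25
```

The counts (158 of 182 correct, 24 errors) reproduce the 13.2% error rate given in the results, and 6 versus 3 correct out of 23 reproduces the two image-question percentages.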

Conclusions: GPT-4 performs robustly on the Spanish MIR examination, although its ability to discriminate knowledge varies across specialties. While the model's high success rate is commendable, understanding the severity of its errors is critical, especially when considering AI's potential role in real-world medical practice and its implications for patient safety.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10660543
DOI: http://dx.doi.org/10.3390/clinpract13060130

Publication Analysis

Top Keywords: medical specialties; spanish mir; mir examination; questions images; spanish; medical; gpt-4; questions; performance; evaluating efficacy

Similar Publications

Nicotine enhances the stemness and tumorigenicity in intestinal stem cells via Hippo-YAP/TAZ and Notch signal pathway.

Elife

January 2025

Department of Diabetes and Metabolic Diseases, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.

Ryosuke Isotani, Masaki Igarashi, Masaomi Miura, Kyoko Naruse, Satoshi Kuranami

Cigarette smoking is a well-known risk factor inducing the development and progression of various diseases. Nicotine (NIC) is the major constituent of cigarette smoke. However, knowledge of the mechanism underlying the NIC-regulated stem cell functions is limited.


ChatGPT's Attitude, Knowledge, and Clinical Application in Geriatrics Practice and Education: Exploratory Observational Study.

JMIR Form Res

January 2025

Minneapolis VA Health Care System, Minneapolis, MN, United States.

Huai Yong Cheng

Background: The increasing use of ChatGPT in clinical practice and medical education necessitates the evaluation of its reliability, particularly in geriatrics.

Objective: This study aimed to evaluate ChatGPT's trustworthiness in geriatrics through 3 distinct approaches: evaluating ChatGPT's geriatrics attitude, knowledge, and clinical application with 2 vignettes of geriatric syndromes (polypharmacy and falls).

Methods: We used the validated University of California, Los Angeles, geriatrics attitude and knowledge instruments to evaluate ChatGPT's geriatrics attitude and knowledge and compare its performance with that of medical students, residents, and geriatrics fellows from reported results in the literature.


A European Survey to Identify Challenges in the Management of Metabolic Dysfunction-Associated Steatotic Liver Disease.

Liver Int

February 2025

City University of New York Graduate School for Public Health and Health Policy (CUNY SPH), New York, New York, USA.

Laurent Castera, William Alazawi, Elisabetta Bugianesi, Cyrielle Caussy, Massimo Federici

Background and Aims: Metabolic dysfunction-associated steatotic liver disease (MASLD) and its more severe subtype, metabolic dysfunction-associated steatohepatitis (MASH), are highly prevalent and strongly associated with obesity and type 2 diabetes (T2D). This study sought to identify challenges to the diagnosis, treatment, and management of people living with MASLD and MASH and to understand the key barriers to adopting relevant clinical guidelines.

Methods: A real-world, cross-sectional study (BARRIERS-MASLD) consisting of a quantitative survey and qualitative interviews of physicians in France, Germany, Italy, Spain and the United Kingdom was conducted from March to September 2023.


RASGRP1 Deficiency Associated with Diffuse Mesangial Sclerosis Infantile Nephrotic Syndrome and Epstein-Barr Virus-Induced Hodgkin's Lymphoma.

Pediatr Allergy Immunol Pulmonol

January 2025

Clinical Immunology Unit, Faculty of Medicine and Health Sciences, Department of Paediatrics, Universiti Putra Malaysia, Selangor, Malaysia.

Khairoon Nisa Mohamed Nashrudin, Mohd Azri Zainal Abidin, Shi Eng Ng, Hadibiah Razali, Vida Jawin

RAS guanyl-releasing protein 1 (RASGRP1) deficiency is characterized by immune dysregulation and Epstein-Barr virus (EBV)-related lymphoproliferation. Diffuse mesangial sclerosis is one of the infrequent causes of infantile nephrotic syndrome. Here, we describe a 7-year-old girl who was diagnosed with diffuse mesangial sclerosis at 5 months old and subsequently developed chronic bilateral neck swelling at the age of 3 years.


Good and bad indications for adjuvant radiotherapy after transoral laser microsurgery for laryngeal cancer.

Curr Opin Otolaryngol Head Neck Surg

December 2024

Otorhinolaryngology Department. Hospital Clínic.

Claudio Sampieri, Laura Ruiz-Sevilla, Isabel Vilaseca

Purpose of Review: To summarize current evidence regarding the indication for adjuvant treatment after transoral laser microsurgery (TOLMS).

Recent Findings: Apart from well-known risk factors, margins are the key point in decision-making. If margins are affected, additional treatment is mandatory.
