The rapid progress in artificial intelligence, machine learning, and natural language processing has led to increasingly sophisticated large language models (LLMs) for use in healthcare. This study assesses the performance of two LLMs, GPT-3.5 and GPT-4, on the MIR medical examination, which governs access to medical specialist training in Spain. Our objectives were to gauge the models' overall performance, analyze discrepancies across medical specialties, distinguish between theoretical and practical questions, estimate error proportions, and assess how severe the errors would hypothetically be if committed by a physician.

Materials and Methods: We studied the 2022 Spanish MIR examination after excluding questions that required image evaluation or had acknowledged errors. The remaining 182 questions were presented to GPT-4 and GPT-3.5 in Spanish and in English. Logistic regression models were used to analyze the relationships between question length, question sequence, and performance. We also analyzed the 23 questions with images, using GPT-4's new image analysis capability.

Results: GPT-4 outperformed GPT-3.5, scoring 86.81% in Spanish (p < 0.001). Performance was slightly better on the English translations. On the questions with images, GPT-4 answered 26.1% correctly in English. The results were worse when the questions were in Spanish (13.0%), although the difference was not statistically significant (p = 0.250). Among medical specialties, GPT-4 achieved a 100% correct response rate in several areas, whereas Pharmacology, Critical Care, and Infectious Diseases showed lower performance. The error analysis revealed that, although the overall error rate was 13.2%, the gravest categories, such as "error requiring intervention to sustain life" and "error resulting in death", had a 0% rate.
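The percentages reported above can be converted back to question counts as a quick consistency check. The sketch below is illustrative only: the helper names are hypothetical, and the paired-test part assumes (the abstract does not say so) that the Spanish versus English comparison on the 23 image questions was a paired comparison in which 3 questions were answered correctly only in English and none only in Spanish. Under that assumption, an exact McNemar test gives a two-sided p-value of 0.25, which is consistent with the reported value but does not confirm which test the authors used.

```python
from math import comb


def implied_count(percent, total):
    """Return the integer count closest to `percent` % of `total`."""
    return round(percent / 100 * total)


def exact_mcnemar_p(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail up to the smaller discordant count, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


TEXT_QUESTIONS = 182   # questions without images (from the abstract)
IMAGE_QUESTIONS = 23   # questions with images (from the abstract)

correct_es = implied_count(86.81, TEXT_QUESTIONS)   # GPT-4, Spanish
errors_es = TEXT_QUESTIONS - correct_es
error_rate = round(100 * errors_es / TEXT_QUESTIONS, 1)

image_en = implied_count(26.1, IMAGE_QUESTIONS)     # GPT-4, English, images
image_es = implied_count(13.0, IMAGE_QUESTIONS)     # GPT-4, Spanish, images

# ASSUMPTION (not stated in the abstract): every question answered correctly
# in Spanish was also answered correctly in English, so the discordant pairs
# are (image_en - image_es) in one direction and 0 in the other.
p_images = exact_mcnemar_p(image_en - image_es, 0)

print(correct_es, errors_es, error_rate)
print(image_en, image_es, p_images)
```

The implied counts are 158 correct and 24 incorrect answers out of 182 (a 13.2% error rate), and 6 versus 3 correct answers out of 23 image questions.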

Conclusions: GPT-4 performs robustly on the Spanish MIR examination, with varying capabilities to discriminate knowledge across specialties. While the model's high success rate is commendable, understanding the error severity is critical, especially when considering AI's potential role in real-world medical practice and its implications for patient safety.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10660543
DOI: http://dx.doi.org/10.3390/clinpract13060130
