Large language models (LLMs) have broad potential applications in medicine, such as aiding education, reassuring patients, and supporting clinical decision-making. However, their applicability and performance in the surgical domain, and how that performance varies across specialties, remain poorly understood. This paper evaluates the performance of LLMs in answering surgical questions relevant to clinical practice and assesses how this performance varies across surgical specialties. We used the MedMCQA dataset, a large-scale multi-choice question-answer (MCQA) dataset consisting of clinical questions across all areas of medicine. We extracted the 23,035 relevant surgical questions and submitted them to two popular LLMs, Generative Pre-trained Transformer (GPT)-3.5 and GPT-4 (OpenAI OpCo, LLC, San Francisco, CA). GPT is a large language model that generates human-like text by predicting subsequent words in a sentence from the context of the words that precede them. It is pre-trained on a diverse range of texts and can perform a variety of tasks, such as answering questions, without task-specific training. The question-answering accuracy of each model was calculated and compared between the two models and across surgical specialties. GPT-3.5 and GPT-4 achieved accuracies of 53.3% and 64.4%, respectively, on surgical questions, a statistically significant difference. Relative to their performance on the full MedMCQA dataset, the two models diverged: GPT-4 performed worse on surgical questions than on the dataset as a whole, while GPT-3.5 showed the opposite pattern. Accuracy also varied significantly across surgical specialties, with stronger performance in anatomy, vascular, and paediatric surgery and weaker performance in orthopaedics, ENT, and neurosurgery.
Large language models exhibit promising capabilities in addressing surgical questions, although the variability in their performance between specialties cannot be ignored. The lower performance of the latest GPT-4 model on surgical questions relative to questions across all medicine highlights the need for targeted improvements and continuous updates to ensure relevance and accuracy in surgical applications. Further research and continuous monitoring of LLM performance in surgical domains are crucial to fully harnessing their potential and mitigating the risks of misinformation.
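
The accuracy comparison described above can be illustrated with a minimal sketch. It is not the analysis code used in the study: the field names (`specialty`, `gold`, `gpt35`, `gpt4`) and the toy rows are hypothetical, and the abstract does not state which significance test was applied. McNemar's test with continuity correction is assumed here only because both models answer the same questions, which makes the comparison paired.

```python
import math
from collections import defaultdict


def accuracy(rows, model):
    """Fraction of rows where the model's answer equals the gold answer."""
    return sum(r[model] == r["gold"] for r in rows) / len(rows)


def accuracy_by_specialty(rows, model):
    """Accuracy of one model within each specialty label."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["specialty"]].append(r)
    return {name: accuracy(group, model) for name, group in groups.items()}


def mcnemar_p(rows, model_a, model_b):
    """Two-sided p-value of McNemar's test (continuity-corrected).

    Assumed test for a paired comparison; the source does not name its test.
    """
    # Discordant pairs: exactly one of the two models is correct.
    b = sum(r[model_a] == r["gold"] and r[model_b] != r["gold"] for r in rows)
    c = sum(r[model_a] != r["gold"] and r[model_b] == r["gold"] for r in rows)
    if b + c == 0:
        return 1.0
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical toy data, not drawn from MedMCQA.
toy_rows = [
    {"specialty": "Anatomy", "gold": "a", "gpt35": "a", "gpt4": "a"},
    {"specialty": "Anatomy", "gold": "b", "gpt35": "c", "gpt4": "b"},
    {"specialty": "ENT", "gold": "d", "gpt35": "a", "gpt4": "d"},
    {"specialty": "ENT", "gold": "c", "gpt35": "c", "gpt4": "a"},
]

if __name__ == "__main__":
    for model in ("gpt35", "gpt4"):
        print(model, accuracy(toy_rows, model), accuracy_by_specialty(toy_rows, model))
    print("McNemar p:", mcnemar_p(toy_rows, "gpt35", "gpt4"))
```

With only four toy rows the test has no power; on a question set of the size reported above, the same functions would be applied to the full table of model answers.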

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10720372
DOI: http://dx.doi.org/10.7759/cureus.48788
