Introduction

Artificial intelligence (AI) models that use large language models (LLMs) and are not specific to any one domain have gained attention for their innovative information processing. As AI advances, it is essential to regularly evaluate these tools' competency to maintain high standards, prevent errors or biases, and avoid flawed reasoning or misinformation that could harm patients or spread inaccuracies. Our study aimed to determine the performance of Chat Generative Pre-trained Transformer (ChatGPT) by OpenAI and Google BARD (BARD) in orthopedic surgery, assess performance by question type, contrast performance between the two AIs, and compare AI performance to that of orthopedic residents.

Methods

We administered 757 Orthopedic In-Training Examination (OITE) questions to ChatGPT and BARD. After excluding image-related questions, the AIs answered 390 multiple-choice questions, all categorized into 10 sub-specialties (basic science, trauma, sports medicine, spine, hip and knee, pediatrics, oncology, shoulder and elbow, hand, and foot and ankle) and three taxonomy classes (recall, interpretation, and application of knowledge). Statistical analysis compared the number of questions each AI model answered correctly, each model's performance within each question sub-specialty, and each model's performance against the results of orthopedic residents grouped by post-graduate year (PGY) level.

Results

BARD answered more questions correctly overall (58% vs 54%, p<0.001). ChatGPT performed better in sports medicine and basic science and worse in hand surgery, while BARD performed better in basic science (p<0.05). The AIs performed better on recall questions than on application-of-knowledge questions (p<0.05). Based on previous data, the AI ranked in the 42nd-96th percentile for PGY1s, 27th-58th for PGY2s, 3rd-29th for PGY3s, 1st-21st for PGY4s, and 1st-17th for PGY5s.

Discussion

ChatGPT excelled in sports medicine but fell short in hand surgery, while both AIs performed well in the basic science sub-specialty but poorly on application-of-knowledge taxonomy questions. BARD performed better than ChatGPT overall. Although the AI reached the level of a second-year (PGY2) orthopedic resident, it fell short of passing the American Board of Orthopedic Surgery (ABOS) examination. Its strengths in recall-based questions highlight its potential as an orthopedic learning and educational tool.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11014641
DOI: http://dx.doi.org/10.7759/cureus.56104
