We examine whether a leading AI system, GPT-4, understands text as well as humans do, first using a well-established standardized test of discourse comprehension. On this test, GPT-4 performs slightly, but not statistically significantly, better than humans, given the very high level of human performance. Both GPT-4 and humans make correct inferences about information that is not explicitly stated in the text, a critical test of understanding. Next, we use more difficult passages to determine whether they reveal larger differences between GPT-4 and humans. GPT-4 performs considerably better on this more difficult text than do the high school and university students for whom the passages are designed as reading-comprehension admission tests. Deeper exploration of GPT-4's performance on material from one of these admission tests reveals generally accepted signatures of genuine understanding, namely generalization and inference.
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11840437
DOI: http://dx.doi.org/10.1098/rsos.241313
Rheumatol Int
February 2025
Section for Rheumatology, Oslo University Hospital, Oslo, Norway.
To compare visceral adipose tissue (VAT) mass, lipid profile, and selected adipokines/cytokines in patients with juvenile idiopathic arthritis (JIA) with controls, and to explore associations between these markers and VAT. We included 60 JIA patients (30 oligoarticular, 30 polyarticular), aged 10-16 years, and 60 age- and sex-matched controls. VAT (g) was estimated by dual-energy X-ray absorptiometry.
J Med Internet Res
March 2025
Information Services, ECU Health, Greenville, NC, United States.
Background: Racial and ethnic bias in large language models (LLMs) used for health care tasks is a growing concern, as it may contribute to health disparities. In response, LLM operators implemented safeguards against prompts that are overtly seeking certain biases.
Objective: This study aims to investigate a potential racial and ethnic bias among 4 popular LLMs: GPT-3.
BMC Emerg Med
March 2025
Department of Emergency Medicine, Kocaeli City Hospital, Kocaeli, Turkey.
Objective: This study evaluates the potential use of ChatGPT in aiding clinical decision-making for patients with mild traumatic brain injury (TBI) by assessing the quality of responses it generates for clinical care.
Methods: Seventeen mild TBI case scenarios were selected from PubMed Central, and each case was analyzed by GPT-4 (March 21, 2024, version) between April 11 and April 20, 2024. Responses were evaluated by four emergency medicine specialists, who rated the ease of understanding, scientific adequacy, and satisfaction with each response using a 7-point Likert scale.
JMIR Med Inform
March 2025
Department of Emergency and Critical Care Medicine, Chiba University Graduate School of Medicine, 1-8-1 Inohana, Chuo, Chiba, 260-8677, Japan, 81 432262372.
This study demonstrated that while GPT-4 Turbo had superior specificity when compared to GPT-3.5 Turbo (0.98 vs 0.
Am Surg
March 2025
Department of Surgery, Sapienza University of Rome, Rome, Italy.
Background: Large language models (LLMs) are advanced tools capable of understanding and generating human-like text. This study evaluated the accuracy of several commercial LLMs in addressing clinical questions related to diagnosis and management of acute cholecystitis, as outlined in the Tokyo Guidelines 2018 (TG18). We assessed their congruence with the expert panel discussions presented in the guidelines.