We examine whether a leading AI system, GPT-4, understands text as well as humans do, first using a well-established standardized test of discourse comprehension. On this test, GPT-4 performs slightly, but not statistically significantly, better than humans, given the very high level of human performance. Both GPT-4 and humans make correct inferences about information that is not explicitly stated in the text, a critical test of understanding. Next, we use more difficult passages to determine whether they allow larger differences to emerge between GPT-4 and humans. GPT-4 does considerably better on this more difficult text than do the high school and university students for whom the passages are designed, as admission tests of student reading comprehension. Deeper exploration of GPT-4's performance on material from one of these admission tests reveals generally accepted signatures of genuine understanding, namely generalization and inference.


Source

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11840437
DOI: http://dx.doi.org/10.1098/rsos.241313


Similar Publications

To compare visceral adipose tissue (VAT) mass, lipid profile, and selected adipokines/cytokines in patients with juvenile idiopathic arthritis (JIA) with controls, and to explore associations between these markers and VAT. We included 60 JIA patients (30 oligoarticular, 30 polyarticular), aged 10-16 years, and 60 age- and sex-matched controls. VAT (g) was estimated by dual-energy x-ray absorptiometry.


Background: Racial and ethnic bias in large language models (LLMs) used for health care tasks is a growing concern, as it may contribute to health disparities. In response, LLM operators implemented safeguards against prompts that are overtly seeking certain biases.

Objective: This study aims to investigate a potential racial and ethnic bias among 4 popular LLMs: GPT-3.


Objective: This study evaluates the potential use of ChatGPT in aiding clinical decision-making for patients with mild traumatic brain injury (TBI) by assessing the quality of responses it generates for clinical care.

Methods: Seventeen mild TBI case scenarios were selected from PubMed Central, and each case was analyzed by GPT-4 (March 21, 2024, version) between April 11 and April 20, 2024. Responses were evaluated by four emergency medicine specialists, who rated the ease of understanding, scientific adequacy, and satisfaction with each response using a 7-point Likert scale.


GPT-3.5 Turbo and GPT-4 Turbo in Title and Abstract Screening for Systematic Reviews.

JMIR Med Inform

March 2025

Department of Emergency and Critical Care Medicine, Chiba University Graduate School of Medicine, 1-8-1 Inohana, Chuo, Chiba, 260-8677, Japan, 81 432262372.

This study demonstrated that while GPT-4 Turbo had superior specificity when compared to GPT-3.5 Turbo (0.98 vs 0.


Background: Large language models (LLMs) are advanced tools capable of understanding and generating human-like text. This study evaluated the accuracy of several commercial LLMs in addressing clinical questions related to diagnosis and management of acute cholecystitis, as outlined in the Tokyo Guidelines 2018 (TG18). We assessed their congruence with the expert panel discussions presented in the guidelines.

