We examine whether a leading AI system, GPT-4, understands text as well as humans do, first using a well-established standardized test of discourse comprehension. On this test, GPT-4 performs slightly, but not statistically significantly, better than humans, given the very high level of human performance. Both GPT-4 and humans make correct inferences about information that is not explicitly stated in the text, a critical test of understanding. Next, we use more difficult passages to determine whether they allow larger differences to emerge between GPT-4 and humans. GPT-4 does considerably better on this more difficult text than do the high school and university students for whom the passages are designed, as admission tests of student reading comprehension. Deeper exploration of GPT-4's performance on material from one of these admission tests reveals generally accepted signatures of genuine understanding, namely generalization and inference.


Source

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11840437
DOI: http://dx.doi.org/10.1098/rsos.241313


Similar Publications

To compare visceral adipose tissue (VAT) mass, lipid profile, and selected adipokines/cytokines in patients with juvenile idiopathic arthritis (JIA) with controls, and to explore associations between these markers and VAT. We included 60 JIA patients (30 oligoarticular, 30 polyarticular), aged 10-16 years, and 60 age- and sex-matched controls. VAT (g) was estimated by dual-energy x-ray absorptiometry.


Background: Racial and ethnic bias in large language models (LLMs) used for health care tasks is a growing concern, as it may contribute to health disparities. In response, LLM operators implemented safeguards against prompts that are overtly seeking certain biases.

Objective: This study aims to investigate a potential racial and ethnic bias among 4 popular LLMs: GPT-3.


Objective: This study evaluates the potential use of ChatGPT in aiding clinical decision-making for patients with mild traumatic brain injury (TBI) by assessing the quality of responses it generates for clinical care.

Methods: Seventeen mild TBI case scenarios were selected from PubMed Central, and each case was analyzed by GPT-4 (March 21, 2024, version) between April 11 and April 20, 2024. Responses were evaluated by four emergency medicine specialists, who rated the ease of understanding, scientific adequacy, and satisfaction with each response using a 7-point Likert scale.


GPT-3.5 Turbo and GPT-4 Turbo in Title and Abstract Screening for Systematic Reviews.

JMIR Med Inform

March 2025

Department of Emergency and Critical Care Medicine, Chiba University Graduate School of Medicine, 1-8-1 Inohana, Chuo, Chiba, 260-8677, Japan, 81 432262372.

This study demonstrated that while GPT-4 Turbo had superior specificity when compared to GPT-3.5 Turbo (0.98 vs 0.


Background: Large language models (LLMs) are advanced tools capable of understanding and generating human-like text. This study evaluated the accuracy of several commercial LLMs in addressing clinical questions related to diagnosis and management of acute cholecystitis, as outlined in the Tokyo Guidelines 2018 (TG18). We assessed their congruence with the expert panel discussions presented in the guidelines.

