Background: Improving patient education has been shown to enhance clinical outcomes and reduce disparities, though such efforts can be labor intensive. Large language models may offer an accessible way to improve patient educational materials. The aim of this study was to compare readability between existing educational materials and those generated by large language models.
Methods: Baseline colorectal surgery educational materials were gathered from a large academic institution (n = 52). Three prompts were entered into Perplexity and ChatGPT 3.5 for each topic: a Basic prompt that simply requested patient educational information on the topic, an Iterative prompt that repeated the instruction while asking for the information to be more health literate, and a Metric-based prompt that requested a sixth-grade reading level, short sentences, and short words. Flesch-Kincaid Grade Level (Grade Level), Flesch-Kincaid Reading Ease (Ease), and Modified Grade Level scores were calculated for all materials, and unpaired t tests were used to compare mean scores between baseline materials and documents generated by the artificial intelligence platforms.
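For context, the Flesch-Kincaid scores named above are defined by fixed formulas over average sentence length and syllables per word: Grade Level = 0.39 × (words per sentence) + 11.8 × (syllables per word) - 15.59, and Reading Ease = 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). The Python sketch below is illustrative only; the abstract does not state which software the authors used, and the syllable counter here is a crude heuristic rather than a validated one.

    import re

    def count_syllables(word):
        # Rough vowel-group heuristic; dedicated readability tools do this more carefully.
        word = word.lower()
        count = len(re.findall(r"[aeiouy]+", word))
        if word.endswith("e") and count > 1:
            count -= 1
        return max(count, 1)

    def flesch_kincaid(text):
        # Returns (Grade Level, Reading Ease) using the standard published formulas.
        sentences = max(len(re.findall(r"[.!?]+", text)), 1)
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        words_per_sentence = len(words) / sentences
        syllables_per_word = syllables / len(words)
        grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
        ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
        return grade, ease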
Results: Overall, existing materials were longer than materials generated by the large language models across categories and prompts: 863-956 words vs 170-265 words (ChatGPT) and 220-313 words (Perplexity), all P < .01. Baseline materials did not meet sixth-grade readability guidelines based on grade level (Grade Level 7.0-9.8 and Modified Grade Level 9.6-11.5) or ease of readability (Ease 53.1-65.0). Readability of materials generated by a large language model varied by prompt and platform. Overall, ChatGPT materials were more readable than baseline materials with the Metric-based prompt: Grade Level 5.2 vs 8.1, Modified Grade Level 7.3 vs 10.3, and Ease 70.5 vs 60.4, all P < .01. In contrast, Perplexity-generated materials were significantly less readable, except for those generated with the Metric-based prompt, which did not differ statistically from baseline.
Conclusion: Neither the existing materials nor the majority of educational materials created by large language models met readability recommendations. The exception was ChatGPT materials generated with the Metric-based prompt, which consistently improved readability scores from baseline and met recommendations for the average Grade Level score. This variability in performance highlights the importance of the prompt used with large language models.
Download full-text PDF / Source: http://dx.doi.org/10.1016/j.surg.2024.109024 (DOI)