Large language models can support generation of standardized discharge summaries - A retrospective study utilizing ChatGPT-4 and electronic health records.

Arne Schwieger, Katrin Angst, Mateo de Bardeci, Achim Burrer, Flurin Cathomas, Stefano Ferrea, Franziska Grätz, Marius Knorr, Golo Kronenberg, Tobias Spiller, David Troi, Erich Seifritz, Samantha Weber, Sebastian Olbrich

Int J Med Inform

Centre for Depression, Anxiety Disorders and Psychotherapy, Psychiatric University Hospital Zurich (PUK), Zurich, Switzerland; Faculty of Medicine, University of Zurich (UZH), Zurich, Switzerland.

Published: December 2024

The study evaluated the quality of psychiatric discharge summaries (DS) generated with ChatGPT-4 compared with those written by psychiatric residents.
Human-written summaries were rated significantly higher than AI-generated ones, both overall and in most evaluated categories.
Although AI-generated DS did not match the quality of human-written summaries, they showed potential as templates or starting points that could save physicians documentation time.

Objective: To evaluate whether psychiatric discharge summaries (DS) generated with ChatGPT-4 from electronic health records (EHR) can match the quality of DS written by psychiatric residents.

Methods: At a psychiatric primary care hospital, we compared 20 inpatient DS written by residents with DS generated by ChatGPT-4 from pseudonymized residents' notes in the patients' EHRs and a standardized prompt. Eight blinded psychiatry specialists rated both versions on a custom Likert scale from 1 to 5 across 15 quality subcategories. The primary outcome was the overall rating difference between the two groups. The secondary outcomes were the rating differences at the level of individual questions, cases, and raters.

Results: Human-written DS were rated significantly higher than AI-DS (mean ratings: human 3.78, AI 3.12, p < 0.05). They surpassed AI-DS significantly in 12/15 questions and 16/20 cases and were favored significantly by 7/8 raters. For "low expected correction effort", human DS were rated as 67% favorable, 19% neutral, and 14% unfavorable, whereas AI-DS were rated as 22% favorable, 33% neutral, and 45% unfavorable. Hallucinations were present in 40% of AI-DS, with 37.5% deemed highly clinically relevant. Minor content mistakes were found in 30% of AI-DS and 10% of human DS. Raters correctly identified AI-DS with 81% sensitivity and 75% specificity.
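
As an illustration of the arithmetic behind these figures, the following minimal Python sketch shows how sensitivity and specificity are derived from raters' guesses about authorship, and what the hallucination percentages would imply as absolute counts. The abstract does not report the underlying counts: the confusion-matrix values below are hypothetical, chosen only to land near the reported 81% and 75%, and assume that each of the 8 raters judged all 20 summaries of each type. The hallucination counts likewise assume that 40% refers to the 20 AI-DS and 37.5% to the AI-DS containing hallucinations.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of AI-written summaries that raters correctly labeled as AI."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of human-written summaries that raters correctly labeled as human."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts (NOT from the study): 8 raters x 20 summaries = 160
# authorship guesses per summary type.
ai_guessed_ai, ai_guessed_human = 130, 30        # AI-DS guesses
human_guessed_human, human_guessed_ai = 120, 40  # human DS guesses

sens = sensitivity(ai_guessed_ai, ai_guessed_human)
spec = specificity(human_guessed_human, human_guessed_ai)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")

# Hallucination counts under the stated assumptions.
n_ai_ds = 20
ds_with_hallucinations = 8   # 8 / 20 = 40%
highly_relevant = 3          # 3 / 8 = 37.5%
share_with_hallucinations = ds_with_hallucinations / n_ai_ds
share_highly_relevant = highly_relevant / ds_with_hallucinations
print(f"{share_with_hallucinations:.1%} with hallucinations, "
      f"{share_highly_relevant:.1%} of those highly clinically relevant")
```

Under these assumptions, the reported percentages would correspond to 8 of the 20 AI-DS containing hallucinations, 3 of which were rated highly clinically relevant.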

Discussion: Overall, AI-DS did not match the quality of resident-written DS but performed similarly in 20% of cases and were rated as favorable for "low expected correction effort" in 22% of cases. AI-DS lacked most in content specificity, ability to distill key case information, and coherence but performed adequately in conciseness, adherence to formalities, relevance of included content, and form.

Conclusion: LLM-written DS show potential as templates for physicians to finalize, potentially saving time in the future.

DOI: http://dx.doi.org/10.1016/j.ijmedinf.2024.105654

Similar Publications

One-year hemodynamic and clinical outcomes in self-expanding valves: Comparison of ACURATE neo2 versus ACURATE neo.

Cardiovasc Revasc Med

December 2024

Department of Cardiology and Catheterization Laboratories, Shonan Kamakura General Hospital, Okamoto 1370-1, Kamakura City, Kanagawa 247-8533, Japan.

Yoichi Sugiyama, Hirokazu Miyashita, Sebastian Dahlbacka, Tommi Vähäsilta, Tiina Vainikka

Background/purpose: Transcatheter aortic valve replacement (TAVR) with ACURATE neo2 showed better hemodynamic outcomes by mitigating paravalvular leakage (PVL) compared with ACURATE neo, and revealed promising one-year outcomes in single-arm studies. However, studies comparing the hemodynamic and clinical outcomes of the two valves are still scarce. Therefore, this study aimed to compare the one-year hemodynamic and clinical outcomes between the neo2 and neo.

A Remote Management-Centric Post-Discharge Pathway for Patients Admitted to GIM with Heart Failure.

Am J Med

December 2024

Department of Medicine, University of Toronto, Toronto, ON, Canada; HoPingKong Centre for Excellence in Education and Practice, University Health Network, Toronto, ON, Canada; Division of General Internal Medicine and Geriatrics, University Health Network, Toronto, ON, Canada.

William K Silverstein, Sarah Lawrason, Iris Carabuena, Rodrigo B Cavalcanti, Stella Kozuszko

Background: Few heart failure transition of care (TOC) programs specific to general internal medicine (GIM) exist. We thus piloted a TOC program for heart failure patients discharged from GIM that incorporates a remote patient management program, Medly.

Methods: This single-centre, prospective proof-of-concept study described sociodemographic and medical characteristics of included patients, and computed summary statistics to describe clinical and workload outcomes.

Trend and Factors Associated with Medical-Surgical Complications in Patients Discharged from Leprosy Multidrug Therapy at the Specialized Regional Hospital in Macenta, Guinea, from 2012 to 2021.

Trop Med Infect Dis

November 2024

Centre National de Formation et de Recherche en Santé Rurale de Mafèrinyah, Forécariah GPW7+V9G, Guinea.

Jean Hébélamou, Fassou Mathias Grovogui, Hawa Manet, Lavilé Povogui, Ismael Béavogui

This study analyzed the trend and factors associated with medical-surgical complications in patients discharged from leprosy multidrug therapy at the Centre Hospitalier Régional Spécialisé (CHRS), in Macenta, Republic of Guinea. This was a retrospective study using routine secondary data from 2012 to 2021. ... 2012 (n = 54) and 2013 (n = 35) and then a slight decrease between 2014 (n = 34) and 2017 (n = 26). From 2019 (n = 18) to 2021 (n = 1), a significant ...

The potential of Generative Pre-trained Transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study.

Lancet Digit Health

January 2025

Department of Biomedical Informatics, Medical School, Harvard University, Boston, MA, USA.

Maria Clara Saad Menezes, Alexander F Hoffmann, Amelia L M Tan, Mariné Nalbandyan, Gilbert S Omenn

Background: Patient notes contain substantial information but are difficult for computers to analyse due to their unstructured format. Large-language models (LLMs), such as Generative Pre-trained Transformer 4 (GPT-4), have changed our ability to process text, but we do not know how effectively they handle medical notes. We aimed to assess the ability of GPT-4 to answer predefined questions after reading medical notes in three different languages.

Case report and literature review: removal of a mercury thermometer from the abdomen of a 16-year-old boy under laparoscopy.

Front Surg

December 2024

Gastrointestinal Surgery Department, Baotou Central Hospital, Baotou, Inner Mongolia, China.

Runjie Hou, Jijun Wang, Jing Guo, Mingyue Du, Zhenyu Dong

Introduction: The incidence of foreign bodies within the human body is uncommon, with thermometers representing an exceptionally rare subset of such cases. The management of these cases is particularly challenging due to the fragility of mercury thermometers and the toxic nature of their contents.

Case Description: A 16-year-old male adolescent presented with a three-month history of persistent, dull pain localized to the right inguinal region.
