Comparing Commercial and Open-Source Large Language Models for Labeling Chest Radiograph Reports.

Radiology

From the Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, 149 Thirteenth St, Charlestown, MA 02129 (F.J.D., T.R.B., M.C.C., A.E.K., C.P.B.); Department of Radiology, Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany (F.J.D., L.D., F.A.M., F.B., L.J.); Department of Pediatric Oncology, Dana-Farber Cancer Institute, Boston, Mass (L.J.); Department of Diagnostic and Interventional Radiology, Technical University of Munich, Munich, Germany (L.C.A.); Mass General Brigham Data Science Office, Boston, Mass (J.S., T.S., C.P.B.); Microsoft Health and Life Sciences (HLS), Redmond, Wash (J.M.); Klinikum rechts der Isar, Technical University of Munich, Munich, Germany (K.K.B.); Department of Radiology and Nuclear Medicine, German Heart Center Munich, Munich, Germany (K.K.B.); and Department of Cardiovascular Radiology and Nuclear Medicine, Technical University of Munich, School of Medicine and Health, German Heart Center, TUM University Hospital, Munich, Germany (K.K.B.).

Published: October 2024

AI Article Synopsis

  • Advances in large language models (LLMs) have led to numerous commercial and open-source models, but there has been no real-world comparison of OpenAI's GPT-4 against these models for extracting information from radiology reports.
  • The study aimed to compare GPT-4 with several leading open-source LLMs in extracting relevant findings from chest radiograph reports, using the ImaGenome dataset and reports from the Massachusetts General Hospital.
  • Results showed that GPT-4 slightly outperformed the best open-source model, Llama 2-70B, in terms of micro F1 scores, with both showing strong performance in extracting findings from the reports.

Article Abstract

Background: Rapid advances in large language models (LLMs) have led to the development of numerous commercial and open-source models. While recent publications have explored OpenAI's GPT-4 for extracting information of interest from radiology reports, there has been no real-world comparison of GPT-4 with leading open-source models.

Purpose: To compare several leading open-source LLMs with GPT-4 on the task of extracting relevant findings from chest radiograph reports.

Materials and Methods: Two independent datasets of free-text radiology reports from chest radiograph examinations were used in this retrospective study, performed between February 2, 2024, and February 14, 2024. The first dataset consisted of reports from the ImaGenome dataset, which provides reference standard annotations for the MIMIC-CXR database acquired between 2011 and 2016. The second dataset consisted of randomly selected reports created at the Massachusetts General Hospital between July 2019 and July 2021. In both datasets, the commercial models GPT-3.5 Turbo and GPT-4 were compared with the open-source models Mistral-7B and Mixtral-8x7B (Mistral AI), Llama 2-13B and Llama 2-70B (Meta), and Qwen1.5-72B (Alibaba Group), as well as with CheXbert and CheXpert-labeler (Stanford ML Group), in their ability to accurately label the presence of multiple findings in radiograph text reports using zero-shot and few-shot prompting. The McNemar test was used to compare F1 scores between models.

Results: On the ImaGenome dataset (n = 450), the open-source model with the highest score, Llama 2-70B, achieved micro F1 scores of 0.97 and 0.97 for zero-shot and few-shot prompting, respectively, compared with GPT-4 F1 scores of 0.98 and 0.98 (P > .99 and P < .001 for superiority of GPT-4). On the institutional dataset (n = 500), the open-source model with the highest score, an ensemble model, achieved micro F1 scores of 0.96 and 0.97 for zero-shot and few-shot prompting, respectively, compared with GPT-4 F1 scores of 0.98 and 0.97 (P < .001 and P > .99 for superiority of GPT-4).

Conclusion: Although GPT-4 was superior to open-source models in zero-shot report labeling, few-shot prompting with a small number of example reports allowed the open-source models to closely match its performance. The benefit of few-shot prompting varied across datasets and models. © RSNA, 2024
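For readers unfamiliar with the setup, the zero-shot and few-shot labeling described in Materials and Methods can be illustrated with a minimal sketch. The finding list, prompt wording, and use of the OpenAI Python client are assumptions made for this sketch, not the study's published prompts.

```python
# A minimal, hypothetical sketch of zero-shot / few-shot report labeling.
# The finding list, prompt wording, and model name are illustrative
# assumptions, not the prompts used in the study.
from openai import OpenAI

FINDINGS = ["pneumothorax", "pleural effusion", "consolidation", "pulmonary edema"]

def build_prompt(report: str, examples: list[tuple[str, str]] | None = None) -> str:
    """Assemble the labeling prompt; passing examples makes it few-shot."""
    parts = [
        "Label each finding as 1 (present) or 0 (absent) in the chest "
        "radiograph report below. Answer as JSON with the keys: "
        + ", ".join(FINDINGS) + "."
    ]
    for example_report, example_labels in examples or []:
        parts.append(f"Report: {example_report}\nLabels: {example_labels}")
    parts.append(f"Report: {report}\nLabels:")
    return "\n\n".join(parts)

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set
response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # deterministic output for a labeling task
    messages=[{"role": "user", "content": build_prompt(
        "No focal consolidation. Small left pleural effusion.")}],
)
print(response.choices[0].message.content)
```

Few-shot prompting, as evaluated in the study, would pass a handful of annotated example reports through the examples argument; zero-shot leaves it empty.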
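The evaluation reported in Results can be sketched in the same spirit: micro F1 pools true and false positives across all finding labels, and the McNemar test compares two models on paired decisions. The toy data below and the exact (binomial) test variant are assumptions; the abstract does not specify these details.

```python
# Sketch of the evaluation: micro F1 per model plus a McNemar test on
# paired per-label correctness. Toy data; the exact-test choice is an
# assumption, since the abstract does not state the test variant.
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(450, 4))  # reference labels: 450 reports x 4 findings

pred_a = y_true.copy()
pred_a[rng.random(y_true.shape) < 0.02] ^= 1  # model A: ~2% flipped labels
pred_b = y_true.copy()
pred_b[rng.random(y_true.shape) < 0.03] ^= 1  # model B: ~3% flipped labels

print("micro F1, model A:", f1_score(y_true, pred_a, average="micro"))
print("micro F1, model B:", f1_score(y_true, pred_b, average="micro"))

# 2x2 contingency table of paired correctness across all label decisions
correct_a = (pred_a == y_true).ravel()
correct_b = (pred_b == y_true).ravel()
table = [
    [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
    [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
]
print(mcnemar(table, exact=True))  # P value driven by the discordant pairs
```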

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11535875
DOI: http://dx.doi.org/10.1148/radiol.241139

Publication Analysis

Top Keywords

few-shot prompting (20)
open-source models (16)
chest radiograph (12)
zero-shot few-shot (12)
gpt-4 (10)
models (9)
open-source (8)
commercial open-source (8)
large language (8)
language models (8)
