Closing the gap between open source and commercial large language models for medical evidence summarization.

Gongbo Zhang, Qiao Jin, Yiliang Zhou, Song Wang, Betina Idnay, Yiming Luo, Elizabeth Park, Jordan G Nestor, Matthew E Spotnitz, Ali Soroush, Thomas R Campion, Zhiyong Lu, Chunhua Weng, Yifan Peng

NPJ Digit Med

Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.

Published: September 2024

Large language models (LLMs) show potential in summarizing medical evidence, but reliance on proprietary models brings risks such as a lack of transparency and vendor dependency.
This study examines how much fine-tuning improves three open-source LLMs (PRIMERA, LongT5, and Llama-2) on MedReview, a benchmark dataset of 8161 pairs of systematic reviews and summaries.
Results indicate that fine-tuning improves the performance of all three open-source models: fine-tuned LongT5 performs close to zero-shot GPT-3.5, and smaller fine-tuned models sometimes outperform larger zero-shot models, in both human and GPT-4-simulated evaluations.

Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short of that of proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance. Utilizing a benchmark dataset, MedReview, consisting of 8161 pairs of systematic reviews and summaries, we fine-tuned three broadly used open-source LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the performance of all three open-source models improved after fine-tuning. The performance of fine-tuned LongT5 was close to that of GPT-3.5 in zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. These trends of improvement were observed in both a human evaluation and a larger-scale GPT-4-simulated evaluation.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11383939
DOI: http://dx.doi.org/10.1038/s41746-024-01239-w
