Closing the gap between open-source and commercial large language models for medical evidence summarization.

Gongbo Zhang, Qiao Jin, Yiliang Zhou, Song Wang, Betina R Idnay, Yiming Luo, Elizabeth Park, Jordan G Nestor, Matthew E Spotnitz, Ali Soroush, Thomas Campion, Zhiyong Lu, Chunhua Weng, Yifan Peng

ArXiv

Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.

Published: July 2024

Large language models (LLMs) show promise for summarizing medical evidence, but relying on proprietary models brings risks such as a lack of transparency and vendor dependency.
This study examined how far fine-tuning can improve open-source LLMs, fine-tuning three models (PRIMERA, LongT5, and Llama-2) on MedReview, a benchmark dataset of 8,161 pairs of systematic reviews and summaries.
Fine-tuning raised ROUGE-L, METEOR, and CHRF scores; fine-tuned LongT5 performed close to zero-shot GPT-3.5, and smaller fine-tuned models sometimes outperformed larger zero-shot models.

Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs, which introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short of proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can improve their performance in summarizing medical evidence. Using a benchmark dataset, MedReview, consisting of 8,161 pairs of systematic reviews and summaries, we fine-tuned three widely used open-source LLMs: PRIMERA, LongT5, and Llama-2. Overall, the fine-tuned LLMs obtained an increase of 9.89 in ROUGE-L (95% confidence interval: 8.94-10.81), 13.21 in METEOR score (95% confidence interval: 12.05-14.37), and 15.82 in CHRF score (95% confidence interval: 13.89-16.44). The performance of fine-tuned LongT5 is close to that of GPT-3.5 in zero-shot settings. Furthermore, smaller fine-tuned models sometimes outperformed larger zero-shot models. These improvements were also observed in both human and GPT-4-simulated evaluations. Our results can guide model selection for tasks demanding particular domain knowledge, such as medical evidence summarization.
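
The abstract reports gains as ROUGE-L score differences with 95% confidence intervals but does not say how either was computed. As a rough illustration only, the sketch below computes a ROUGE-L F1 score from the longest common subsequence of whitespace tokens and a percentile bootstrap interval over per-document score differences. The tokenization, the bootstrap procedure, and every name here (`lcs_length`, `rouge_l_f1`, `bootstrap_ci`) are assumptions made for the example, not the authors' evaluation pipeline.

```python
import random


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercased whitespace tokens, scaled to 0-100.

    Assumption: real toolkits usually add stemming and their own tokenizer.
    """
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 100 * 2 * precision * recall / (precision + recall)


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired score differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo_idx = int(alpha / 2 * n_resamples)
    return means[lo_idx], means[n_resamples - 1 - lo_idx]


if __name__ == "__main__":
    # Toy strings for illustration only; not data from the study.
    reference = "the intervention reduced mortality in adults"
    zero_shot = "mortality was studied in this review"
    fine_tuned = "the intervention reduced mortality in older adults"
    gain = rouge_l_f1(fine_tuned, reference) - rouge_l_f1(zero_shot, reference)
    print(f"ROUGE-L gain on the toy pair: {gain:.2f}")
```

In practice the differences passed to `bootstrap_ci` would be one value per test document (fine-tuned score minus zero-shot score), so the interval reflects variation across documents rather than a single pair.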

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11451644
