In this study, we present MedS-Bench, a comprehensive benchmark for evaluating large language models (LLMs) in clinical contexts, spanning 11 high-level clinical tasks. We evaluated nine leading LLMs, including MEDITRON, Llama 3, Mistral, GPT-4, and Claude-3.5, and found that most models struggle with these complex tasks. To address these limitations, we developed MedS-Ins, a large-scale instruction-tuning dataset for medicine comprising 58 medically oriented language corpora, totaling 5M instances with 19K instructions across 122 tasks. To demonstrate the dataset's utility, we conducted a proof-of-concept experiment, performing instruction tuning on a lightweight, open-source medical language model. The resulting model, MMedIns-Llama 3, significantly outperformed existing models on a variety of clinical tasks. To promote further advancements, we have made MedS-Ins fully accessible and invite the research community to contribute to its expansion. We have also launched a dynamic leaderboard for MedS-Bench to track the progress of medical LLMs.


Source
http://dx.doi.org/10.1038/s41746-024-01390-4


Similar Publications

Large language models (LLMs) are being increasingly incorporated into scientific workflows. However, we have yet to fully grasp the implications of this integration. How should the advancement of large language models affect the practice of science? For this opinion piece, we have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate.


Background And Objectives: Recent advances in multimodal large language models (MLLMs) have shown promise in medical image interpretation, yet their utility in surgical contexts remains unexplored. This study evaluates six MLLMs' performance in interpreting diverse imaging modalities for laryngeal cancer surgery.

Methods: We analyzed 169 images (X-rays, CT scans, laryngoscopy, and pathology findings) from 50 patients using six state-of-the-art MLLMs.


Purpose: Vocabulary access is important for individuals who use augmentative and alternative communication (AAC), especially for children in the early stages of language learning. This study sought to understand how accurate speech-language pathologists (SLPs), teachers, and parents are in predicting the vocabulary needed by early symbolic communicators who use AAC in three contexts.

Method: Ten groups, each with a child who used AAC as their primary mode of communication and who was classified as an early symbolic communicator and their parent, teacher, and SLP, participated.


Objectives: An improvement in speech perception is a major well-documented benefit of cochlear implantation (CI), which is commonly discussed with CI candidates to set expectations. However, a large variability exists in speech perception outcomes. We evaluated the accuracy of clinical predictions of post-CI speech perception scores.


Backgrounds: Biomedical research requires sophisticated understanding and reasoning across multiple specializations. While large language models (LLMs) show promise in scientific applications, their capability to safely and accurately support complex biomedical research remains uncertain.

Methods: We present a novel question-and-answer benchmark for evaluating LLMs in biomedical research.

