In this study, we present MedS-Bench, a comprehensive benchmark for evaluating large language models (LLMs) in clinical contexts, spanning 11 high-level clinical tasks. We evaluated nine leading LLMs, including MEDITRON, Llama 3, Mistral, GPT-4, and Claude-3.5, and found that most models struggle with these complex tasks. To address these limitations, we developed MedS-Ins, a large-scale instruction-tuning dataset for medicine. MedS-Ins comprises 58 medically oriented language corpora, totaling 5M instances with 19K instructions across 122 tasks. To demonstrate the dataset's utility, we conducted a proof-of-concept experiment, performing instruction tuning on a lightweight, open-source medical language model. The resulting model, MMedIns-Llama 3, significantly outperformed existing models on a range of clinical tasks. To promote further advances, we have made MedS-Ins fully accessible and invite the research community to contribute to its expansion. We have also launched a dynamic leaderboard for MedS-Bench to track the development of medical LLMs.
DOI: http://dx.doi.org/10.1038/s41746-024-01390-4
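As a rough illustration of the proof-of-concept described in the abstract above, the sketch below shows what supervised instruction tuning on a MedS-Ins-style corpus could look like using the Hugging Face transformers library. The base-model name, data path, record schema (instruction/input/output), prompt template, and hyperparameters are all assumptions made for illustration; they are not details confirmed by the paper.

```python
# Minimal instruction-tuning sketch. Model name, data path, schema, and
# prompt template are illustrative assumptions, not the authors' setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "meta-llama/Meta-Llama-3-8B"  # stand-in for the lightweight base model
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without one
model = AutoModelForCausalLM.from_pretrained(BASE)

def format_example(rec):
    # Fold instruction, optional input, and target answer into one sequence.
    text = (f"Instruction: {rec['instruction']}\n"
            f"Input: {rec.get('input', '')}\n"
            f"Answer: {rec['output']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=1024)

train_ds = (load_dataset("json", data_files="meds_ins.jsonl")["train"]
            .map(format_example,
                 remove_columns=["instruction", "input", "output"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mmedins-llama3-sketch",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=train_ds,
    # mlm=False yields standard causal-LM labels (inputs shifted by one).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```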
Proc Natl Acad Sci U S A
February 2025
Max Planck Institute for Biological Cybernetics, Tübingen, Baden-Württemberg 72076, Germany.
Large language models (LLMs) are being increasingly incorporated into scientific workflows. However, we have yet to fully grasp the implications of this integration. How should the advancement of large language models affect the practice of science? For this opinion piece, we have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate.
Int J Surg
January 2025
Department of Otolaryngology Head and Neck Surgery, The First Affiliated Hospital of Anhui Medical University, Hefei, Anhui Province, China.
Background And Objectives: Recent advances in multimodal large language models (MLLMs) have shown promise in medical image interpretation, yet their utility in surgical contexts remains unexplored. This study evaluates six MLLMs' performance in interpreting diverse imaging modalities for laryngeal cancer surgery.
Methods: We analyzed 169 images (X-rays, CT scans, laryngoscopy, and pathology findings) from 50 patients using six state-of-the-art MLLMs.
Am J Speech Lang Pathol
January 2025
Department of Speech and Hearing Science, The Ohio State University, Columbus.
Purpose: Vocabulary access is important for individuals who use augmentative and alternative communication (AAC), especially for children in the early stages of language learning. This study sought to understand how accurate speech-language pathologists (SLPs), teachers, and parents are in predicting the vocabulary needed by early symbolic communicators who use AAC in three contexts.
Method: Ten groups participated, each consisting of a child who used AAC as their primary mode of communication and was classified as an early symbolic communicator, together with that child's parent, teacher, and SLP.
Int J Audiol
January 2025
Department of Otolaryngology, Vanderbilt University Medical Center, Nashville, TN, USA.
Objectives: An improvement in speech perception is a major well-documented benefit of cochlear implantation (CI), which is commonly discussed with CI candidates to set expectations. However, a large variability exists in speech perception outcomes. We evaluated the accuracy of clinical predictions of post-CI speech perception scores.
Background: Biomedical research requires sophisticated understanding and reasoning across multiple specializations. While large language models (LLMs) show promise in scientific applications, their capability to safely and accurately support complex biomedical research remains uncertain.
Methods: We present , a novel question-and-answer benchmark for evaluating LLMs in biomedical research.