CARDBiomedBench: A Benchmark for Evaluating Large Language Model Performance in Biomedical Research.

Background: Biomedical research requires sophisticated understanding and reasoning across multiple specializations. While large language models (LLMs) show promise in scientific applications, their capability to safely and accurately support complex biomedical research remains uncertain.

Methods: We present CARDBiomedBench, a novel question-and-answer benchmark for evaluating LLMs in biomedical research. For our pilot implementation, we focus on neurodegenerative diseases (NDDs), a domain requiring integration of genetic, molecular, and clinical knowledge. The benchmark combines expert-annotated question-answer (Q/A) pairs with semi-automated data augmentation, drawing from authoritative public resources including drug development data, genome-wide association studies (GWAS), and Summary-data-based Mendelian Randomization (SMR) analyses. We evaluated seven private and open-source LLMs across ten biological categories and nine reasoning skills, using novel metrics to assess both response quality and safety.

Results: Our benchmark comprises over 68,000 Q/A pairs, enabling robust evaluation of LLM performance. Current state-of-the-art models show significant limitations: models like Claude-3.5-Sonnet demonstrate excessive caution (Response Quality Rate: 25% [95% CI: 25% ± 1], Safety Rate: 76% ± 1), while others like ChatGPT-4o exhibit both poor accuracy and unsafe behavior (Response Quality Rate: 37% ± 1, Safety Rate: 31% ± 1). These findings reveal fundamental gaps in LLMs' ability to handle complex biomedical information.

Conclusion: CARDBiomedBench establishes a rigorous standard for assessing LLM capabilities in biomedical research. Our pilot evaluation in the NDD domain reveals critical limitations in current models' ability to safely and accurately process complex scientific information. Future iterations will expand to other biomedical domains, supporting the development of more reliable AI systems for accelerating scientific discovery.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11760394
DOI: http://dx.doi.org/10.1101/2025.01.15.633272

Similar Publications

Marigold: a machine learning-based web app for zebrafish pose tracking.

BMC Bioinformatics

January 2025

Biology Department, University of Massachusetts Amherst, Amherst, MA, USA.

Gregory Teicher R Madison Riffe Wayne Barnaby Gabrielle Martin Benjamin E Clayton

Background: High-throughput behavioral analysis is important for drug discovery, toxicological studies, and the modeling of neurological disorders such as autism and epilepsy. Zebrafish embryos and larvae are ideal for such applications because they are spawned in large clutches, develop rapidly, feature a relatively simple nervous system, and have orthologs to many human disease genes. However, existing software for video-based behavioral analysis can be incompatible with recordings that contain dynamic backgrounds or foreign objects, lack support for multiwell formats, require expensive hardware, and/or demand considerable programming expertise.

Exercise medicine as adjunct therapy during RADIation for CAncer of the prostaTE to improve treatment efficacy - protocol for the ERADICATE study: a phase II randomised controlled trial.

BMC Cancer

January 2025

Exercise Medicine Research Institute, Edith Cowan University, 270 Joondalup Drive, Joondalup, WA, 6027, Australia.

Oliver Schumacher Robert U Newton Colin Tang Raphael Chee Sjoerd B Vos

Background: Tumour hypoxia resulting from inadequate perfusion is common in many solid tumours, including prostate cancer, and constitutes a major limiting factor in radiation therapy that contributes to treatment resistance. Emerging research in preclinical animal models indicates that exercise has the potential to enhance the efficacy of cancer treatment by modulating tumour perfusion and reducing hypoxia; however, evidence from randomised controlled trials is currently lacking. The 'Exercise medicine as adjunct therapy during RADIation for CAncer of the prostaTE' (ERADICATE) study is designed to investigate the impact of exercise on treatment response, tumour physiology, and adverse effects of treatment in prostate cancer patients undergoing external beam radiation therapy (EBRT).

Individuals' Desire for Social Needs Sharing Among Healthcare Providers: Findings from the 2022 Health Information National Trends Survey.

J Gen Intern Med

January 2025

The Center for the Advancement of Team Science, Analytics, and Systems Thinking (CATALYST), College of Medicine, The Ohio State University, Columbus, OH, USA.

Ramona G Olvera Christine M Swoboda Joshua J Joseph Seuli Bose-Brill Ann Scheck McAlearney

Background: Increasingly, health systems are collecting and using social needs data, yet there is limited information about individuals' preferences for how social needs information is shared among providers for treatment purposes.

Objective: To explore the connection between experiencing social needs and concerns about healthcare providers sharing social needs information.

Design And Participants: A nationally representative, cross-sectional study of 6252 US community-dwelling adults (≥ 18 years of age) who responded to the Health Information National Trends Survey (HINTS 6) (response rate 28.

Very Early Health Technology Assessment for Potential Predictive Biomarkers in the Treatment of Advanced Non-Small Cell Lung Cancer.

Pharmacoecon Open

January 2025

Division of Psychosocial Research and Epidemiology, The Netherlands Cancer Institute, Amsterdam, The Netherlands.

Leila-Sophie Otten Alessandra I G Buma Berber Piet Rob Ter Heine Michel M van den Heuvel

Objectives: Immune checkpoint inhibitor (ICI)-containing treatment is currently prescribed as first-line treatment for all patients with advanced non-small cell lung cancer (NSCLC) without targetable driver mutations. However, only 30-45% of patients show no progression within 12 months after treatment start. Various biomarkers are being studied to save costly and potentially harmful treatment in non-responders.

Extracellular vesicles in seminal plasma of Sahiwal cattle bulls carry a differential abundance of sperm fertility-associated proteins for augmenting the functional quality of low-fertile bull spermatozoa.

Sci Rep

January 2025

Animal Genomics Laboratory, Animal Biotechnology Division, ICAR-National Dairy Research Institute, Karnal, Haryana, India.

Ankit Pal Seema Karanwal Mir Ahmad Habib Fanny Josan Vikrant Gaur

Poor male fertility significantly affects dairy production, primarily due to low conception rates (CR) in bulls, even when cows are inseminated with morphologically normal sperm. Seminal plasma is a key factor in evaluating the fertilizing ability of bull semen. The extracellular vesicles (EVs) in seminal plasma contain fertility-associated proteins like SPAM1, ADAM7, and SP10, which influence sperm function and fertilizing potential.
