Background: Biomedical research requires sophisticated understanding and reasoning across multiple specializations. While large language models (LLMs) show promise in scientific applications, their capability to safely and accurately support complex biomedical research remains uncertain.

Methods: We present CARDBiomedBench, a novel question-and-answer benchmark for evaluating LLMs in biomedical research. For our pilot implementation, we focus on neurodegenerative diseases (NDDs), a domain requiring integration of genetic, molecular, and clinical knowledge. The benchmark combines expert-annotated question-answer (Q/A) pairs with semi-automated data augmentation, drawing from authoritative public resources including drug development data, genome-wide association studies (GWAS), and Summary-data-based Mendelian Randomization (SMR) analyses. We evaluated seven private and open-source LLMs across ten biological categories and nine reasoning skills, using novel metrics to assess both response quality and safety.

Results: Our benchmark comprises over 68,000 Q/A pairs, enabling robust evaluation of LLM performance. Current state-of-the-art models show significant limitations: models like Claude-3.5-Sonnet demonstrate excessive caution (Response Quality Rate: 25% [95% CI: 25% ± 1], Safety Rate: 76% ± 1), while others like ChatGPT-4o exhibit both poor accuracy and unsafe behavior (Response Quality Rate: 37% ± 1, Safety Rate: 31% ± 1). These findings reveal fundamental gaps in LLMs' ability to handle complex biomedical information.
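
As an illustration of how a rate reported as "value ± margin" can be obtained, the sketch below computes a proportion and a 95% confidence half-width with the normal approximation. The abstract does not state how Response Quality Rate or Safety Rate are defined, nor which interval method was used, so the function name `rate_with_ci`, the approximation, and the counts are all assumptions made for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def rate_with_ci(hits: int, total: int, level: float = 0.95) -> tuple[float, float]:
    """Return (rate, half_width) for a proportion.

    Assumption: a "rate" is the fraction of responses meeting some criterion,
    and the interval is the normal-approximation (Wald) interval. The source
    abstract does not confirm either choice.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    rate = hits / total
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(rate * (1 - rate) / total)
    return rate, half_width


# Hypothetical counts, not taken from the paper.
rate, half_width = rate_with_ci(hits=250, total=1000)
print(f"{rate:.1%} ± {half_width:.1%}")
```

The half-width shrinks with the square root of the number of evaluated responses, so the margin depends on how many Q/A pairs a model was actually scored on.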

Conclusion: CARDBiomedBench establishes a rigorous standard for assessing LLM capabilities in biomedical research. Our pilot evaluation in the NDD domain reveals critical limitations in current models' ability to safely and accurately process complex scientific information. Future iterations will expand to other biomedical domains, supporting the development of more reliable AI systems for accelerating scientific discovery.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11760394
DOI: http://dx.doi.org/10.1101/2025.01.15.633272
