Background: Biomedical research requires sophisticated understanding and reasoning across multiple specializations. While large language models (LLMs) show promise in scientific applications, their capability to safely and accurately support complex biomedical research remains uncertain.
Methods: We present CARDBiomedBench, a novel question-and-answer benchmark for evaluating LLMs in biomedical research. For our pilot implementation, we focus on neurodegenerative diseases (NDDs), a domain that requires integrating genetic, molecular, and clinical knowledge. The benchmark combines expert-annotated question-answer (Q/A) pairs with semi-automated data augmentation, drawing on authoritative public resources including drug development data, genome-wide association studies (GWAS), and summary-data-based Mendelian randomization (SMR) analyses. We evaluated seven proprietary and open-source LLMs across ten biological categories and nine reasoning skills, using novel metrics to assess both response quality and safety.
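To make the per-category and per-skill breakdown concrete, here is a minimal Python sketch of stratified scoring. It assumes each graded answer is stored as a record tagged with one biological category and one reasoning skill; the `GradedItem` record, the `rate_by` helper, and the labels below are hypothetical illustrations, not the benchmark's actual schema or taxonomy, and the boolean `correct` field stands in for whatever grading the authors apply.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GradedItem:
    # Hypothetical record: one benchmark question after the model's answer was graded.
    category: str  # biological category label (illustrative, not the real taxonomy)
    skill: str     # reasoning-skill label (illustrative, not the real taxonomy)
    correct: bool  # whether the graded answer was judged correct


def rate_by(items: list[GradedItem], key: str) -> dict[str, float]:
    """Return the fraction of correct answers for each value of `key` ("category" or "skill")."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for item in items:
        group = getattr(item, key)
        totals[group] += 1
        hits[group] += int(item.correct)
    return {group: hits[group] / totals[group] for group in totals}


# Toy graded results with made-up labels, for illustration only.
items = [
    GradedItem("genetics", "numeric", True),
    GradedItem("genetics", "lookup", False),
    GradedItem("drugs", "lookup", True),
]
print(rate_by(items, "category"))  # {'genetics': 0.5, 'drugs': 1.0}
print(rate_by(items, "skill"))     # {'numeric': 1.0, 'lookup': 0.5}
```

Grouping by a single attribute keeps the two breakdowns independent; a joint category-by-skill table could be built the same way by keying on a `(category, skill)` tuple.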
Results: Our benchmark comprises over 68,000 Q/A pairs, enabling robust evaluation of LLM performance. Current state-of-the-art models show significant limitations: some, such as Claude-3.5-Sonnet, are excessively cautious (Response Quality Rate: 25% ± 1; Safety Rate: 76% ± 1), while others, such as ChatGPT-4o, are both inaccurate and unsafe (Response Quality Rate: 37% ± 1; Safety Rate: 31% ± 1), where ± denotes the 95% confidence interval. These findings reveal fundamental gaps in LLMs' ability to handle complex biomedical information.
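The ± values above are 95% confidence intervals on rates. The abstract does not state how they were computed; as one common option, a normal-approximation (Wald) interval for a proportion can be sketched as follows. The `rate_with_ci` helper and the counts passed to it are hypothetical and serve only to show how a rate and its half-width in percentage points relate to the number of scored questions.

```python
import math


def rate_with_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return (rate, half_width), both in percent, using the normal (Wald) approximation.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * half_width


# Hypothetical counts, chosen only for illustration.
rate, hw = rate_with_ci(successes=2_500, total=10_000)
print(f"{rate:.0f}% ± {hw:.1f}")  # 25% ± 0.8
```

The half-width shrinks with the square root of the number of scored questions, which is why a large benchmark yields tight intervals of about one percentage point. Bootstrap resampling is a common alternative when the rate is not a simple proportion.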
Conclusion: CARDBiomedBench establishes a rigorous standard for assessing LLM capabilities in biomedical research. Our pilot evaluation in the NDD domain reveals critical limitations in current models' ability to safely and accurately process complex scientific information. Future iterations will expand to other biomedical domains, supporting the development of more reliable AI systems for accelerating scientific discovery.
| Link | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11760394 | PMC |
| http://dx.doi.org/10.1101/2025.01.15.633272 | DOI |