Background: Pretraining large-scale neural language models on raw text has contributed substantially to transfer learning in natural language processing. With the introduction of transformer-based language models such as bidirectional encoder representations from transformers (BERT), the performance of information extraction from free text has improved markedly in both the general and medical domains. However, it is difficult to pretrain domain-specific BERT models that perform well in domains for which few large, high-quality corpora are publicly available.

Objective: We hypothesized that this problem could be addressed by oversampling a domain-specific corpus and pretraining on it together with a larger general corpus in a balanced manner. In the present study, we verified this hypothesis by developing pretrained models with our method and evaluating their performance.

Methods: Our proposed method pretrains a single model simultaneously on corpora from distinct domains after oversampling the smaller one. We conducted three experiments in which we built (1) an English biomedical BERT from a small biomedical corpus, (2) a Japanese medical BERT from a small medical corpus, and (3) an enhanced biomedical BERT pretrained on the complete set of PubMed abstracts in a balanced manner. We then compared their performance with that of conventionally pretrained models.
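The core idea, balanced pretraining after oversampling, can be illustrated with a short sketch. The following Python snippet is our own minimal illustration of that idea, not the authors' released code; the function name and corpus variables are hypothetical placeholders, and in practice the mixed corpus would feed a standard BERT masked-language-model pretraining pipeline.

import random

def balance_by_oversampling(general_docs, domain_docs, seed=0):
    """Oversample the small domain corpus so that both corpora contribute
    a comparable number of documents to the pretraining mix.

    Hypothetical sketch of the balanced-oversampling idea; not the
    authors' implementation.
    """
    rng = random.Random(seed)
    # Repeat the domain corpus whole, then top up with a random sample,
    # so the oversampled corpus matches the general corpus in size.
    repeats, remainder = divmod(len(general_docs), len(domain_docs))
    oversampled = list(domain_docs) * repeats + rng.sample(list(domain_docs), remainder)
    mixed = list(general_docs) + oversampled
    rng.shuffle(mixed)  # interleave the two domains for simultaneous pretraining
    return mixed

# Placeholder corpora; real inputs would be, e.g., general Wikipedia text
# and PubMed abstracts or Japanese clinical text.
general = [f"general sentence {i}" for i in range(100_000)]
domain = [f"medical sentence {i}" for i in range(2_000)]
pretraining_corpus = balance_by_oversampling(general, domain)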

Results: Our English BERT, pretrained on both a general corpus and a small medical-domain corpus, performed well enough for practical use on the biomedical language understanding evaluation (BLUE) benchmark. Moreover, for a biomedical corpus of a given size, our proposed method was more effective than conventional pretraining on a general-domain corpus of the same size. Our Japanese medical BERT outperformed BERT models built with conventional methods on almost all the medical tasks, showing the same trend as the first experiment in English. Furthermore, our enhanced biomedical BERT, although never pretrained on clinical notes, achieved superior clinical and biomedical scores on the BLUE benchmark, exceeding the models trained without our proposed method by 0.3 points on the clinical score and 0.5 points on the biomedical score.

Conclusions: Well-balanced pretraining that oversamples instances from a corpus appropriate for the target task allowed us to construct a high-performance BERT model.

Source: http://dx.doi.org/10.1016/j.artmed.2024.102889
