BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning.

Brief Bioinform

Science Center for Future Foods, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China.

Published: March 2024

Transcription factors (TFs) are proteins that regulate gene transcription by binding to transcription factor binding sites (TFBSs) in DNA sequences. Accurate prediction of TFBSs can contribute to the design and construction of TF-based metabolic regulatory systems. Although various deep-learning algorithms have been developed for predicting TFBSs, their prediction performance still needs improvement. This paper proposes a bidirectional encoder representations from transformers (BERT)-based model, called BERT-TFBS, that predicts TFBSs from DNA sequences alone. The model consists of a pre-trained BERT module (DNABERT-2), a convolutional neural network (CNN) module, a convolutional block attention module (CBAM) and an output module. BERT-TFBS uses the pre-trained DNABERT-2 module to capture the complex long-term dependencies in DNA sequences through transfer learning, and applies the CNN module and the CBAM to extract high-order local features. The model is trained and tested on 165 ENCODE ChIP-seq datasets. We conducted experiments with model variants, cross-cell-line validations and comparisons with other models. The experimental results demonstrate the effectiveness and generalization capability of BERT-TFBS in predicting TFBSs and show that it outperforms other deep-learning models. The source code for BERT-TFBS is available at https://github.com/ZX1998-12/BERT-TFBS.
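To make the CBAM step concrete, the following is a minimal pure-Python sketch of the general CBAM idea (channel attention followed by spatial attention), not the authors' implementation. It assumes a simplified feature map of shape channels × positions and a 1D convolution for the spatial step; the function names (`channel_attention`, `spatial_attention`, `cbam`), the weight shapes and the random demo values are all hypothetical and chosen only for illustration. The actual model is in the repository linked above.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp(vec, w1, w2):
    """Shared two-layer MLP with ReLU: vec (C) -> hidden (H) -> out (C)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(fmap, w1, w2):
    """Scale each channel by a weight built from its avg- and max-pooled summary."""
    avg = [sum(ch) / len(ch) for ch in fmap]
    mx = [max(ch) for ch in fmap]
    scores = [sigmoid(a + m) for a, m in zip(mlp(avg, w1, w2), mlp(mx, w1, w2))]
    return [[s * x for x in ch] for s, ch in zip(scores, fmap)]


def spatial_attention(fmap, kernel):
    """Scale each position by a weight from a 1D conv over channel-pooled maps.

    kernel[0] is applied to the channel-average map, kernel[1] to the
    channel-max map; positions outside the sequence are zero-padded.
    """
    n_pos = len(fmap[0])
    avg = [sum(ch[i] for ch in fmap) / len(fmap) for i in range(n_pos)]
    mx = [max(ch[i] for ch in fmap) for i in range(n_pos)]
    width = len(kernel[0])
    half = width // 2
    weights = []
    for i in range(n_pos):
        total = 0.0
        for k in range(width):
            j = i + k - half
            if 0 <= j < n_pos:
                total += kernel[0][k] * avg[j] + kernel[1][k] * mx[j]
        weights.append(sigmoid(total))
    return [[w * x for w, x in zip(weights, ch)] for ch in fmap]


def cbam(fmap, w1, w2, kernel):
    """Apply channel attention, then spatial attention; shape is unchanged."""
    return spatial_attention(channel_attention(fmap, w1, w2), kernel)


if __name__ == "__main__":
    # Hypothetical sizes: 4 channels, 6 positions, hidden width 2, kernel width 3.
    rng = random.Random(0)
    C, L, H, K = 4, 6, 2, 3
    fmap = [[rng.uniform(-1, 1) for _ in range(L)] for _ in range(C)]
    w1 = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(H)]
    w2 = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(C)]
    kernel = [[rng.uniform(-1, 1) for _ in range(K)] for _ in range(2)]
    out = cbam(fmap, w1, w2, kernel)
    print(len(out), len(out[0]))
```

Because both attention weights come from a sigmoid, every output value is the input value scaled by a factor between 0 and 1, so the module reweights features without changing the shape of the feature map.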

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11066948
DOI: http://dx.doi.org/10.1093/bib/bbae195
