Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex owing to polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios.

Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrated its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on the prediction of promoters, splice sites and transcription factor binding sites after easy fine-tuning on small task-specific labeled datasets. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences, improving interpretability and supporting accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks.

Availability and Implementation: The source code and the pre-trained and fine-tuned models for DNABERT are available on GitHub (https://github.com/jerryji1993/DNABERT).

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11025658
DOI: http://dx.doi.org/10.1093/bioinformatics/btab083
