Enhancing personalized gene expression prediction from DNA sequences using genomic foundation models.

HGG Adv

Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN, USA.

Published: October 2024

Artificial intelligence (AI)/deep learning (DL) models that predict molecular phenotypes like gene expression directly from DNA sequences have recently emerged. While these models have proven effective at capturing the variation across genes, their ability to explain inter-individual differences has been limited. We hypothesize that the performance gap can be narrowed through the use of pre-trained embeddings from the Nucleotide Transformer, a large foundation model trained on 3,000+ genomes. We train a transformer model using the pre-trained embeddings and compare its predictive performance to Enformer, the current state-of-the-art model, using genotype and expression data from 290 individuals. Our model significantly outperforms Enformer in terms of correlation across individuals, and narrows the performance gap with an elastic net regression approach that uses only the genetic variants as predictors. Although simple regression models have their advantages in personalized prediction tasks, DL approaches based on foundation models pre-trained on diverse genomes have unique strengths in flexibility and interpretability. With further methodological and computational improvements and more training data, these models may eventually predict molecular phenotypes from DNA sequences with an accuracy surpassing that of regression-based approaches. Our work demonstrates the potential for large pre-trained AI/DL models to advance functional genomics.
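
The abstract distinguishes between capturing variation across genes and explaining differences across individuals. The following is a minimal sketch of how those two correlations can diverge, not the authors' evaluation code. The function names (`pearson`, `cross_individual_r`, `cross_gene_r`), the gene labels, and all numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def cross_individual_r(observed, predicted):
    """Per gene: correlate predicted and observed expression across individuals."""
    return {gene: pearson(observed[gene], predicted[gene]) for gene in observed}


def cross_gene_r(observed, predicted):
    """Correlate per-gene mean expression (averaged over individuals) across genes."""
    genes = sorted(observed)
    obs_means = [sum(observed[g]) / len(observed[g]) for g in genes]
    pred_means = [sum(predicted[g]) / len(predicted[g]) for g in genes]
    return pearson(obs_means, pred_means)


# Hypothetical toy data: three genes, four individuals each (made-up values).
observed = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [10.0, 11.0, 12.0, 13.0],
    "GENE_C": [20.0, 21.0, 22.0, 23.0],
}
# Predictions match every gene's mean level, but the ordering of individuals
# is reversed for GENE_A and only partly right for GENE_C.
predicted = {
    "GENE_A": [4.0, 3.0, 2.0, 1.0],
    "GENE_B": [10.0, 11.0, 12.0, 13.0],
    "GENE_C": [21.0, 22.0, 21.0, 22.0],
}

across_genes = cross_gene_r(observed, predicted)
per_gene = cross_individual_r(observed, predicted)

print("cross-gene r:", round(across_genes, 3))
for gene, r in sorted(per_gene.items()):
    print(gene, "cross-individual r:", round(r, 3))
```

In this toy setup the predictions reproduce each gene's average level, so the cross-gene correlation is high, while the per-gene correlations across individuals range from negative to positive. That is the kind of gap the abstract describes for sequence-based models.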

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11416237
DOI: http://dx.doi.org/10.1016/j.xhgg.2024.100347
