A DNA language model based on multispecies alignment predicts the effects of genome-wide variants.

Nat Biotechnol

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, US.

Published: January 2025

Protein language models have demonstrated remarkable performance in predicting the effects of missense variants, but DNA language models have not yet shown a competitive edge for complex genomes such as that of humans. This limitation is particularly evident when dealing with the vast complexity of noncoding regions that comprise approximately 98% of the human genome. To tackle this challenge, we introduce GPN-MSA (genomic pretrained network with multiple-sequence alignment), a framework that leverages whole-genome alignments across multiple species while taking only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC and OMIM), experimental functional assays (deep mutational scanning and DepMap) and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and noncoding variants. We provide precomputed scores for all ~9 billion possible single-nucleotide variants in the human genome. We anticipate that our advances in genome-wide variant effect prediction will enable more accurate rare disease diagnosis and improve rare variant burden testing.

Source
http://dx.doi.org/10.1038/s41587-024-02511-w

Publication Analysis

Top Keywords

human genome (12)
dna language (8)
language models (8)
language model (4)
model based (4)
based multispecies (4)
multispecies alignment (4)
alignment predicts (4)
predicts effects (4)
effects genome-wide (4)

Similar Publications

CASP5 associated with PANoptosis promotes tumorigenesis and progression of clear cell renal cell carcinoma.

Cancer Cell Int

January 2025

Institute for Genome Engineered Animal Models of Human Diseases, National Center of Genetically Engineered Animal Models for International Research, Dalian Medical University, 9 West Section Lvshun South Road, Dalian, 116044, China.

Clear cell renal cell carcinoma (ccRCC) is a globally severe cancer with an unfavorable prognosis. PANoptosis, a form of cell death regulated by PANoptosomes, plays a role in numerous cancer types. However, the specific roles of genes associated with PANoptosis in the development and advancement of ccRCC remain unclear.

Interaction study of the effects of environmental exposure and gene polymorphisms of inflammatory and immune-active factors on chronic obstructive pulmonary disease.

Respir Res

January 2025

Center for Endemic Disease Control, Chinese Center for Disease Control and Prevention, Center for Chronic Disease Prevention and Control, Harbin Medical University, Harbin, 150081, People's Republic of China.

Background: Chronic obstructive pulmonary disease (COPD) is a heterogeneous disease, influenced by both environmental and genetic factors. Single nucleotide polymorphism (SNP) in the human genome may influence the risk of developing COPD and the response to treatment. We assessed the effects of gene polymorphism of inflammatory and immune-active factors and gene-environment interaction on risk of COPD in middle-aged and older Chinese individuals.

Background: Treponemal diseases are a significant global health risk, presenting challenges to public health and severe consequences to individuals if left untreated. Despite numerous genomic studies on Treponema pallidum and the known possible biases introduced by the choice of the reference genome used for mapping, few investigations have addressed how these biases affect phylogenetic and evolutionary analysis of these bacteria. In this study, we ascertain the importance of selecting an appropriate genomic reference on phylogenetic and evolutionary analyses of T.

Genomic characterization of Escherichia coli with a polyketide synthase (pks) island isolated from ulcerative colitis patients.

BMC Genomics

January 2025

Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA.

The E. coli strains harboring the polyketide synthase (pks) island encode the genotoxin colibactin, a secondary metabolite reported to have severe implications for human health and for the progression of colorectal cancer. The present study involves whole-genome-wide comparison and phylogenetic analysis of pks harboring E.
