Protein language models have demonstrated remarkable performance in predicting the effects of missense variants, but DNA language models have not yet shown a competitive edge for complex genomes such as the human genome. This limitation is particularly evident in the noncoding regions, which comprise approximately 98% of the human genome. To tackle this challenge, we introduce GPN-MSA (genomic pretrained network with multiple-sequence alignment), a framework that leverages whole-genome alignments across multiple species while taking only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC and OMIM), experimental functional assays (deep mutational scanning and DepMap) and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and noncoding variants. We provide precomputed scores for all ~9 billion possible single-nucleotide variants in the human genome. We anticipate that our advances in genome-wide variant effect prediction will enable more accurate rare disease diagnosis and improve rare variant burden testing.
DOI: http://dx.doi.org/10.1038/s41587-024-02511-w
Cancer Cell Int
January 2025
Institute for Genome Engineered Animal Models of Human Diseases, National Center of Genetically Engineered Animal Models for International Research, Dalian Medical University, 9 West Section Lvshun South Road, Dalian, 116044, China.
Clear cell renal cell carcinoma (ccRCC) is a globally severe cancer with an unfavorable prognosis. PANoptosis, a form of cell death regulated by PANoptosomes, plays a role in numerous cancer types. However, the specific roles of genes associated with PANoptosis in the development and advancement of ccRCC remain unclear.
Respir Res
January 2025
Center for Endemic Disease Control, Chinese Center for Disease Control and Prevention, Center for Chronic Disease Prevention and Control, Harbin Medical University, Harbin, 150081, People's Republic of China.
Background: Chronic obstructive pulmonary disease (COPD) is a heterogeneous disease influenced by both environmental and genetic factors. Single nucleotide polymorphisms (SNPs) in the human genome may influence the risk of developing COPD and the response to treatment. We assessed the effects of gene polymorphisms of inflammatory and immune-active factors, and of gene-environment interaction, on the risk of COPD in middle-aged and older Chinese individuals.
BMC Biol
January 2025
Department of Environmental Sciences, University of Basel, Basel, Switzerland.
Background: Treponemal diseases are a significant global health risk, presenting challenges to public health and severe consequences to individuals if left untreated. Despite numerous genomic studies on Treponema pallidum and the known possible biases introduced by the choice of the reference genome used for mapping, few investigations have addressed how these biases affect phylogenetic and evolutionary analysis of these bacteria. In this study, we ascertain the importance of selecting an appropriate genomic reference on phylogenetic and evolutionary analyses of T.
BMC Genomics
January 2025
Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA.
The E. coli strains harboring the polyketide synthase (pks) island encode the genotoxin colibactin, a secondary metabolite reported to have severe implications for human health and for the progression of colorectal cancer. The present study involves whole-genome-wide comparison and phylogenetic analysis of pks harboring E.