PWAS (proteome-wide association study) is an innovative genetic association approach that complements widely used methods like GWAS (genome-wide association study). The PWAS approach involves consecutive phases. Initially, machine learning modeling and probabilistic considerations quantify the impact of genetic variants on protein-coding genes' biochemical functions.
Over three percent of people carry a dominant pathogenic variant, yet only a fraction of carriers develop disease. Disease phenotypes from carriers of variants in the same gene range from mild to severe. Here, we investigate underlying mechanisms for this heterogeneity: variable variant effect sizes, carrier polygenic backgrounds, and modulation of carrier effect by genetic background (marginal epistasis).
Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal.
Genetic studies of human traits have revolutionized our understanding of the variation between individuals, and yet, the genetics of most traits is still poorly understood. In this review, we highlight the major open problems that need to be solved, and by discussing these challenges provide a primer to the field. We cover general issues such as population structure, epistasis and gene-environment interactions, data-related issues such as ancestry diversity and rare genetic variants, and specific challenges related to heritability estimates, genetic association studies, and polygenic risk scores.
Summary: Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins.
NAR Genom Bioinform
September 2021
Human genetic variation in coding regions is fundamental to the study of protein structure and function. Most methods for interpreting missense variants consider substitution measures derived from homologous proteins across different species. In this study, we introduce human-specific amino acid (AA) substitution matrices that are based on genetic variations in the modern human population.
The characterization of germline genetic variation affecting cancer risk, known as cancer predisposition, is fundamental to preventive and personalized medicine. Studies of genetic cancer predisposition typically identify significant genomic regions based on family-based cohorts or genome-wide association studies (GWAS). However, the results of such studies rarely provide biological insight or functional interpretation.
One of the major challenges in the post-genomic era is elucidating the genetic basis of human diseases. In recent years, studies have shown that polygenic risk scores (PRS), based on aggregated information from millions of variants across the human genome, can estimate individual risk for common diseases. Nevertheless, current medical practice still relies predominantly on physiological and clinical indicators to assess personal disease risk.
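As an illustration of how a polygenic risk score aggregates information across variants, the following minimal sketch computes the standard weighted sum of risk-allele dosages. The function name, the effect sizes, and the genotypes are hypothetical toy values chosen for illustration; they are not taken from the study described above.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float],
                         effect_sizes: Sequence[float]) -> float:
    """Weighted sum of risk-allele dosages (0, 1 or 2 copies per variant).

    Each dosage is multiplied by that variant's effect size (for example,
    a log odds ratio estimated in an association study) and the products
    are summed into a single score for the individual.
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Hypothetical toy data: three variants and two individuals.
effect_sizes = [0.5, -0.25, 1.0]
person_a = [0, 1, 2]  # copies of the risk allele at each variant
person_b = [2, 0, 0]

score_a = polygenic_risk_score(person_a, effect_sizes)
score_b = polygenic_risk_score(person_b, effect_sizes)
```

In real use the sum would run over far more variants, and the raw score is usually interpreted relative to the score distribution in a reference population rather than as an absolute risk.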
Comput Struct Biotechnol J
March 2021
Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins.
Contemporary catalogues of cancer driver genes rely primarily on high mutation rates as evidence for gene selection in tumors. Here, we present The Functional Alteration Bias Recovery In Coding-regions Cancer Portal, a comprehensive catalogue of gene selection in cancer based purely on the biochemical functional effects of mutations at the protein level. Gene selection in the portal is quantified by combining genomics data with rich proteomic annotations.
We introduce Proteome-Wide Association Study (PWAS), a new method for detecting gene-phenotype associations mediated by protein function alterations. PWAS aggregates the signal of all variants jointly affecting a protein-coding gene and assesses their overall impact on the protein's function using machine learning and probabilistic models. Subsequently, it tests whether the gene exhibits functional variability between individuals that correlates with the phenotype of interest.
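The two steps described above (aggregate variant effects into a per-gene functional score, then test that score against a phenotype) can be sketched with a toy example. This is not the PWAS implementation: the aggregation rule (a product of per-variant probabilities that function is retained), the plain Pearson correlation, and all names and numbers below are assumptions made purely for illustration.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence


def gene_function_score(carried: List[str],
                        retained_prob: Dict[str, float]) -> float:
    """Toy gene-level score for one individual.

    Assumption: multiply, over the variants the individual carries, the
    probability that each variant leaves the protein's function intact.
    A score of 1.0 means no predicted damage.
    """
    score = 1.0
    for variant in carried:
        score *= retained_prob[variant]
    return score


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-variant predictions for a single gene.
retained_prob = {"v1": 0.5, "v2": 0.8}

# Hypothetical cohort: variants carried by each of four individuals,
# and a quantitative phenotype measured for each of them.
carriers = [[], ["v1"], ["v2"], ["v1", "v2"]]
phenotype = [1.0, 2.0, 1.5, 2.5]

scores = [gene_function_score(c, retained_prob) for c in carriers]
r = pearson_r(scores, phenotype)  # negative here: lower function, higher phenotype
```

A real analysis would derive the per-variant probabilities from a trained model and would test the association with a regression that adjusts for covariates such as sex, age, and ancestry, rather than with a raw correlation.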
Background: In recent years, research on cancer predisposition germline variants has emerged as a prominent field. The identification of somatic mutations depends on a reliable mapping of the patient's germline variants. In addition, the statistics of germline variant frequencies in healthy individuals and cancer patients are the basis for seeking candidate cancer predisposition genes.
Nucleic Acids Res
July 2019
Compiling the catalogue of genes actively involved in cancer is an ongoing endeavor, with profound implications for the understanding and treatment of the disease. An abundance of computational methods has been developed to screen the genome for candidate driver genes based on genomic data of somatic mutations in tumors. Existing methods make many implicit and explicit assumptions about the distribution of random mutations.
Viruses are the most prevalent infectious agents, populating almost every ecosystem on earth. Most viruses carry only a handful of genes supporting their replication and the production of capsids. It came as a great surprise in 2003 when the first giant virus was discovered and found to have a >1 Mbp genome encoding almost a thousand proteins.
Determining residue-level protein properties, such as sites of post-translational modifications (PTMs), is vital to understanding protein function. Experimental methods are costly and time-consuming, while traditional rule-based computational methods fail to annotate sites lacking substantial similarity. Machine Learning (ML) methods are becoming fundamental in annotating unknown proteins and their heterogeneous properties.
Background: Viruses are the simplest replicating units, characterized by a limited number of coding genes and an exceptionally high rate of overlapping genes. We sought a unified evolutionary explanation that accounts for their genome sizes, gene overlapping and capsid properties.
Results: We performed an unbiased statistical analysis of ~100 families within ~400 genera that comprise the currently known viral world.