The sequence-structure-function relationships that ultimately generate the diversity of observed extant proteins are complex, as proteins bridge the gap between multiple informational and physical scales involved in nearly all cellular processes. One limitation of existing protein annotation databases such as UniProt is that fewer than 1% of proteins have experimentally verified functions, so computational methods are needed to fill in the missing information. Here, we demonstrate that a multi-aspect framework based on protein language models can learn sequence-structure-function representations of amino acid sequences and can provide the foundation for sensitive, sequence-structure-function-aware protein sequence search and annotation. Based on this model, we introduce Protein-Vec, a multi-aspect information retrieval system for proteins covering sequence, structure, and function aspects, which enables computational protein annotation and function prediction at tree-of-life scales.
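The general idea of embedding-based retrieval for annotation can be illustrated with a minimal sketch. This is not the Protein-Vec implementation: the vectors below are hand-written stand-ins for embeddings that a protein language model would produce, and the names `cosine`, `transfer_annotations`, and the toy labels are hypothetical, chosen only for illustration. The sketch ranks annotated database entries by cosine similarity to a query embedding and returns the labels of the closest ones.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def transfer_annotations(query, database, k=1):
    """Return the k (similarity, label) pairs most similar to the query.

    `database` is a list of (embedding, label) pairs. In a real system the
    embeddings would come from a trained model; here they are assumed inputs.
    """
    scored = [(cosine(query, vec), label) for vec, label in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy, hand-written vectors standing in for learned embeddings (assumption).
toy_database = [
    ([1.0, 0.0], "kinase"),
    ([0.0, 1.0], "protease"),
]
hits = transfer_annotations([0.9, 0.1], toy_database, k=1)
```

A brute-force scan like this is adequate only for small collections; searching at the scale described above would normally rely on an approximate nearest-neighbor index, which this sketch does not attempt.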

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10690258
DOI: http://dx.doi.org/10.1101/2023.11.26.568742
