Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods typically bridge this sequence-annotation gap either through homology-based annotation transfer, which identifies sequence-similar proteins with known function, or through prediction methods that use evolutionary information. Here, we propose predicting GO terms through annotation transfer based on the proximity of proteins in the SeqVec embedding space rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec), which transfer the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the annotation transfer to proteins with < 20% pairwise sequence identity to the query, performance drops (Fmax BPO 33 ± 2%, MFO 43 ± 3%, CCO 53 ± 2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for proteins from smaller families or proteins with intrinsically disordered regions.
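The idea of transferring annotations by proximity in embedding space can be illustrated with a minimal sketch. It assumes each protein is already represented by a fixed-length embedding vector; the function names, the toy lookup set, and the distance-to-score mapping `1 / (1 + d)` are illustrative assumptions and are not taken from the abstract.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_go_terms(query, lookup, k=1):
    """Transfer GO terms from the k nearest annotated proteins.

    query  -- embedding of the unannotated protein (list of floats)
    lookup -- dict: protein id -> (embedding, set of GO term ids)
    Returns a dict: GO term id -> score in (0, 1].
    """
    # Rank annotated proteins by distance to the query in embedding space.
    ranked = sorted(
        ((euclidean(query, emb), terms) for emb, terms in lookup.values()),
        key=lambda pair: pair[0],
    )[:k]

    scores = {}
    for dist, terms in ranked:
        # Assumed mapping from distance to a confidence-like score:
        # distance 0 gives 1.0, larger distances approach 0.
        confidence = 1.0 / (1.0 + dist)
        for term in terms:
            # Keep the best score seen for each term.
            scores[term] = max(scores.get(term, 0.0), confidence)
    return scores


# Toy lookup set with 2-dimensional placeholder embeddings (hypothetical data).
EXAMPLE_LOOKUP = {
    "protein_A": ([0.0, 0.0], {"GO:0005524"}),
    "protein_B": ([3.0, 4.0], {"GO:0005634"}),
}

if __name__ == "__main__":
    print(transfer_go_terms([0.0, 0.0], EXAMPLE_LOOKUP, k=1))
```

With `k=1` the query inherits only the terms of its single closest neighbor; a larger `k` pools terms from several neighbors and keeps the highest score per term.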
| Full-text link | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7806674 | PMC |
| http://dx.doi.org/10.1038/s41598-020-80786-0 | DOI |
Front Plant Sci
December 2024
Research and Innovation Centre, Fondazione Edmund Mach, San Michele all'Adige, TN, Italy.
Flavescence dorée (FD) poses a significant threat to grapevine health, with the American grapevine leafhopper serving as the primary vector. FD is responsible for yield losses and high production costs due to mandatory insecticide treatments, infected plant uprooting, and replanting. Another potential FD vector is the mosaic leafhopper, commonly found in agroecosystems.
BMC Genomics
December 2024
Department of Biological Sciences, University of Bergen, Bergen, N-5020, Norway.
Background: Fervidobacterium is a genus of thermophilic anaerobic Gram-negative rod-shaped bacteria belonging to the phylum Thermotogota. They can grow through fermentation on a wide range of sugars and protein-rich substrates. Some can also break down feather keratin, which has significant biotechnological potential.
J Chem Inf Model
December 2024
College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.
Enzymes are ubiquitous catalysts with enormous application potential in biomedicine, green chemistry, and biotechnology. However, accurately predicting whether a molecule serves as a substrate for a specific enzyme, especially for novel entities, remains a significant challenge. Compared with traditional experimental methods, computational approaches are much more resource-efficient and time-saving, but they often compromise on accuracy.
Curr Protoc
December 2024
Department of Biomedical Sciences, University of Padova, Padova, Italy.
Intrinsically disordered proteins (IDPs) make up around 30% of eukaryotic proteomes and play a crucial role in cellular processes and in pathological conditions such as neurodegenerative disorders and cancers. However, IDPs exhibit dynamic conformational ensembles and are often involved in the formation of biomolecular condensates. Understanding the function of IDPs is critical to research in many areas of science.
G3 (Bethesda)
December 2024
Departamento de Genética e Biologia Evolutiva, Instituto de Biociências, Universidade de São Paulo, Rua do Matão, 277, CEP 05508-090, São Paulo, SP, Brazil.
Tetrapedia diversipes is a Neotropical solitary bee commonly found in trap-nests, known for its morphological adaptations for floral oil collection and prepupal diapause during the cold and dry season. Here, we present the genome assembly of T. diversipes (332 Mbp), comprising 2,575 scaffolds, with 15,028 predicted protein-coding genes.