Embeddings from deep learning transfer GO annotations beyond homology.

Sci Rep

Department of Informatics, Bioinformatics and Computational Biology, i12, TUM (Technical University of Munich), Boltzmannstr. 3, Garching, 85748, Munich, Germany.

Published: January 2021

AI Article Synopsis

  • Understanding protein function is essential for advances in molecular and medical biology, but fewer than 0.5% of known proteins have experimental function annotations in the Gene Ontology (GO).
  • The study predicts GO terms by transferring annotations between proteins that lie close together in SeqVec embedding space, a representation learned by a deep language model, rather than between proteins with similar sequences.
  • Results show that this method achieves competitive performance in the CAFA3 assessments, particularly excelling in annotating proteins with low sequence similarity, suggesting significant potential for protein annotation improvements.
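
The idea of transferring annotations by proximity in embedding space can be sketched as a nearest-neighbour lookup. The following is a minimal illustration, not the authors' implementation: the function name `transfer_go_terms`, the toy two-dimensional vectors, the toy annotations, the Euclidean metric, and the distance-to-score mapping `1 / (1 + d)` are all assumptions made for this example. Real SeqVec embeddings have far more dimensions and would be computed by the language model.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_go_terms(query, lookup, annotations, k=1):
    """Copy GO terms from the k annotated proteins closest to the query.

    query:       embedding vector of the unannotated protein
    lookup:      dict mapping protein id -> embedding vector
    annotations: dict mapping protein id -> set of GO terms
    Returns a dict mapping GO term -> score in (0, 1].
    """
    # Rank all annotated proteins by distance to the query embedding.
    ranked = sorted((euclidean(query, emb), pid) for pid, emb in lookup.items())
    scores = defaultdict(float)
    for dist, pid in ranked[:k]:
        # Assumed scoring: closer neighbours give higher confidence.
        similarity = 1.0 / (1.0 + dist)
        for term in annotations[pid]:
            scores[term] = max(scores[term], similarity)
    return dict(scores)


# Toy lookup set (hypothetical vectors and annotations, for illustration only).
lookup = {"P1": [0.0, 0.0], "P2": [1.0, 0.0], "P3": [5.0, 5.0]}
annotations = {
    "P1": {"GO:0005524"},
    "P2": {"GO:0005524", "GO:0016301"},
    "P3": {"GO:0005634"},
}

if __name__ == "__main__":
    for term, score in sorted(transfer_go_terms([0.1, 0.0], lookup, annotations, k=2).items()):
        print(term, round(score, 3))
```

With `k=2` the query inherits terms only from its two closest neighbours, so the distant protein `P3` contributes nothing. Sequence identity never enters the calculation, which is why such a scheme can in principle annotate proteins that lack detectable homologs.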

Article Abstract

Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods typically bridge this sequence-annotation gap either through homology-based annotation transfer, which identifies sequence-similar proteins with known function, or through prediction methods that use evolutionary information. Here, we propose predicting GO terms through annotation transfer based on the proximity of proteins in SeqVec embedding space rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec), which transfer the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the annotation transfer to proteins with < 20% pairwise sequence identity to the query, performance drops (Fmax: BPO 33 ± 2%, MFO 43 ± 3%, CCO 53 ± 2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for proteins from smaller families or proteins with intrinsically disordered regions.
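
The F values quoted in the abstract are, in CAFA, the protein-centric maximum F-measure (Fmax): the best harmonic mean of precision and recall over all score thresholds. The sketch below shows one way to compute it under simplifying assumptions. The function name `f_max`, the threshold grid in steps of 0.01, and the toy term labels are hypothetical, and the sketch omits details of the official evaluation such as propagating terms up the ontology and estimating the ± intervals.

```python
def f_max(predictions, truth, thresholds=None):
    """Protein-centric Fmax.

    predictions: dict protein id -> dict of term -> score in [0, 1]
    truth:       dict protein id -> set of experimentally annotated terms
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            if not true_terms:
                continue  # proteins without annotations cannot be evaluated
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            # Precision is averaged only over proteins with a prediction at t.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over all evaluated proteins.
            recalls.append(hits / len(true_terms))
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Toy data with made-up term labels, for illustration only.
truth = {"A": {"T1", "T2"}, "B": {"T3"}}
predictions = {"A": {"T1": 0.9, "T4": 0.4}, "B": {"T3": 0.6}}

if __name__ == "__main__":
    print(f_max(predictions, truth))
```

Because precision counts only proteins that receive a prediction at a given threshold while recall counts every evaluated protein, a method that predicts confidently for a few proteins and stays silent on the rest is penalised through recall.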

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7806674
DOI: http://dx.doi.org/10.1038/s41598-020-80786-0

Similar Publications

Flavescence dorée (FD) poses a significant threat to grapevine health, with the American grapevine leafhopper serving as the primary vector. FD is responsible for yield losses and high production costs due to mandatory insecticide treatments, uprooting of infected plants, and replanting. Another potential FD vector is the mosaic leafhopper, which is commonly found in agroecosystems.

Background: Fervidobacterium is a genus of thermophilic anaerobic Gram-negative rod-shaped bacteria belonging to the phylum Thermotogota. They can grow through fermentation on a wide range of sugars and protein-rich substrates. Some can also break down feather keratin, which has significant biotechnological potential.

Deep Learning-Driven Insights into Enzyme-Substrate Interaction Discovery.

J Chem Inf Model

December 2024

College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.

Enzymes are ubiquitous catalysts with enormous application potential in biomedicine, green chemistry, and biotechnology. However, accurately predicting whether a molecule serves as a substrate for a specific enzyme, especially for novel entities, remains a significant challenge. Compared with traditional experimental methods, computational approaches are much more resource-efficient and time-saving, but they often compromise on accuracy.

Intrinsically disordered proteins (IDPs) make up around 30% of eukaryotic proteomes and play a crucial role in cellular processes and in pathological conditions such as neurodegenerative disorders and cancers. However, IDPs exhibit dynamic conformational ensembles and are often involved in the formation of biomolecular condensates. Understanding the function of IDPs is critical to research in many areas of science.

The genome of the solitary bee Tetrapedia diversipes (Hymenoptera, Apidae).

G3 (Bethesda)

December 2024

Departamento de Genética e Biologia Evolutiva, Instituto de Biociências, Universidade de São Paulo, Rua do Matão, 277, CEP 05508-090, São Paulo, SP, Brazil.

Tetrapedia diversipes is a Neotropical solitary bee commonly found in trap-nests, known for its morphological adaptations for floral oil collection and prepupal diapause during the cold and dry season. Here, we present the genome assembly of T. diversipes (332 Mbp), comprising 2,575 scaffolds, with 15,028 predicted protein-coding genes.
