Protein language models can capture protein quaternary state.

Orly Avraham, Tomer Tsaban, Ziv Ben-Aharon, Linoy Tsaban, Ora Schueler-Furman

BMC Bioinformatics

Department of Microbiology and Molecular Genetics, Faculty of Medicine, Institute for Biomedical Research Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, Israel.

Published: November 2023

Determining a protein's quaternary state is vital for understanding its function, as many proteins need to form multimers to be active.
Protein language models such as ESM-2 predict protein characteristics from sequence alone, and their embeddings may likewise capture information about quaternary state.
The resulting model, QUEEN, distinguishes multimers from monomers, indicating that protein sequence embeddings carry information relevant to quaternary state prediction, although QUEEN performs worse than methods that use solved structures.

Background: Determining a protein's quaternary state, i.e. the number of monomers in a functional unit, is a critical step in protein characterization. Many proteins form multimers for their activity, and over 50% are estimated to naturally form homomultimers. Experimental quaternary state determination can be challenging and require extensive work. To complement these efforts, a number of computational tools have been developed for quaternary state prediction, often utilizing experimentally validated structural information. Recently, dramatic advances have been made in the field of deep learning for predicting protein structure and other characteristics. Protein language models, such as ESM-2, that apply computational natural-language models to proteins successfully capture secondary structure, protein cell localization and other characteristics, from a single sequence. Here we hypothesize that information about the protein quaternary state may be contained within protein sequences as well, allowing us to benefit from these novel approaches in the context of quaternary state prediction.

Results: We generated ESM-2 embeddings for a large dataset of proteins with quaternary state labels from the curated QSbio dataset. We trained a model for quaternary state classification and assessed it on a non-overlapping set of distinct folds (ECOD family level). Our model, named QUEEN (QUaternary state prediction using dEEp learNing), performs worse than approaches that include information from solved crystal structures. However, it successfully learned to distinguish multimers from monomers, and predicts the specific quaternary state with moderate success, better than simple sequence similarity-based annotation transfer. Our results demonstrate that complex, quaternary state related information is included in such embeddings.
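The workflow described in the Results can be illustrated with a minimal sketch. The abstract does not state how per-residue embeddings are pooled or which classifier QUEEN uses, so everything below is an assumption made for illustration: synthetic vectors stand in for ESM-2 embeddings, pooling is a plain mean over residues, the classifier is a nearest-centroid rule, and the function names (`mean_pool`, `split_by_family`, `fit_centroids`, `predict`) are hypothetical. The point it shows is the evaluation design, where whole families are held out so that no family contributes to both training and testing.

```python
import random
from collections import defaultdict


def mean_pool(per_residue):
    """Average per-residue vectors into one fixed-length protein vector."""
    n = len(per_residue)
    dim = len(per_residue[0])
    return [sum(row[j] for row in per_residue) / n for j in range(dim)]


def split_by_family(families, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with no family present in both sets."""
    unique = sorted(set(families))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_families = set(unique[:n_test])
    train_idx = [i for i, f in enumerate(families) if f not in test_families]
    test_idx = [i for i, f in enumerate(families) if f in test_families]
    return train_idx, test_idx


def fit_centroids(vectors, labels):
    """Compute the mean vector of each class (nearest-centroid 'training')."""
    sums = {}
    counts = defaultdict(int)
    for v, y in zip(vectors, labels):
        if y not in sums:
            sums[y] = [0.0] * len(v)
        sums[y] = [a + b for a, b in zip(sums[y], v)]
        counts[y] += 1
    return {y: [a / counts[y] for a in s] for y, s in sums.items()}


def predict(centroids, v):
    """Return the class whose centroid is closest in squared distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], v)),
    )


# Synthetic stand-in data (NOT real ESM-2 embeddings or QSbio labels):
# 60 toy proteins, 10 toy families, labels 1 (monomer) and 2 (dimer).
rng = random.Random(1)
DIM = 4
labels = [1 + (i // 10) % 2 for i in range(60)]
families = [f"fam{i % 10}" for i in range(60)]
proteins = []
for y in labels:
    length = rng.randint(5, 15)  # variable sequence length
    proteins.append([[rng.gauss(y, 1.0) for _ in range(DIM)] for _ in range(length)])

# One vector per protein, regardless of sequence length.
vectors = [mean_pool(p) for p in proteins]

# Hold out whole families, fit on the rest, evaluate on the held-out set.
train_idx, test_idx = split_by_family(families)
centroids = fit_centroids(
    [vectors[i] for i in train_idx], [labels[i] for i in train_idx]
)
predictions = [predict(centroids, vectors[i]) for i in test_idx]
accuracy = sum(p == labels[i] for p, i in zip(predictions, test_idx)) / len(test_idx)

print(f"held-out families: {sorted({families[i] for i in test_idx})}")
print(f"accuracy on held-out families: {accuracy:.2f}")
```

Splitting by family rather than by individual protein keeps close homologs of a test protein out of the training set, which is why a sequence-similarity baseline is a meaningful comparison in this setting.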

Conclusions: QUEEN is the first to investigate the power of embeddings for the prediction of the quaternary state of proteins. As such, it lays out strengths as well as limitations of a sequence-based protein language model approach, compared to structure-based approaches. Since it does not require any structural information and is fast, we anticipate that it will be of wide use both for in-depth investigation of specific systems, as well as for studies of large sets of protein sequences. A simple colab implementation is available at: https://colab.research.google.com/github/Furman-Lab/QUEEN/blob/main/QUEEN_prediction_notebook.ipynb

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10647083
DOI: http://dx.doi.org/10.1186/s12859-023-05549-w

