AI Article Synopsis

  • Protein language modeling is a new deep learning technique in bioinformatics that can predict protein structures and design proteins, but its use for estimating sequence conservation has not been widely explored.
  • This study introduces a method for estimating sequence conservation without traditional alignments, using sequence embeddings from protein language models; among the models benchmarked, ESM2 models offer the best trade-off between performance and computational cost.
  • The proposed method can analyze full-length proteins in a single run, detect conserved functional sites in rapidly evolving regions, and is available as scripts at https://github.com/esbgkannan/kibby.

Article Abstract

Protein language modeling is a fast-emerging deep learning method in bioinformatics with diverse applications such as structure prediction and protein design. However, application toward estimating sequence conservation for functional site prediction has not been systematically explored. Here, we present a method for the alignment-free estimation of sequence conservation using sequence embeddings generated from protein language models. Comprehensive benchmarks across publicly available protein language models reveal that ESM2 models provide the best performance-to-computational-cost ratio for conservation estimation. Applying our method to full-length protein sequences, we demonstrate that embedding-based methods are not sensitive to the order of conserved elements: conservation scores can be calculated for multidomain proteins in a single run, without the need to separate individual domains. Our method can also identify conserved functional sites within fast-evolving sequence regions (such as domain inserts), which we demonstrate through the identification of conserved phosphorylation motifs in variable insert segments in protein kinases. Overall, embedding-based conservation analysis is a broadly applicable method for identifying potential functional sites in any full-length protein sequence and estimating conservation in an alignment-free manner. To run this on your protein sequence of interest, try our scripts at https://github.com/esbgkannan/kibby.
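
The idea of scoring every residue of a full-length sequence from its embedding can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how embeddings are mapped to conservation scores, so the linear model here is an assumption, and the function name `score_residues`, the three-dimensional toy embeddings, the weights, and the bias are all hypothetical. In practice the embeddings would come from a protein language model such as ESM2 and the weights from a fitted model.

```python
from typing import List, Sequence


def score_residues(
    embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> List[float]:
    """Map each per-residue embedding to a score in [0, 1].

    Assumption: a linear model (dot product plus bias) stands in for
    whatever mapping the real method uses.
    """
    scores = []
    for vec in embeddings:
        if len(vec) != len(weights):
            raise ValueError("embedding and weight dimensions differ")
        raw = sum(x * w for x, w in zip(vec, weights)) + bias
        # Clamp so the result reads as a normalized conservation score.
        scores.append(min(1.0, max(0.0, raw)))
    return scores


# Hypothetical toy data: one 3-dimensional embedding per residue.
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.1],
    [0.0, 0.0, 1.0],
]
toy_weights = [1.0, 0.5, -0.5]  # made-up values, not fitted parameters
toy_bias = 0.0

scores = score_residues(toy_embeddings, toy_weights, toy_bias)
for position, score in enumerate(scores, start=1):
    print(f"residue {position}: {score:.2f}")
```

Because each residue is scored from its own embedding vector, the whole sequence is handled in one pass and no alignment or domain splitting appears anywhere in the loop.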


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9851297
DOI: http://dx.doi.org/10.1093/bib/bbac599

Publication Analysis

Top Keywords

sequence conservation: 12
functional sites: 12
protein sequence: 12
protein language: 12
protein: 9
alignment-free estimation: 8
sequence: 8
estimation sequence: 8
sequence embeddings: 8
language models: 8
