Deep embedding and alignment of protein sequences.

Felipe Llinares-López, Quentin Berthet, Mathieu Blondel, Olivier Teboul, Jean-Philippe Vert

Nat Methods

Brain Team, Google Research, Paris, France.

Published: January 2023

Protein sequence alignment is a key component of most bioinformatics pipelines for studying the structures and functions of proteins. Aligning highly divergent sequences nevertheless remains a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and in differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. We show that, once trained, DEDAL improves alignment correctness on remote homologs by up to two- or threefold over existing methods and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks that rely on sequence alignment in structural and functional genomics.
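To make the phrase "differentiable alignment" concrete, the following is a minimal sketch and not the DEDAL implementation. It assumes a Smith-Waterman-style local alignment with a linear gap penalty, in which every max() is replaced by a temperature-scaled log-sum-exp so that the score varies smoothly with the substitution scores. The function names, the dot-product scoring of embeddings, and the parameters `gap` and `temperature` are all hypothetical choices made for illustration.

```python
import math


def smooth_max(values, temperature):
    """Temperature-scaled log-sum-exp, a smooth upper bound on max(values)."""
    m = max(values)  # shift by the max for numerical stability
    total = sum(math.exp((v - m) / temperature) for v in values)
    return m + temperature * math.log(total)


def dot_similarity(emb_x, emb_y):
    """Substitution scores as dot products of per-residue embedding vectors.

    Assumption for illustration only: any scoring function could be used here.
    """
    return [[sum(a * b for a, b in zip(u, v)) for v in emb_y] for u in emb_x]


def local_alignment_score(sim, gap=1.0, temperature=1.0, soft=True):
    """Local alignment score over a substitution-score matrix `sim`.

    soft=False gives the classic Smith-Waterman recursion (linear gaps);
    soft=True swaps each max for smooth_max, making the score differentiable.
    """
    if soft:
        reduce = lambda vs: smooth_max(vs, temperature)
    else:
        reduce = max
    n, m = len(sim), len(sim[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    cells = [0.0]  # the empty alignment scores 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = reduce([
                0.0,                                # start a new local alignment
                H[i - 1][j - 1] + sim[i - 1][j - 1],  # align residue i with residue j
                H[i - 1][j] - gap,                  # gap in the second sequence
                H[i][j - 1] - gap,                  # gap in the first sequence
            ])
            cells.append(H[i][j])
    return reduce(cells)  # best (or smoothed best) cell anywhere in the table
```

As the temperature approaches zero the smoothed score approaches the classic one; at higher temperatures it blends contributions from many candidate alignments, which is what allows gradients to flow back to whatever model produced the scores.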

DOI: http://dx.doi.org/10.1038/s41592-022-01700-2
