Protein domain embeddings for fast and accurate similarity search.

Genome Res

Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA

Published: October 2024

Recently developed protein language models have enabled a variety of applications through the contextual protein embeddings they produce. Per-protein representations, in which each protein is represented as a vector of fixed dimension, can be derived by averaging the embeddings of individual residues or by applying matrix transformation techniques such as the discrete cosine transform (DCT) to matrices of residue embeddings. Such protein-level embeddings have been used for fast searches of similar proteins, but they have limitations: for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for single-domain proteins but not for multidomain proteins. Here, we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the DCT to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses predicted contact maps from ESM-2 for domain segmentation, which it solves with an algorithm called RecCut in time quadratic in the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We show that such domain-level contextual vectors enable fast and accurate detection of similarity both between proteins that share global similarities but have undefined extended regions between shared domains and between proteins that share only local similarities. In addition, tests on a database search benchmark show that DCTdomain is able to detect distant homologs by leveraging the structural information in the contextual embeddings.
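The idea of turning variable-length domains into fixed-length vectors with the DCT can be illustrated with a minimal sketch. It is not the DCTdomain implementation: it assumes residue embeddings are already available as a list of per-residue vectors (toy numbers stand in for ESM-2 output), it assumes domain boundaries are given rather than computed by RecCut, and it applies a plain one-dimensional DCT-II along the sequence to each embedding channel, keeping only the first few low-frequency coefficients. The names `dct_ii`, `domain_vector`, `protein_domain_vectors`, and `cosine_similarity`, and the parameter `n_coeffs`, are hypothetical.

```python
import math


def dct_ii(signal):
    """Unnormalized DCT-II of a 1-D sequence of floats."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def domain_vector(residue_embeddings, n_coeffs=3):
    """Compress an L x D block of residue embeddings into D * n_coeffs numbers.

    Assumption for illustration: each embedding channel is transformed
    independently along the sequence and only the n_coeffs lowest-frequency
    coefficients are kept, so domains of any length map to the same size.
    """
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    vector = []
    for d in range(dim):
        channel = [residue_embeddings[i][d] for i in range(length)]
        coeffs = dct_ii(channel)[:n_coeffs]
        # Zero-pad when the domain is shorter than n_coeffs residues.
        coeffs += [0.0] * (n_coeffs - len(coeffs))
        # Divide by the length so long and short domains are on a similar scale.
        vector.extend(c / length for c in coeffs)
    return vector


def protein_domain_vectors(residue_embeddings, domains, n_coeffs=3):
    """Return one vector per domain; domains are half-open (start, end) ranges."""
    return [domain_vector(residue_embeddings[s:e], n_coeffs) for s, e in domains]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    # Toy 10-residue protein with 2-dimensional embeddings and two assumed domains.
    toy_embeddings = [[float(i), 1.0] for i in range(10)]
    first, second = protein_domain_vectors(toy_embeddings, [(0, 4), (4, 10)], n_coeffs=2)
    print(len(first), len(second))
    print(cosine_similarity(first, second))
```

Because every domain yields a vector of the same length, two proteins can be compared domain by domain, for instance by taking the best cosine similarity over all domain pairs, which is one plausible way to catch local similarities that a single whole-protein vector would dilute.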


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11529836
DOI: http://dx.doi.org/10.1101/gr.279127.124


