AI Article Synopsis

  • Current methods for predicting bacterial gene functions are limited due to a lack of database matches for novel proteins, necessitating improved prediction techniques.
  • A new tool called SAFPred has been developed, utilizing protein embeddings from advanced language models to enhance gene function prediction in bacteria while incorporating their unique operon structures.
  • SAFPred demonstrated superior performance over traditional methods and identified 11 potential novel toxins in enterococci, which could have important health implications.

Article Abstract

Motivation: Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for bacteria. Recently, transformer-based language models, adopted from the natural language processing field, have been used to obtain new representations of proteins, to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications on bacterial genomes.

Results: To predict gene functions in bacteria, we developed SAFPred, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAFPred also leverages the unique operon structure of bacteria through conserved synteny. SAFPred outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAFPred to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health.
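
The idea of transferring function labels through embedding similarity, with gene neighbourhood as extra context, can be illustrated with a minimal sketch. This is not SAFPred's implementation: the function names (`cosine`, `context_vector`, `transfer_function`), the neighbourhood averaging, the `min_sim` threshold, and the toy 3-dimensional vectors are all assumptions made for illustration. See the repository linked under Availability And Implementation for the actual tool.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def context_vector(genome, i, window=1):
    """Average the embeddings of gene i and its neighbours within `window` positions.

    Hypothetical stand-in for using gene order (synteny) as context.
    """
    lo, hi = max(0, i - window), min(len(genome), i + window + 1)
    return mean_vector(genome[lo:hi])

def transfer_function(query, references, min_sim=0.5):
    """Return (label, similarity) of the most similar reference.

    The label is None when the best similarity falls below `min_sim`.
    """
    best_label, best_sim = None, -1.0
    for label, vec in references:
        sim = cosine(query, vec)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < min_sim:
        return None, best_sim
    return best_label, best_sim

# Toy 3-dimensional "embeddings"; real protein language model embeddings
# are far higher-dimensional. All values here are made up.
references = [
    ("toxin", [0.9, 0.1, 0.0]),
    ("transporter", [0.0, 1.0, 0.2]),
]
genome = [[0.8, 0.2, 0.0], [1.0, 0.0, 0.1], [0.7, 0.1, 0.0]]  # hypothetical gene order

label, sim = transfer_function(context_vector(genome, 1), references)
print(label, round(sim, 3))
```

In this toy setup the query gene inherits the label of its nearest reference in embedding space, and a query that resembles no reference closely enough is left unannotated rather than forced into a label.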

Availability And Implementation: https://github.com/AbeelLab/safpred.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11147799
DOI: http://dx.doi.org/10.1093/bioinformatics/btae328

Publication Analysis

Top Keywords

function prediction: 16
gene function: 12
protein embeddings: 12
synteny-aware gene: 8
gene functions: 8
gene: 6
safpred: 5
function: 5
bacteria: 5
protein: 5
