Integrating genetic algorithms and language models for enhanced enzyme design.

Brief Bioinform

IBM Research Europe, Säumerstrasse 4, CH-8803 Rüschlikon, Switzerland.

Published: November 2024

Enzymes are molecular machines optimized by nature to allow otherwise impossible chemical processes to occur. Their design is a challenging task due to the complexity of the protein space and the intricate relationships between sequence, structure, and function. Recently, large language models (LLMs) have emerged as powerful tools for modeling and analyzing biological sequences, but their application to protein design is limited by the high cardinality of the protein space. This study introduces a framework that combines LLMs with genetic algorithms (GAs) to optimize enzymes. LLMs are trained on a large dataset of protein sequences to learn relationships between amino acid residues linked to structure and function. This knowledge is then leveraged by GAs to efficiently search for sequences with improved catalytic performance. We focused on two optimization tasks: improving the feasibility of biochemical reactions and increasing their turnover rate. Systematic evaluations on 105 biocatalytic reactions demonstrated that the LLM-GA framework generated mutants outperforming the wild-type enzymes in terms of feasibility in 90% of the instances. Further in-depth evaluation of seven reactions revealed that this methodology gets "the best of both worlds", creating mutants with structural features and flexibility comparable to those of the wild types. Our approach advances the state of the art in computational design of biocatalysts, ultimately opening opportunities for more sustainable chemical processes.
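The search loop described in the abstract can be illustrated with a minimal genetic algorithm over amino acid sequences. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not specify the operators, population size, or scoring model, so `toy_fitness` (a hypothetical stand-in for an LLM-based feasibility or turnover scorer), single-point crossover, per-residue mutation, elitism, and all parameter values are illustrative choices.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_fitness(seq, target):
    """Hypothetical scorer: fraction of positions matching a target sequence.

    In an LLM-GA setting this would be replaced by a learned model's score.
    """
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def mutate(seq, rate, rng):
    """Replace each residue with a random one with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq
    )


def crossover(a, b, rng):
    """Single-point crossover of two equal-length sequences (length >= 2)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(wild_type, fitness, generations=50, pop_size=40,
           rate=0.05, elite=4, seed=0):
    """Evolve mutants of `wild_type`, returning the best-scoring sequence."""
    rng = random.Random(seed)
    # Start from the wild type plus randomly mutated copies of it.
    population = [wild_type] + [
        mutate(wild_type, rate, rng) for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                   rate, rng)
            for _ in range(pop_size - elite)
        ]
        # Elitism: the top sequences survive unchanged, so the best score
        # never decreases from one generation to the next.
        population = ranked[:elite] + children
    return max(population, key=fitness)


if __name__ == "__main__":
    wild_type = "MKTAYIAKQR"   # made-up sequence for illustration
    target = "MKTWYIAKQL"      # made-up optimum for the toy scorer
    best = evolve(wild_type, lambda s: toy_fitness(s, target))
    print(best, toy_fitness(best, target))
```

Because the wild type is part of the initial population and elitism carries the top sequences forward, the returned mutant scores at least as well as the wild type under whichever fitness function is supplied.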

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11711099
DOI: http://dx.doi.org/10.1093/bib/bbae675
