The use of enzymes for organic synthesis allows for simplified, more economical and selective synthetic routes not accessible to conventional reagents. However, predicting whether a particular molecule might undergo a specific enzyme transformation is very difficult. Here we used multi-task transfer learning to train the molecular transformer, a sequence-to-sequence machine learning model, with one million reactions from the US Patent Office (USPTO) database combined with 32 181 enzymatic transformations annotated with a text description of the enzyme. The resulting enzymatic transformer model predicts the structure and stereochemistry of enzyme-catalyzed reaction products with remarkable accuracy. One of the key novelties is that we combined the reaction SMILES language of only 405 atomic tokens with thousands of human language tokens describing the enzymes, such that our enzymatic transformer not only learned to interpret SMILES, but also the natural language as used by human experts to describe enzymes and their mutations.
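To make the tokenization idea concrete, here is a minimal sketch of how a combined source sequence could be built from reaction SMILES tokens plus free-text enzyme-description tokens for a sequence-to-sequence model. The regex is a commonly used SMILES tokenizer and the separator token and word tokenizer are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch (assumptions, not the published pipeline): combine SMILES tokens
# with natural-language enzyme-description tokens into one seq2seq input sequence.
import re

# Commonly used regex for splitting SMILES into atomic tokens.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a (reaction) SMILES string into atomic tokens."""
    return SMILES_TOKEN_RE.findall(smiles)

def tokenize_text(text: str) -> list[str]:
    """Naive word tokenization for the enzyme description."""
    return re.findall(r"[A-Za-z0-9\-\.]+", text.lower())

def build_source_sequence(reactant_smiles: str, enzyme_description: str) -> list[str]:
    """Concatenate SMILES tokens and enzyme-text tokens into one input sequence."""
    return tokenize_smiles(reactant_smiles) + ["|"] + tokenize_text(enzyme_description)

# Example with a hypothetical substrate and a text-described enzyme.
src = build_source_sequence("CC(=O)Oc1ccccc1C(=O)O", "carboxylesterase from pig liver")
print(src)
```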
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8246114
DOI: http://dx.doi.org/10.1039/d1sc02362d
Biophys J
January 2025
Department of Machine Learning, Moffitt Cancer Center, Tampa, Florida, United States.
In the field of drug discovery, the generation of new molecules with desirable properties remains a critical challenge. Traditional methods often rely on SMILES (Simplified Molecular Input Line Entry System) representations for molecular input data, which can limit the diversity and novelty of generated molecules. To address this, we present the Transformer Graph Variational Autoencoder (TGVAE), an innovative AI model that employs molecular graphs as input data, thus capturing the complex structural relationships within molecules more effectively than string-based models.
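As an illustration of the graph-as-input idea (not the TGVAE code itself), the sketch below converts a SMILES string into the node-feature/edge-list representation a graph-based VAE typically consumes, using RDKit. The specific atom and bond features are assumptions chosen for illustration.

```python
# Illustrative sketch: SMILES -> molecular graph (node features + edges) with RDKit.
from rdkit import Chem
import numpy as np

def mol_to_graph(smiles: str):
    """Return (node_features, edge_index, edge_attr) for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # Node features: atomic number, formal charge, aromaticity flag.
    nodes = np.array(
        [[a.GetAtomicNum(), a.GetFormalCharge(), int(a.GetIsAromatic())]
         for a in mol.GetAtoms()],
        dtype=np.float32,
    )
    # Undirected edges stored in both directions, bond order as edge attribute.
    src, dst, attr = [], [], []
    for b in mol.GetBonds():
        i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
        order = b.GetBondTypeAsDouble()
        src += [i, j]; dst += [j, i]; attr += [order, order]
    edge_index = np.array([src, dst], dtype=np.int64)
    edge_attr = np.array(attr, dtype=np.float32)
    return nodes, edge_index, edge_attr

nodes, edge_index, edge_attr = mol_to_graph("c1ccccc1O")  # phenol
print(nodes.shape, edge_index.shape, edge_attr.shape)
```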
Sci Data
January 2025
Institut Sophia Agrobiotech, INRAE, Université Côte d'Azur, CNRS, 400 route des Chappes, 06903, Sophia-Antipolis, France.
Root-knot nematodes (RKN) of the genus Meloidogyne are obligate plant endoparasites that cause substantial economic losses to agricultural production and impact the global food supply. These plant-parasitic nematodes belong to the most widespread and devastating genus worldwide, yet few control measures are available. The most efficient way to control RKN is the deployment of resistance genes in plants.
J Chem Inf Model
January 2025
Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing, Consiglio Nazionale delle Ricerche, Via G. Amendola, 122/d, Bari 70126, Italy.
The drug discovery process can be significantly accelerated by using deep learning methods to suggest molecules with druglike features and, more importantly, that are good candidates to bind specific proteins of interest. We present a novel deep learning generative model, Prot2Drug, that learns to generate ligands binding specific targets by leveraging (i) the information carried by a pretrained protein language model and (ii) the ability of transformers to capitalize on the knowledge gathered from thousands of protein-ligand interactions. The protein embedding encodes the recipe to follow for designing molecules that bind a given protein, and Prot2Drug translates these instructions into the syntax of the molecular language, generating novel compounds predicted to have favorable physicochemical properties and high affinity toward specific targets.
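The general idea of protein-conditioned ligand generation can be sketched as follows: embed the target with a pretrained protein language model, then let a transformer decoder attend to that embedding while generating SMILES tokens. The checkpoint name, dimensions, and toy decoder below are assumptions for illustration, not the Prot2Drug architecture.

```python
# Hedged sketch of protein-conditioned SMILES generation (illustrative, not Prot2Drug).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# 1) Protein embedding from a small public ESM-2 checkpoint (assumed stand-in
#    for whatever protein language model the actual method uses).
plm_name = "facebook/esm2_t6_8M_UR50D"
plm_tok = AutoTokenizer.from_pretrained(plm_name)
plm = AutoModel.from_pretrained(plm_name)

protein_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
with torch.no_grad():
    out = plm(**plm_tok(protein_seq, return_tensors="pt"))
protein_emb = out.last_hidden_state.mean(dim=1)            # (1, d_plm)

# 2) A minimal SMILES decoder conditioned on the protein embedding.
d_model, smiles_vocab = 256, 64                             # hypothetical sizes
proj = nn.Linear(protein_emb.shape[-1], d_model)            # protein -> decoder space
tok_embed = nn.Embedding(smiles_vocab, d_model)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
lm_head = nn.Linear(d_model, smiles_vocab)

memory = proj(protein_emb).unsqueeze(1)                     # (1, 1, d_model)
smiles_prefix = torch.tensor([[1, 5, 9]])                   # dummy token ids
logits = lm_head(decoder(tok_embed(smiles_prefix), memory))
print(logits.shape)                                         # (1, 3, smiles_vocab)
```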
Curr Opin Struct Biol
January 2025
Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA; Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011, USA.
There is an ever-increasing need for accurate and efficient methods to identify protein homologs. Traditionally, sequence similarity-based methods have dominated protein homolog identification for inferring function, but they struggle when the sequence identity between the pairs is low. Recently, transformer architecture-based deep learning methods have achieved breakthrough performance in many fields.
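In such embedding-based approaches, homolog search typically reduces to nearest-neighbor ranking in embedding space. The sketch below shows that ranking step with placeholder embeddings; in practice the vectors would come from a protein language model such as the ESM-2 example above. Identifiers and dimensions are hypothetical.

```python
# Illustrative sketch of embedding-based homolog search: rank database proteins
# by cosine similarity of fixed-length embeddings (placeholder vectors here).
import numpy as np

rng = np.random.default_rng(0)
query_emb = rng.normal(size=320)                  # placeholder query embedding
db_embs = rng.normal(size=(1000, 320))            # placeholder database embeddings
db_ids = [f"protein_{i}" for i in range(1000)]    # hypothetical identifiers

def cosine_rank(query: np.ndarray, db: np.ndarray, top_k: int = 5):
    """Return indices of the top_k most similar database entries plus all scores."""
    q = query / np.linalg.norm(query)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(-sims)[:top_k], sims

idx, sims = cosine_rank(query_emb, db_embs)
for i in idx:
    print(db_ids[i], round(float(sims[i]), 3))
```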
STAR Protoc
January 2025
Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115 Bonn, Germany; Lamarr Institute for Machine Learning and Artificial Intelligence, Friedrich-Hirzebruch-Allee 5/6, 53115 Bonn, Germany.
Here, we present a protocol to generate dual-target compounds (DT-CPDs) interacting with two distinct target proteins using a transformer-based chemical language model. We describe steps for installing software, preparing data, and pre-training the model on pairs of single-target compounds (ST-CPDs), which bind to an individual protein, and DT-CPDs. We then detail procedures for assembling ST- and corresponding DT-CPD data for specific protein pairs and evaluating the model's performance on hold-out test sets.
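The data-assembly step such a protocol describes can be sketched as pairing single-target compounds for two proteins with known dual-target compounds and reserving a hold-out test set. Field names, toy SMILES, and the pairing logic below are assumptions for illustration, not the published workflow.

```python
# Hedged sketch: assemble (ST-CPD pair -> DT-CPD) examples for one protein pair
# and split off a hold-out test set.
import random
from itertools import product

# Toy records: SMILES annotated with the protein target(s) they bind.
st_cpds = {
    "P_A": ["CCO", "CCN", "c1ccccc1O"],
    "P_B": ["CCOC(=O)C", "CCCl"],
}
dt_cpds = {("P_A", "P_B"): ["CCOc1ccccc1N", "CC(=O)Nc1ccccc1O"]}

def assemble_examples(pair, seed=0, test_fraction=0.25):
    """Build (ST-CPD_1, ST-CPD_2) -> DT-CPD examples and split off a hold-out set."""
    a, b = pair
    examples = [
        {"source": (sa, sb), "target": dt}
        for (sa, sb), dt in product(product(st_cpds[a], st_cpds[b]), dt_cpds[pair])
    ]
    random.Random(seed).shuffle(examples)
    n_test = max(1, int(len(examples) * test_fraction))
    return examples[n_test:], examples[:n_test]   # train, hold-out test

train, test = assemble_examples(("P_A", "P_B"))
print(len(train), "training examples,", len(test), "hold-out examples")
```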