Protein denoising diffusion probabilistic models are used for the de novo generation of protein backbones but are limited in their ability to guide generation of proteins with sequence-specific attributes and functional properties. To overcome this limitation, we developed ProteinGenerator (PG), a sequence space diffusion model based on RoseTTAFold that simultaneously generates protein sequences and structures. Beginning from a noised sequence representation, PG generates sequence and structure pairs by iterative denoising, guided by desired sequence and structural protein attributes.
Directed evolution of proteins has been the most effective method for protein engineering. However, a new paradigm is emerging, fusing the library generation and screening approaches of traditional directed evolution with computation through the training of machine learning models on protein sequence fitness data. This chapter highlights successful applications of machine learning to protein engineering and directed evolution, organized by the improvements that have been made with respect to each step of the directed evolution cycle.
Widespread availability of protein sequence-fitness data would revolutionize both our biochemical understanding of proteins and our ability to engineer them. Unfortunately, even though thousands of protein variants are generated and evaluated for fitness during a typical protein engineering campaign, most are never sequenced, leaving a wealth of potential sequence-fitness information untapped. Primarily, this is because sequencing is unnecessary for many protein engineering strategies; the added cost and effort of sequencing are thus unjustified.
Directed evolution of proteins often involves a greedy optimization in which the mutation in the highest-fitness variant identified in each round of single-site mutagenesis is fixed. The efficiency of such a single-step greedy walk depends on the order in which beneficial mutations are identified; that is, the process is path dependent. Here, we investigate and optimize a path-independent machine learning-assisted directed evolution (MLDE) protocol that allows in silico screening of full combinatorial libraries.
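The path dependence described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-site toy landscape, its fitness values, the reduced four-letter alphabet, and the helper names `greedy_walk` and `exhaustive_best` are hypothetical and are not taken from the article. The exhaustive search stands in for in silico screening of a full combinatorial library; in an actual MLDE protocol the fitness values would come from a trained model rather than a lookup table.

```python
from itertools import product

ALPHABET = "ACDE"  # hypothetical reduced alphabet to keep the example small

# Hypothetical two-site landscape with epistasis: "DD" is the global optimum,
# but neither single mutation toward it ("DA", "AD") improves on the parent "AA".
TOY_LANDSCAPE = {"AA": 1.0, "CA": 2.0, "CC": 2.5, "DA": 0.5, "AD": 0.5, "DD": 5.0}


def toy_fitness(seq):
    """Look up fitness; unlisted variants are treated as non-functional."""
    return TOY_LANDSCAPE.get(seq, 0.0)


def greedy_walk(start, fitness, alphabet=ALPHABET):
    """Single-step greedy walk: in each round, screen all single-site variants
    at positions not yet fixed, then fix the best improving mutation."""
    current = start
    open_positions = set(range(len(start)))
    while open_positions:
        best_variant, best_pos = current, None
        for pos in sorted(open_positions):
            for aa in alphabet:
                variant = current[:pos] + aa + current[pos + 1:]
                if fitness(variant) > fitness(best_variant):
                    best_variant, best_pos = variant, pos
        if best_pos is None:  # no single mutation improves fitness
            break
        current = best_variant
        open_positions.remove(best_pos)
    return current


def exhaustive_best(length, fitness, alphabet=ALPHABET):
    """Score every sequence in the full combinatorial library and return the best."""
    library = ("".join(p) for p in product(alphabet, repeat=length))
    return max(library, key=fitness)


greedy_result = greedy_walk("AA", toy_fitness)
global_result = exhaustive_best(2, toy_fitness)
print(greedy_result, toy_fitness(greedy_result))
print(global_result, toy_fitness(global_result))
```

On this toy landscape the greedy walk fixes the first beneficial mutation it finds and ends at a local optimum, while scoring the full library recovers the higher-fitness double mutant.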
Machine learning (ML) can expedite directed evolution by allowing researchers to move expensive experimental screens in silico. Gathering sequence-function data for training ML models, however, can still be costly. In contrast, raw protein sequence data is widely available.
While biocatalysis is increasingly incorporated into drug development pipelines, it is less commonly used in the early stages of drug discovery. By engineering a protein to produce a chiral motif with a derivatizable functional handle, biocatalysts can be used to help generate diverse building blocks for drug discovery. Here we show the engineering of two variants of nitric oxide dioxygenase (NOD) to catalyze the formation of cis- and trans-diastereomers of a pinacolboronate-substituted cyclopropane which can be readily derivatized to generate diverse stereopure cyclopropane building blocks.
Proc Natl Acad Sci U S A
April 2019
To reduce experimental effort associated with directed protein evolution and to explore the sequence space encoded by mutating multiple positions simultaneously, we incorporate machine learning into the directed evolution workflow. Combinatorial sequence space can be quite expensive to sample experimentally, but machine-learning models trained on tested variants provide a fast method for testing sequence space computationally. We validated this approach on a large published empirical fitness landscape for human GB1 binding protein, demonstrating that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches.
Acaryochloris species are a genus of cyanobacteria that utilize chlorophyll (chl) d as their primary chlorophyll molecule during oxygenic photosynthesis. Chl d allows Acaryochloris to harvest red-shifted light, which gives them the ability to live in filtered light environments that are depleted in visible light. Although genomes of multiple Acaryochloris species have been sequenced, their analysis has not revealed how chl d is synthesized.