ProGen2: Exploring the boundaries of protein language models.

Cell Syst

Salesforce Research, Palo Alto, CA, USA; Profluent Bio, Berkeley, CA, USA.

Published: November 2023

Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.
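The abstract notes that ProGen2 predicts protein fitness "without additional fine-tuning." A common way such zero-shot prediction works with autoregressive protein language models is to score a sequence by its (mean) log-likelihood under the model. The sketch below illustrates that scoring loop; the `next_token_probs` callable is a hypothetical stand-in for a real model's conditional distribution, not ProGen2's actual API.

```python
import math

# Standard 20-letter amino acid alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def score_sequence(seq, next_token_probs):
    """Zero-shot fitness proxy: mean log-likelihood of `seq` under an
    autoregressive model. `next_token_probs(prefix)` is a stand-in for a
    real model call; it returns a dict mapping each amino acid to the
    model's probability that it appears next, given the prefix."""
    total = 0.0
    for i, aa in enumerate(seq):
        probs = next_token_probs(seq[:i])
        total += math.log(probs[aa])
    return total / len(seq)

def uniform_model(prefix):
    # Toy stand-in model: assigns equal probability to every residue.
    # A real model (e.g., ProGen2) would condition on the prefix.
    p = 1.0 / len(AMINO_ACIDS)
    return {aa: p for aa in AMINO_ACIDS}

# Under the uniform toy model, every sequence scores log(1/20).
print(score_sequence("MKT", uniform_model))  # → -2.9957...
```

In practice, candidate variants are ranked by this score: sequences the model assigns higher likelihood tend to correlate with higher measured fitness, which is why no task-specific fine-tuning is needed.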


Source: http://dx.doi.org/10.1016/j.cels.2023.10.002


