More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained at scale on evolutionary data can generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins.
A broad spectrum of metagenomic and single-cell sequencing techniques have become popular for dissecting environmental microbial diversity, leading to the characterization of thousands of novel microbial lineages. In addition to recovering bacterial and archaeal genomes, metagenomic assembly can also produce genomes of viruses that infect microbial cells. Because of their diversity, lack of marker genes, and small genome size, identifying novel bacteriophage sequences from metagenomic data is often challenging, especially when the objective is to establish phage-host relationships.