Model Details

Domain:

Biology

Task:

Proteins

Protein or nucleotide language model pLM

Protein or nucleotide language model nLM

Protein folding prediction

Model Access:

Open weights (unrestricted)

Citations:

636

AI Tools Usage

This model is commonly used behind the scenes in AI tools.

Introduction

"Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins."

Benchmarking

FLOPs7.56e+21

Notes: from xTrimoPGLM paper Table 9 (https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1): 4.4e21 FLOP from the paper's Supplementary Materials: "We trained each model over 512 NVIDIA V100 GPUs. ESM2 700M took 8 days to train. The 3B parameter LM took 30 days. The 15B model took 60 days." 8 days x 512 V100s x an imputed 30% utilization": 1.3e22 FLOP Geometric mean: 7.56e21 FLOP

Training

Training Code AccessibilityMIT weights, CC BY 4.0 data https://github.com/facebookresearch/esm?tab=readme-ov-file#available-esmssd

HardwareNVIDIA V100

Hardware Quantity512

Size Notes: Section A.1.1: "This allowed ESM-2 models to train on over 60M protein sequences." Average protein sequence is 200 tokens, per https://epoch.ai/blog/biological-sequence-models-in-the-context-of-the-ai-directives#fn:4 60M * 200 = 12B tokens Epochs: Used 500k steps at 2M token batch size 500k * 2M / 12B = 83.3

Parameters

Parameters650000000

Notes: In the name

Authors

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, Alexander Rives

Related ModelsView all models