Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
Notes:
Based on: https://docs.google.com/spreadsheets/d/1enan21dFx03TkwufHgOwTVNBtuYlqNY9uurjIK6YS-8/edit#gid=0
Number of steps: 4.5e5; batch size (tokens): 6.1e7; parameters: 1e8
Batch size: 512 MSAs; sequence length per MSA: 100 * 1192 tokens
Total tokens: 4.5e5 steps * 6.1e7 tokens/step ≈ 2.75e13
Calculation = 4e8 FLOP/token (backward pass) * 2.75e13 tokens + 2e8 FLOP/token (forward pass) * 2.75e13 tokens ≈ 1.65e22 FLOP
All models are trained on 32 V100 GPUs for 100k updates. The four models with the best contact precision are then further trained to 150k updates. Finally, the best model at 150k updates is trained to 450k updates.
Check: 450k * 512 * 100 * 1192 * 100M * 6 ≈ 1.65e22
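A minimal Python sketch of the arithmetic above, assuming the standard ~6 FLOP per parameter per token approximation (2 FLOP/param/token forward, 4 FLOP/param/token backward) and using the figures listed in these notes; the variable names are illustrative, not from the paper or the spreadsheet.

```python
# Back-of-envelope training-compute estimate for the MSA Transformer.

params = 1e8                   # 100M parameters
steps = 4.5e5                  # 450k updates (final training run)
batch_msas = 512               # MSAs per batch
tokens_per_msa = 100 * 1192    # tokens per MSA input, per the notes' "100 * 1192" figure

tokens_per_step = batch_msas * tokens_per_msa   # ~6.1e7 tokens per update
total_tokens = steps * tokens_per_step          # ~2.75e13 tokens seen during training

forward_flop = 2 * params * total_tokens        # ~5.5e21 FLOP
backward_flop = 4 * params * total_tokens       # ~1.1e22 FLOP
total_flop = forward_flop + backward_flop       # ~1.65e22 FLOP

print(f"total tokens: {total_tokens:.3g}")
print(f"total training FLOP: {total_flop:.3g}")
```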
Size Notes:
"We train an MSA Transformer model with 100M parameters on a large dataset (4.3 TB) of 26 million MSAs, with an average of 1192 sequences per MSA."
Average sequence is ~300 amino acids/tokens long.
26 million MSAs * 1192 sequences/MSA * 300 tokens/sequence ≈ 9.3T tokens
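A similarly minimal sketch of the dataset-size estimate; the 26M MSAs and 1192 sequences/MSA figures are quoted from the paper, while the ~300 tokens per sequence average is the assumption stated above.

```python
# Rough dataset-size estimate from the figures quoted in the size notes.

num_msas = 26e6          # 26 million MSAs
seqs_per_msa = 1192      # average sequences per MSA (from the paper)
tokens_per_seq = 300     # assumed average sequence length in residues/tokens

total_tokens = num_msas * seqs_per_msa * tokens_per_seq
print(f"~{total_tokens:.2g} tokens")   # ~9.3e12, i.e. ~9.3T tokens
```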
Notes: "We train an MSA Transformer model with 100M parameters..."