"Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins."
Notes:
- xTrimoPGLM paper, Table 9 (https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1): 5.1e22 FLOP
- Arb Research (https://arbresearch.com/files/gen_bio.pdf): "ESM-2-15B: 270000 updates x 3.2M batch size x 15B 'connections' x 6" = 7.8e22 FLOP
- From the paper's Supplementary Materials: "We trained each model over 512 NVIDIA V100 GPUs. ESM2 700M took 8 days to train. The 3B parameter LM took 30 days. The 15B model took 60 days." 60 days x 512 V100s x an imputed 30% utilization = 1e23 FLOP
- Geometric mean of the three estimates: 7.35e22 FLOP
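A minimal sketch of the arithmetic behind the three estimates and their geometric mean. The 125 TFLOP/s per-V100 peak (FP16 tensor) and the 30% utilization used in the hardware-time estimate are assumptions backing out the ~1e23 FLOP figure, not numbers stated in the paper.

```python
from statistics import geometric_mean

# Estimate 1: reported directly in xTrimoPGLM Table 9.
xtrimo_flop = 5.1e22

# Estimate 2: Arb Research, FLOP = 6 * params * tokens, tokens = updates * batch size.
params = 15e9
updates = 270_000
batch_tokens = 3.2e6
arb_flop = 6 * params * updates * batch_tokens               # ~7.8e22

# Estimate 3: hardware-time method from the Supplementary Materials' 60-day figure.
seconds = 60 * 24 * 3600
n_gpus = 512
v100_peak_flops = 125e12     # assumed FP16 tensor-core peak per V100
utilization = 0.30           # imputed, not reported
hw_flop = seconds * n_gpus * v100_peak_flops * utilization   # ~1e23

print(f"Arb estimate:   {arb_flop:.2e} FLOP")
print(f"HW-time estimate: {hw_flop:.2e} FLOP")
print(f"Geometric mean: {geometric_mean([xtrimo_flop, arb_flop, hw_flop]):.2e} FLOP")
```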
Size Notes:
- Section A.1.1: "This allowed ESM-2 models to train on over 60M protein sequences."
- Average protein sequence is ~200 tokens, per https://epoch.ai/blog/biological-sequence-models-in-the-context-of-the-ai-directives#fn:4
- Dataset size: 60M * 200 = 12B tokens
- Epochs: the 15B model used 270k steps at a 3.2M-token batch size, so 270k * 3.2M / 12B = 72 epochs
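A minimal sketch of the dataset-size and epoch arithmetic above; the 200-token average sequence length is the assumption taken from the linked Epoch AI post rather than from the paper itself.

```python
# Dataset size in tokens.
n_sequences = 60e6          # "over 60M protein sequences" (Section A.1.1)
avg_tokens_per_seq = 200    # assumed average protein length in tokens
dataset_tokens = n_sequences * avg_tokens_per_seq    # 1.2e10 = 12B tokens

# Epochs seen by the 15B model.
updates = 270_000           # training steps
batch_tokens = 3.2e6        # tokens per batch
tokens_seen = updates * batch_tokens                 # 8.64e11 tokens
epochs = tokens_seen / dataset_tokens                # ~72

print(f"dataset: {dataset_tokens:.2e} tokens, epochs: {epochs:.0f}")
```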
Notes: "we train models up to 15B parameters"