Public protein sequence databases contain samples from the fitness landscape explored by nature. Protein language models (pLMs) pre-trained on these sequences aim to capture this landscape for tasks like property prediction and protein design. Following the same trend as in natural language processing, pLMs have continuously been scaled up. However, the premise that scale leads to better performance assumes that source databases provide an accurate representation of the underlying fitness landscape, which is likely false. By developing an efficient codebase, designing a modern architecture, and addressing data quality concerns such as sample bias, we introduce AMPLIFY, a best-in-class pLM that is orders of magnitude less expensive to train and deploy than previous models. Furthermore, to support the scientific community and democratize the training of pLMs, we have open-sourced AMPLIFY’s pre-training codebase, data, and model checkpoints.
Notes:
1. Hardware: A100 GPUs with a peak of 3.12×10¹⁴ FLOP/s per GPU (bf16/fp16)
2. Duration: directly provided as 1,014 GPU-days ≈ 8.76×10⁷ GPU-seconds
3. Utilization: 40%
4. Calculation: 3.12×10¹⁴ FLOP/s × 8.76×10⁷ s × 0.40 ≈ 1.09×10²² FLOP
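A minimal Python sketch of the same hardware-based compute estimate; the peak throughput, duration, and 40% utilization are taken directly from the notes above, with utilization being the stated assumption rather than a measured value.

```python
# Hardware-based estimate of total training compute,
# reproducing the arithmetic in the notes above.

A100_PEAK_FLOPS = 3.12e14   # peak bf16/fp16 throughput per A100 GPU (FLOP/s)
GPU_DAYS = 1_014            # reported training duration in GPU-days
UTILIZATION = 0.40          # assumed fraction of peak throughput achieved

gpu_seconds = GPU_DAYS * 86_400                       # 1,014 days x 86,400 s/day ~ 8.76e7 s
total_flop = A100_PEAK_FLOPS * gpu_seconds * UTILIZATION

print(f"GPU-seconds: {gpu_seconds:.3e}")   # ~8.761e+07
print(f"Total FLOP:  {total_flop:.3e}")    # ~1.093e+22
```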
Size Notes:
Dataset: 391,000,000 sequences
Sequence length: 512 tokens/sequence
Total tokens = 391,000,000 × 512 ≈ 2.00×10¹¹ tokens
Sequences processed = 1,000,000 steps × 4,096 sequences/batch = 4,096,000,000 sequences
Number of epochs = 4,096,000,000 / 391,000,000 ≈ 10.47
Final result: 2.00×10¹¹ tokens
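A short Python sketch of the dataset-size and epoch arithmetic; all constants come from the size notes above, and the 512 tokens/sequence length is the stated per-sequence assumption.

```python
# Dataset-size and epoch arithmetic from the size notes above.

NUM_SEQUENCES = 391_000_000   # sequences in the pre-training set
SEQ_LENGTH = 512              # tokens per sequence (assumed)
TRAIN_STEPS = 1_000_000       # optimizer steps
BATCH_SIZE = 4_096            # sequences per step

dataset_tokens = NUM_SEQUENCES * SEQ_LENGTH   # unique tokens in the dataset
sequences_seen = TRAIN_STEPS * BATCH_SIZE     # sequences processed during training
epochs = sequences_seen / NUM_SEQUENCES       # passes over the dataset

print(f"Dataset tokens:      {dataset_tokens:.2e}")  # ~2.00e+11
print(f"Sequences processed: {sequences_seen:.3e}")  # ~4.096e+09
print(f"Epochs:              {epochs:.2f}")          # ~10.47
```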