Public protein sequence databases contain samples from the fitness landscape explored by nature. Protein language models (pLMs) pre-trained on these sequences aim to capture this landscape for tasks like property prediction and protein design. Following the same trend as in natural language processing, pLMs have continuously been scaled up. However, the premise that scale leads to better performance assumes that source databases provide an accurate representation of the underlying fitness landscape, which is likely false. By developing an efficient codebase, designing a modern architecture, and addressing data quality concerns such as sample bias, we introduce AMPLIFY, a best-in-class pLM that is orders of magnitude less expensive to train and deploy than previous models. Furthermore, to support the scientific community and democratize the training of pLMs, we have open-sourced AMPLIFY’s pre-training codebase, data, and model checkpoints.
Notes:
1. Hardware: A100 GPUs with a peak of 3.12×10¹⁴ FLOP/s per GPU (bf16/fp16)
2. Duration: directly provided as 1,014 GPU-days ≈ 8.76×10⁷ GPU-seconds
3. Utilization: 40%
4. Calculation: 3.12×10¹⁴ FLOP/s × 8.76×10⁷ s × 0.40 ≈ 1.09×10²² FLOP
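A minimal Python sketch of the same hardware-based compute estimate; the peak throughput, duration, and 40% utilization are taken directly from the notes above, with utilization being the stated assumption rather than a measured value.

```python
# Hardware-based estimate of total training compute,
# reproducing the arithmetic in the notes above.

A100_PEAK_FLOPS = 3.12e14   # peak bf16/fp16 throughput per A100 GPU (FLOP/s)
GPU_DAYS = 1_014            # reported training duration in GPU-days
UTILIZATION = 0.40          # assumed fraction of peak throughput achieved

gpu_seconds = GPU_DAYS * 86_400                       # 1,014 days x 86,400 s/day ~ 8.76e7 s
total_flop = A100_PEAK_FLOPS * gpu_seconds * UTILIZATION

print(f"GPU-seconds: {gpu_seconds:.3e}")   # ~8.761e+07
print(f"Total FLOP:  {total_flop:.3e}")    # ~1.093e+22
```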
Size Notes:
Dataset: 391,000,000 sequences
Sequence length: 512 tokens/sequence
Total tokens = 391,000,000 × 512 ≈ 2.00×10¹¹ tokens
Sequences processed = 1,000,000 steps × 4,096 sequences/batch = 4,096,000,000 sequences
Number of epochs = 4,096,000,000 / 391,000,000 ≈ 10.47
Final result: 2.00×10¹¹ tokens
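A short Python sketch of the dataset-size and epoch arithmetic; all constants come from the size notes above, and the 512 tokens/sequence length is the stated per-sequence assumption.

```python
# Dataset-size and epoch arithmetic from the size notes above.

NUM_SEQUENCES = 391_000_000   # sequences in the pre-training set
SEQ_LENGTH = 512              # tokens per sequence (assumed)
TRAIN_STEPS = 1_000_000       # optimizer steps
BATCH_SIZE = 4_096            # sequences per step

dataset_tokens = NUM_SEQUENCES * SEQ_LENGTH   # unique tokens in the dataset
sequences_seen = TRAIN_STEPS * BATCH_SIZE     # sequences processed during training
epochs = sequences_seen / NUM_SEQUENCES       # passes over the dataset

print(f"Dataset tokens:      {dataset_tokens:.2e}")  # ~2.00e+11
print(f"Sequences processed: {sequences_seen:.3e}")  # ~4.096e+09
print(f"Epochs:              {epochs:.2f}")          # ~10.47
```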