Deep learning models that predict functional genomic measurements from DNA sequence are powerful tools for deciphering the genetic regulatory code. Existing methods trade off between input sequence length and prediction resolution, thereby limiting their modality scope and performance. We present AlphaGenome, which takes as input 1 megabase of DNA sequence and predicts thousands of functional genomic tracks up to single base pair resolution across diverse modalities – including gene expression, transcription initiation, chromatin accessibility, histone modifications, transcription factor binding, chromatin contact maps, splice site usage, and splice junction coordinates and strength. Trained on human and mouse genomes, AlphaGenome matches or exceeds the strongest respective available external models on 24 out of 26 evaluations on variant effect prediction. AlphaGenome’s ability to simultaneously score variant effects across all modalities accurately recapitulates the mechanisms of clinically-relevant variants near the TAL1 oncogene. To facilitate broader use, we provide tools for making genome track and variant effect predictions from sequence.

Get API key | Quick start | Installation | Documentation | Community | Terms of Use
The AlphaGenome API provides access to AlphaGenome, Google DeepMind’s unifying model for deciphering the regulatory code within DNA sequences. This repository contains client-side code, examples and documentation to help you use the AlphaGenome API.
AlphaGenome offers multimodal predictions, encompassing diverse functional outputs such as gene expression, splicing patterns, chromatin features, and contact maps (see diagram below). The model analyzes DNA sequences of up to 1 million base pairs in length and can deliver predictions at single base-pair resolution for most outputs. AlphaGenome achieves state-of-the-art performance across a range of genomic prediction benchmarks, including numerous diverse variant effect prediction tasks (detailed in Avsec et al. 2025).
The API is offered free of charge for non-commercial use (subject to the terms of use). Query rates vary based on demand. The API is well suited to small and medium-scale analyses – such as analysing a limited number of genomic regions or variants requiring thousands of predictions – but is likely not suitable for large-scale analyses requiring more than 1 million predictions. Once you obtain your API key, you can easily get started by following our Quick Start Guide, or watching our AlphaGenome 101 tutorial.

The documentation also covers a set of comprehensive tutorials, variant scoring strategies to efficiently score variant effects, and a visualization library to generate matplotlib figures for the different output modalities.
We cover additional details of the capabilities and limitations in our documentation. For support and feedback:
The quickest way to get started is to run our example notebooks in Google Colab. Here are some starter notebooks:
Alternatively, you can dive straight in by following the installation guide and start writing code! Here's an example of making a variant prediction:
```python
from alphagenome.data import genome
from alphagenome.models import dna_client
from alphagenome.visualization import plot_components
import matplotlib.pyplot as plt

API_KEY = 'MyAPIKey'
model = dna_client.create(API_KEY)

# Define the genomic interval of interest and the variant to score.
interval = genome.Interval(chromosome='chr22', start=35677410, end=36725986)
variant = genome.Variant(
    chromosome='chr22',
    position=36201698,
    reference_bases='A',
    alternate_bases='C',
)

# Predict RNA-seq tracks for both the reference and alternate alleles.
outputs = model.predict_variant(
    interval=interval,
    variant=variant,
    ontology_terms=['UBERON:0001157'],
    requested_outputs=[dna_client.OutputType.RNA_SEQ],
)

# Overlay the REF and ALT RNA-seq tracks in a single figure.
plot_components.plot(
    [
        plot_components.OverlaidTracks(
            tdata={
                'REF': outputs.reference.rna_seq,
                'ALT': outputs.alternate.rna_seq,
            },
            colors={'REF': 'dimgrey', 'ALT': 'red'},
        ),
    ],
    interval=outputs.reference.rna_seq.interval.resize(2**15),
    # Annotate the location of the variant as a vertical line.
    annotations=[plot_components.VariantAnnotation([variant], alpha=0.8)],
)
plt.show()
```
> [!TIP]
> You may optionally wish to create a Python Virtual Environment to prevent conflicts with your system's Python environment.
To install alphagenome, clone a local copy of the repository and run pip install:
```sh
$ git clone https://github.com/google-deepmind/alphagenome.git
$ pip install ./alphagenome
```
See the documentation for information on alternative installation strategies.
If you use AlphaGenome in your research, please cite using:
```bibtex
@article{alphagenome,
  title={{AlphaGenome}: advancing regulatory variant effect prediction with a unified {DNA} sequence model},
  author={Avsec, {\v Z}iga and Latysheva, Natasha and Cheng, Jun and Novati, Guido and Taylor, Kyle R. and Ward, Tom and Bycroft, Clare and Nicolaisen, Lauren and Arvaniti, Eirini and Pan, Joshua and Thomas, Raina and Dutordoir, Vincent and Perino, Matteo and De, Soham and Karollus, Alexander and Gayoso, Adam and Sargeant, Toby and Mottram, Anne and Wong, Lai Hong and Drot{\'a}r, Pavol and Kosiorek, Adam and Senior, Andrew and Tanburn, Richard and Applebaum, Taylor and Basu, Souradeep and Hassabis, Demis and Kohli, Pushmeet},
  year={2025},
  doi={https://doi.org/10.1101/2025.06.25.661532},
  publisher={Cold Spring Harbor Laboratory},
  journal={bioRxiv}
}
```
AlphaGenome communicates with and/or references the following separate libraries and packages:
We thank all their contributors and maintainers!
Copyright 2024 Google LLC
All software in this repository is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0. For the avoidance of doubt, as noted above, the API is offered free of charge for non-commercial use subject to the Terms of Use.
Examples and documentation to help you use the AlphaGenome API are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode.
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an official Google product.
Your use of any third-party software, libraries or code referenced in the materials in this repository (including the libraries listed in the Acknowledgments section) may be governed by separate terms and conditions or license provisions. Your use of the third-party software, libraries or code is subject to any such terms and you should check that you can comply with any applicable restrictions or terms and conditions before use.
A modified version of the GENCODE dataset (which can be found here: https://www.gencodegenes.org/human/releases.html) is released with the client code package for illustrative purposes, and is available with reference to the following:
Notes on training compute:

Pre-training: "Each gradient step processed a batch size of 64 samples using 8-way sequence parallelism, requiring 512 TPUv3 cores, with pre-training runs typically completing in approximately 4 hours." "Distillation using many teacher models (e.g., 64; orange crosses)"

1.23e14 FLOP/s per TPUv3 chip × (512 cores / 2 cores per chip) × 4 hours × 3600 s/hour × 0.3 (assumed utilization) × 64 training runs ≈ 8.7058022e+21 FLOP ("Likely" confidence)

Distillation: "Distillation training was performed without sequence parallelism across 64 NVIDIA H100 GPUs, with a batch size of 64 (effectively one sample per GPU). Each GPU loaded a different frozen teacher model from the pool of 64 pretrained all-folds models. <..> The distillation process ran for 250,000 steps, taking approximately 3 days"

9.894e14 FLOP/s per GPU × 64 GPUs × 3 days × 24 hours/day × 3600 s/hour × 0.3 (assumed utilization) ≈ 4.9238876e+21 FLOP

Total: 8.7058022e+21 FLOP + 4.9238876e+21 FLOP ≈ 1.362969e+22 FLOP. "Likely" confidence, because the exact number of pre-training runs is uncertain.
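The arithmetic above can be double-checked with a short script. The peak FLOP/s figures, the 0.3 utilization, the 64 pre-training runs, and the 2-cores-per-chip TPUv3 conversion are the assumptions stated in the notes, not independently verified numbers:

```python
# Sanity-check of the training-compute estimate quoted above.
TPU_V3_PEAK = 123e12   # assumed peak FLOP/s per TPUv3 chip
H100_PEAK = 989.4e12   # assumed peak FLOP/s per H100 GPU
UTILIZATION = 0.3      # assumed hardware utilization

# 512 cores / 2 cores per chip, 4 hours, 64 pre-training runs.
pretrain_flop = TPU_V3_PEAK * (512 / 2) * 4 * 3600 * UTILIZATION * 64
# 64 H100 GPUs for 3 days.
distill_flop = H100_PEAK * 64 * 3 * 24 * 3600 * UTILIZATION
total_flop = pretrain_flop + distill_flop

print(f'pre-training: {pretrain_flop:.4e} FLOP')  # 8.7058e+21
print(f'distillation: {distill_flop:.4e} FLOP')   # 4.9239e+21
print(f'total:        {total_flop:.4e} FLOP')     # 1.3630e+22
```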
Notes on model size: "AlphaGenome has approximately 450 million trainable parameters (20% in the encoder, 28% in the sequence transformer, 15% in the pairwise blocks, 25% in the decoder, and 12% in the output embedding and prediction heads)"
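The quoted percentages sum to 100%, so the approximate per-component parameter counts can be derived directly; the figures below are illustrative back-of-envelope values from the 450M total, not exact layer-by-layer counts:

```python
# Approximate parameter counts per component, from the quoted percentages.
TOTAL_PARAMS = 450e6
split = {
    'encoder': 0.20,
    'sequence transformer': 0.28,
    'pairwise blocks': 0.15,
    'decoder': 0.25,
    'output embedding + prediction heads': 0.12,
}
# The percentages should account for the whole model.
assert abs(sum(split.values()) - 1.0) < 1e-9

for name, frac in split.items():
    print(f'{name}: ~{TOTAL_PARAMS * frac / 1e6:.1f}M parameters')
# e.g. encoder: ~90.0M parameters
```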