Deep learning models that predict functional genomic measurements from DNA sequence are powerful tools for deciphering the genetic regulatory code. Existing methods trade off between input sequence length and prediction resolution, thereby limiting their modality scope and performance. We present AlphaGenome, which takes as input 1 megabase of DNA sequence and predicts thousands of functional genomic tracks up to single base pair resolution across diverse modalities – including gene expression, transcription initiation, chromatin accessibility, histone modifications, transcription factor binding, chromatin contact maps, splice site usage, and splice junction coordinates and strength. Trained on human and mouse genomes, AlphaGenome matches or exceeds the strongest respective available external models on 24 out of 26 evaluations on variant effect prediction. AlphaGenome’s ability to simultaneously score variant effects across all modalities accurately recapitulates the mechanisms of clinically-relevant variants near the TAL1 oncogene. To facilitate broader use, we provide tools for making genome track and variant effect predictions from sequence.
FLOPs: 1.36e+22
Notes:
Pre-training: "Each gradient step processed a batch size of 64 samples using 8-way sequence parallelism, requiring 512 TPUv3 cores, with pre-training runs typically completing in approximately 4 hours." "Distillation using many teacher models (e.g., 64; orange crosses)"
123000000000000 FLOP / TPUv3 chip / sec * (512 TPUv3 cores / 2 cores per chip = 256 chips) * 4 hours * 3600 sec / hour * 0.3 [assumed utilization] * 64 training runs ["Likely" confidence] = 8.7058022e+21 FLOP
Distillation: "Distillation training was performed without sequence parallelism across 64 NVIDIA H100 GPUs, with a batch size of 64 (effectively one sample per GPU). Each GPU loaded a different frozen teacher model from the pool of 64 pretrained all-folds models. <..> The distillation process ran for 250,000 steps, taking approximately 3 days"
989400000000000 FLOP / GPU / sec * 64 GPUs * 3 days * 24 hours / day * 3600 sec / hour * 0.3 [assumed utilization] = 4.9238876e+21 FLOP
Total: 8.7058022e+21 FLOP + 4.9238876e+21 FLOP = 1.362969e+22 FLOP
"Likely" confidence, because I am not sure how many pre-training runs there were.
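A minimal sketch of the arithmetic above, reproducing the pre-training, distillation, and total FLOP figures. The peak throughput values and the 0.3 utilization factor are the note's stated assumptions, not measured numbers.

```python
# Reproduce the FLOP estimate from the note (all inputs are the note's assumptions).

TPU_V3_PEAK = 123e12      # FLOP/s per TPUv3 chip (assumed peak)
H100_PEAK = 989.4e12      # FLOP/s per H100 SXM5 GPU (assumed peak)
UTILIZATION = 0.3         # assumed hardware utilization

# Pre-training: 512 TPUv3 cores = 256 chips (2 cores per chip), ~4 hours per run, 64 runs assumed
pretraining = TPU_V3_PEAK * (512 / 2) * 4 * 3600 * UTILIZATION * 64

# Distillation: 64 H100 GPUs running for ~3 days
distillation = H100_PEAK * 64 * 3 * 24 * 3600 * UTILIZATION

total = pretraining + distillation
print(f"pre-training: {pretraining:.4e} FLOP")   # ~8.7058e+21
print(f"distillation: {distillation:.4e} FLOP")  # ~4.9239e+21
print(f"total:        {total:.4e} FLOP")         # ~1.3630e+22
```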
Training Code Accessibility: "To advance scientific research, we’re making AlphaGenome available in preview via our AlphaGenome API for non-commercial research, and planning to release the model in the future."
Hardware: Google TPU v3, NVIDIA H100 SXM5 80GB
Parameters: 450,000,000
Notes: " AlphaGenome has approximately 450 million trainable parameters (20% in the encoder, 28% in the sequence transformer, 15% in the pairwise blocks, 25% in the decoder, and 12% in the output embedding and prediction heads)"