We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at this https URL.
Notes: 6 FLOP / token / parameter * 45.9 * 10^9 activated parameters * 7.2 * 10^12 tokens = 1.98288e+24 FLOP. "Likely" confidence because the model is MoE (so the formula might not be that accurate) and because I am not fully sure about the dataset size.
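The arithmetic in this note can be reproduced with a short Python sketch. The function and constant names are illustrative only. The 6 FLOP per token per parameter factor is the approximation used in the note, applied here to activated parameters rather than total parameters, so for an MoE model the result is a rough estimate.

```python
# Rough training-compute estimate using the approximation from the note:
# FLOP ~= 6 * (activated parameters) * (training tokens).
# All names below are illustrative, not taken from any library.

FLOP_PER_TOKEN_PER_PARAM = 6   # approximation used in the note
ACTIVATED_PARAMS = 45.9e9      # activated parameters per token
TRAINING_TOKENS = 7.2e12       # dataset size from the size note (uncertain)


def training_flop(params: float, tokens: float,
                  flop_per_token_per_param: float = FLOP_PER_TOKEN_PER_PARAM) -> float:
    """Return the estimated total training FLOP."""
    return flop_per_token_per_param * params * tokens


estimate = training_flop(ACTIVATED_PARAMS, TRAINING_TOKENS)
print(f"Estimated training compute: {estimate:.5e} FLOP")
```

Because the token count is uncertain, the estimate scales linearly with whatever dataset size is eventually confirmed.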
Size Notes: 7.2T tokens
Notes:
Total parameters: 456B
Activated parameters per token: 45.9B
Number of layers: 80
Hybrid attention: one softmax attention layer is positioned after every 7 lightning attention layers
Number of attention heads: 64
Attention head dimension: 128
Mixture of Experts: 32 experts, expert hidden dimension 9216, top-2 routing strategy
Positional encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension, with a base frequency of 10,000,000
Hidden size: 6144
Vocab size: 200,064
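As a rough sanity check, the listed hyperparameters can be combined to see whether they are consistent with the reported 456B total and 45.9B activated parameters. The sketch below rests on several assumptions that are not stated in the notes: each expert is a gated feed-forward block with three weight matrices, each attention layer has only Q, K, V, and output projections, input and output embeddings are untied, and router, normalization, and any extra lightning attention gating weights are ignored. Under those assumptions the counts come out to roughly 453B total and 45.7B activated, within about 1% of the reported figures.

```python
# Back-of-the-envelope parameter count from the hyperparameters in the notes.
# Everything marked ASSUMPTION is a guess for illustration, not a confirmed
# detail of the model.

HIDDEN = 6144
LAYERS = 80
HEADS = 64
HEAD_DIM = 128
EXPERTS = 32
EXPERT_HIDDEN = 9216
TOP_K = 2                # top-2 routing
VOCAB = 200_064
BLOCK = 8                # 7 lightning attention layers followed by 1 softmax layer

# ASSUMPTION: gated FFN expert with three matrices (gate, up, down).
expert_params = 3 * HIDDEN * EXPERT_HIDDEN
# ASSUMPTION: Q, K, V and output projections only, no biases or extra gates.
attn_params = 4 * HIDDEN * HEADS * HEAD_DIM
# ASSUMPTION: untied input and output embeddings.
embed_params = 2 * VOCAB * HIDDEN

total = LAYERS * (EXPERTS * expert_params + attn_params) + embed_params
active = LAYERS * (TOP_K * expert_params + attn_params) + embed_params

# Layer layout: every 8th layer uses softmax attention, the rest lightning.
layer_types = ["softmax" if (i + 1) % BLOCK == 0 else "lightning"
               for i in range(LAYERS)]

print(f"total parameters  ~ {total / 1e9:.1f}B (reported: 456B)")
print(f"active parameters ~ {active / 1e9:.1f}B (reported: 45.9B)")
print(f"softmax layers: {layer_types.count('softmax')}, "
      f"lightning layers: {layer_types.count('lightning')}")
```

The small shortfall relative to the reported numbers is expected given the components the sketch leaves out.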