We believe that Context Length Scaling and Total Parameter Scaling are two major trends in the future of large models. To further improve training and inference efficiency under long-context and large-parameter settings, we design a brand-new model architecture called Qwen3-Next. Compared to the MoE structure of Qwen3, Qwen3-Next introduces several key improvements: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference.

Based on this new architecture, we train the Qwen3-Next-80B-A3B-Base model: an 80-billion-parameter model that activates only 3 billion parameters during inference. This base model achieves performance comparable to (or even slightly better than) the dense Qwen3-32B model, while using less than 10% of its training cost (GPU hours). In inference, especially with context lengths over 32K tokens, it delivers more than 10x higher throughput, achieving extreme efficiency in both training and inference.

We develop and release two post-trained versions based on Qwen3-Next-80B-A3B-Base: Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking. We also solve the long-standing stability and efficiency issues in reinforcement learning (RL) training caused by the hybrid attention + high-sparsity MoE architecture, improving both RL training speed and final performance.
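The "highly sparse MoE" above means each token is routed to only a few experts, which is why only ~3B of the 80B parameters are activated per token. Below is a minimal, generic top-k routing sketch in PyTorch to illustrate the idea; the layer sizes, expert count, and routing details are illustrative placeholders, not the actual Qwen3-Next configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-k expert routing: only k of E expert MLPs run per token,
    so the activated parameter count is a small fraction of the total."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        scores = self.router(x)                           # [tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)    # each token picks k experts
        weights = F.softmax(weights, dim=-1)              # normalize over the selected k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

# Example: 64 experts with top-2 routing -> roughly 2/64 of the expert parameters
# are activated per token (placeholder numbers, not Qwen3-Next's).
layer = SparseMoELayer(d_model=256, d_ff=1024, num_experts=64, top_k=2)
tokens = torch.randn(8, 256)
print(layer(tokens).shape)  # torch.Size([8, 256])
```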
Notes: 6 FLOP/parameter/token * 3e9 active parameters * 15e12 pre-training tokens = 2.7e+23 FLOP. From the announcement: "It uses less than 80% of the GPU hours needed by Qwen3-30B-A3B, and only 9.3% of the compute cost of Qwen3-32B — while achieving better performance." (For comparison, the same estimate gives 6.48e+23 FLOP for Qwen3-30B-A3B and 7.0848e+24 FLOP for Qwen3-32B.)
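The estimate follows the standard 6·N·D approximation (6 FLOP per active parameter per training token). The sketch below reproduces the figures quoted in these notes; the 32.8B parameter count for Qwen3-32B and the 36T-token counts for the comparison models are the assumptions implied by the quoted numbers.

```python
# Sketch: reproducing the 6*N*D training-compute estimates quoted above.
# Assumed inputs: 3e9 active params / 15e12 tokens for Qwen3-Next,
# 3e9 active params / 36e12 tokens for Qwen3-30B-A3B, and
# 32.8e9 params / 36e12 tokens for Qwen3-32B (the values that reproduce 7.0848e+24).

FLOP_PER_PARAM_PER_TOKEN = 6  # forward + backward approximation

def train_compute(active_params: float, tokens: float) -> float:
    """Approximate pretraining compute in FLOP: 6 * N_active * D_tokens."""
    return FLOP_PER_PARAM_PER_TOKEN * active_params * tokens

qwen3_next    = train_compute(3e9, 15e12)     # 2.7e+23 FLOP
qwen3_30b_a3b = train_compute(3e9, 36e12)     # 6.48e+23 FLOP
qwen3_32b     = train_compute(32.8e9, 36e12)  # 7.0848e+24 FLOP

print(f"Qwen3-Next-80B-A3B : {qwen3_next:.4e} FLOP")
print(f"Qwen3-30B-A3B      : {qwen3_30b_a3b:.4e} FLOP")
print(f"Qwen3-32B          : {qwen3_32b:.4e} FLOP")

# Note: the quoted "9.3%" refers to GPU-hour cost, not this 6ND FLOP ratio
# (2.7e23 / 7.0848e24 is roughly 3.8%).
```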
Size Notes: Training stages: pretraining (15T tokens) and post-training. "Qwen3-Next is trained on a uniformly sampled subset (15T tokens) of Qwen3's 36T-token pretraining corpus."
Notes: 80B total parameters, 3B activated (A3B).