We believe that Context Length Scaling and Total Parameter Scaling are two major trends in the future of large models. To further improve training and inference efficiency under long-context and large-parameter settings, we design a brand-new model architecture called Qwen3-Next. Compared to the MoE structure of Qwen3, Qwen3-Next introduces several key improvements: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference.

Based on this new architecture, we train the Qwen3-Next-80B-A3B-Base model: an 80-billion-parameter model that activates only 3 billion parameters during inference. This base model achieves performance comparable to (or even slightly better than) the dense Qwen3-32B model, while using less than 10% of its training cost (GPU hours). In inference, especially with context lengths over 32K tokens, it delivers more than 10x higher throughput, achieving extreme efficiency in both training and inference.

We develop and release two post-trained versions based on Qwen3-Next-80B-A3B-Base: Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking. We also solve the long-standing stability and efficiency issues in reinforcement learning (RL) training caused by the hybrid attention + high-sparsity MoE architecture, improving both RL training speed and final performance.
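The "highly sparse MoE" above means each token is routed to only a few experts, which is why only ~3B of the 80B parameters are activated per token. Below is a minimal, generic top-k routing sketch in PyTorch to illustrate the idea; the layer sizes, expert count, and routing details are illustrative placeholders, not the actual Qwen3-Next configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-k expert routing: only k of E expert MLPs run per token,
    so the activated parameter count is a small fraction of the total."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        scores = self.router(x)                           # [tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)    # each token picks k experts
        weights = F.softmax(weights, dim=-1)              # normalize over the selected k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

# Example: 64 experts with top-2 routing -> roughly 2/64 of the expert parameters
# are activated per token (placeholder numbers, not Qwen3-Next's).
layer = SparseMoELayer(d_model=256, d_ff=1024, num_experts=64, top_k=2)
tokens = torch.randn(8, 256)
print(layer(tokens).shape)  # torch.Size([8, 256])
```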
Notes: 6 FLOP/parameter/token * 3e9 active parameters * 15e12 pre-training tokens = 2.7e+23 FLOP. From the announcement: "It uses less than 80% of the GPU hours needed by Qwen3-30B-A3B, and only 9.3% of the compute cost of Qwen3-32B — while achieving better performance." (For comparison, the same estimate gives 6.48e+23 FLOP for Qwen3-30B-A3B and 7.0848e+24 FLOP for Qwen3-32B.)
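The estimate follows the standard 6·N·D approximation (6 FLOP per active parameter per training token). The sketch below reproduces the figures quoted in these notes; the 32.8B parameter count for Qwen3-32B and the 36T-token counts for the comparison models are the assumptions implied by the quoted numbers.

```python
# Sketch: reproducing the 6*N*D training-compute estimates quoted above.
# Assumed inputs: 3e9 active params / 15e12 tokens for Qwen3-Next,
# 3e9 active params / 36e12 tokens for Qwen3-30B-A3B, and
# 32.8e9 params / 36e12 tokens for Qwen3-32B (the values that reproduce 7.0848e+24).

FLOP_PER_PARAM_PER_TOKEN = 6  # forward + backward approximation

def train_compute(active_params: float, tokens: float) -> float:
    """Approximate pretraining compute in FLOP: 6 * N_active * D_tokens."""
    return FLOP_PER_PARAM_PER_TOKEN * active_params * tokens

qwen3_next    = train_compute(3e9, 15e12)     # 2.7e+23 FLOP
qwen3_30b_a3b = train_compute(3e9, 36e12)     # 6.48e+23 FLOP
qwen3_32b     = train_compute(32.8e9, 36e12)  # 7.0848e+24 FLOP

print(f"Qwen3-Next-80B-A3B : {qwen3_next:.4e} FLOP")
print(f"Qwen3-30B-A3B      : {qwen3_30b_a3b:.4e} FLOP")
print(f"Qwen3-32B          : {qwen3_32b:.4e} FLOP")

# Note: the quoted "9.3%" refers to GPU-hour cost, not this 6ND FLOP ratio
# (2.7e23 / 7.0848e24 is roughly 3.8%).
```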
Size Notes: Training stages: pretraining (15T tokens) and post-training. "Qwen3-Next is trained on a uniformly sampled subset (15T tokens) of Qwen3's 36T-token pretraining corpus."
Notes: 80B total parameters, 3B activated (A3B).