Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models.
Compute Notes:
Parameter-count method: 6 FLOP/parameter/token * 14e9 active parameters * 11.328e12 tokens = 9.51552e+23 FLOP
Hardware method: 989e12 FLOP/GPU/sec [H800 assumed] * 1,456,000 GPU-hours * 3,600 sec/hour * 0.3 [assumed utilization] = 1.55518272e+24 FLOP
Geometric mean of the two estimates: sqrt(9.51552e+23 * 1.55518272e+24) = 1.2164856e+24 FLOP
Size Notes:
pre-training: "11.2T high-quality tokens"
long context: 128B tokens
total: 11.328T tokens
Parameter Notes: "a large-scale MoE model that activates 14 billion parameters out of a total of 142 billion parameters"
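For reference, a minimal Python sketch reproducing the compute estimate above. It assumes an H800 peak of 989 TFLOP/s and 30% utilization, as stated in the notes; variable names are illustrative, not from the paper.

```python
# Rough reproduction of the dots.llm1 training-compute estimate.
# Inputs are taken from the notes above; the H800 peak throughput and
# 30% utilization are assumptions, not figures reported by the paper.

active_params = 14e9               # activated parameters per token (of 142e9 total)
tokens = 11.2e12 + 128e9           # 11.2T pre-training + 128B long context = 11.328T

# Method 1: parameter/token approximation (6 FLOP per active parameter per token)
flop_from_params = 6 * active_params * tokens                 # ~9.52e23 FLOP

# Method 2: hardware accounting
peak_flops_per_gpu = 989e12        # FLOP/s, H800 assumed
gpu_hours = 1_456_000
utilization = 0.3                  # assumed
flop_from_hardware = peak_flops_per_gpu * gpu_hours * 3600 * utilization  # ~1.56e24 FLOP

# Combine the two estimates with a geometric mean
estimate = (flop_from_params * flop_from_hardware) ** 0.5     # ~1.22e24 FLOP

print(f"params-based: {flop_from_params:.4e} FLOP")
print(f"hardware-based: {flop_from_hardware:.4e} FLOP")
print(f"geometric mean: {estimate:.4e} FLOP")
```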