In this report, we introduce ERNIE 4.5, a new family of large-scale multimodal models comprising 10 distinct variants. The model family consists of Mixture-of-Experts (MoE) models with 47B and 3B active parameters, with the largest model having 424B total parameters, as well as a 0.3B dense model. For the MoE architecture, we propose a novel heterogeneous modality structure, which supports parameter sharing across modalities while also allowing dedicated parameters for each individual modality. This MoE architecture has the advantage of enhancing multimodal understanding without compromising, and even improving, performance on text-related tasks. All of our models are trained with optimal efficiency using the PaddlePaddle deep learning framework, which also enables high-performance inference and streamlined deployment. We achieve 47% Model FLOPs Utilization (MFU) in pre-training of our largest ERNIE 4.5 language model. Experimental results show that our models achieve state-of-the-art performance across multiple text and multimodal benchmarks, especially in instruction following, world knowledge memorization, visual understanding, and multimodal reasoning. All models are publicly accessible under Apache 2.0 to support future research and development in the field. Additionally, we open-source the development toolkits for ERNIE 4.5, featuring industrial-grade capabilities, resource-efficient training and inference workflows, and multi-hardware compatibility.
Notes: The report states the model was trained on "trillions of tokens." Speculatively assuming 10T tokens: 6 FLOP / token / parameter * 47 * 10^9 active parameters * 10 * 10^12 assumed training tokens = 2.82e+24 FLOP (see the worked sketch after the notes below).
Size Notes: "We commence with large-scale pre-training on trillions of pure-text tokens sourced from diverse domains."
Notes: MoE (largest language model): 300B total parameters, 47B active parameters.
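A minimal Python sketch of the compute estimate above, using the standard C ≈ 6 * N * D approximation with N as the 47B active-parameter count; the 10T token count is a speculative assumption, since the report only says "trillions of tokens."

# Back-of-the-envelope training compute estimate via C ≈ 6 * N * D,
# where N = active parameters per token and D = training tokens.
FLOP_PER_TOKEN_PER_PARAM = 6      # forward + backward pass approximation
ACTIVE_PARAMS = 47e9              # 47B active parameters per token (MoE)
ASSUMED_TOKENS = 10e12            # assumed 10T training tokens (speculative)

training_flop = FLOP_PER_TOKEN_PER_PARAM * ACTIVE_PARAMS * ASSUMED_TOKENS
print(f"Estimated training compute: {training_flop:.2e} FLOP")  # ~2.82e+24 FLOP

Because only the active parameters participate in each token's forward and backward pass, the MoE estimate uses the 47B active count rather than the 300B (or 424B) total parameter count.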