In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video by predicting the next token in a video sequence.
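The abstract's core idea, one causal transformer trained with plain next-token cross-entropy over a shared discrete vocabulary of text and vision codes, can be illustrated with a minimal PyTorch sketch. All sizes and module choices below are toy assumptions for clarity, not the actual Emu3 architecture or vocabulary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D, SEQ = 50_000, 256, 128  # toy sizes, not Emu3's real dimensions

embed = nn.Embedding(VOCAB, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(D, VOCAB)

# A mixed multimodal sequence: text token ids interleaved with discrete
# image/video codes, all drawn from one shared vocabulary.
tokens = torch.randint(0, VOCAB, (1, SEQ))

causal = nn.Transformer.generate_square_subsequent_mask(SEQ)
hidden = backbone(embed(tokens), mask=causal, is_causal=True)
logits = head(hidden)

# Standard next-token objective: predict token t+1 from the prefix 0..t,
# regardless of whether the target token encodes text or pixels.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB),
    tokens[:, 1:].reshape(-1),
)
```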
Training Code Accessibility: Apache 2.0. Model weights are released under Apache 2.0 at https://huggingface.co/BAAI/Emu3-Gen, and inference and SFT training code under Apache 2.0 at https://github.com/baaivision/Emu3; the pre-training code release is still on the to-do list as of June 2025.
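Since the weights are on the Hugging Face Hub, a minimal loading sketch follows, assuming the standard transformers remote-code path; the exact processor setup and generation recipe should be taken from the Emu3 GitHub README, and the dtype/device choices here are assumptions for fitting an 8B model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu3-Gen",
    torch_dtype=torch.bfloat16,  # assumption: bf16 to fit an 8B model in memory
    device_map="auto",           # requires accelerate to be installed
    trust_remote_code=True,      # Emu3 ships custom modeling code on the Hub
)
tokenizer = AutoTokenizer.from_pretrained(
    "BAAI/Emu3-Gen", trust_remote_code=True
)
# Decoding generated vision tokens back to pixels requires the separate
# vision tokenizer described in the repository (not shown here).
```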
Parameters: 8,000,000,000 (8B)