We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities: it achieves state-of-the-art performance on image captioning tasks, outperforms Llama-2 on text-only tasks while remaining competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or the outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in unified modeling of full multimodal documents.
Notes:
GPU-hours method: Table 2 shows that 34B model pre-training uses 4,282,407 GPU-hours, trained across 3072 A100s.
3.12e14 FLOP/s (A100 peak BF16) * 4,282,407 GPU-hours * 3600 s/hour * 0.3 utilization = 1.44e24 FLOP
Parameter-token method: pre-training goes over 9.2T tokens; post-training only goes over 1.1B tokens (sum of the tokens column in Table 3), which is negligible.
6 * 34B params * 9.2T tokens = 1.88e24 FLOP
Geometric mean of the two estimates: sqrt(1.44e24 * 1.88e24) = 1.65e24 FLOP
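A minimal Python sketch of the arithmetic above, for transparency. The GPU-hour and token figures come from the paper's tables as quoted in the notes; the A100 peak throughput (3.12e14 FLOP/s, BF16) and the 0.3 utilization factor are assumptions of this estimate, not values reported in the paper.

```python
import math

# Reported figures for Chameleon-34B (per the notes above).
GPU_HOURS = 4_282_407        # 34B pre-training GPU-hours (Table 2), across 3072 A100s
PARAMS = 34e9                # model parameters
PRETRAIN_TOKENS = 9.2e12     # total tokens seen in pre-training

# Assumptions of this estimate.
A100_PEAK_FLOPS = 3.12e14    # assumed peak BF16 throughput per A100, FLOP/s
UTILIZATION = 0.3            # assumed hardware utilization

# GPU-hours method: peak FLOP/s * seconds of GPU time * utilization.
gpu_estimate = A100_PEAK_FLOPS * GPU_HOURS * 3600 * UTILIZATION

# Parameter-token method: ~6 FLOP per parameter per training token.
token_estimate = 6 * PARAMS * PRETRAIN_TOKENS

# Combine the two estimates with a geometric mean.
combined = math.sqrt(gpu_estimate * token_estimate)

print(f"GPU-hours estimate:       {gpu_estimate:.2e} FLOP")    # ~1.44e24
print(f"Parameter-token estimate: {token_estimate:.2e} FLOP")  # ~1.88e24
print(f"Geometric mean:           {combined:.2e} FLOP")        # ~1.65e24
```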
Size Notes: Slightly conflicting info. The pre-training data description lists data types that sum to 4.8 trillion tokens, but Table 1 indicates 4.4T. Using the table value, as it agrees with other statements about epochs and total tokens seen.