Mixtral 8x22B is our latest open model. It sets a new standard for performance and efficiency within the AI community. It is a sparse Mixture-of-Experts (SMoE) model that uses only 39B active parameters out of 141B, offering unparalleled cost efficiency for its size. Mixtral 8x22B comes with the following strengths:
- It is fluent in English, French, Italian, German, and Spanish
- It has strong mathematics and coding capabilities
- It is natively capable of function calling; along with the constrained output mode implemented on la Plateforme, this enables application development and tech stack modernisation at scale
- Its 64K-token context window allows precise information recall from large documents
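As a rough illustration of why only a fraction of the parameters is active per token, here is a minimal sketch of top-2 expert routing in NumPy. The layer sizes, router, and expert matrices are placeholders, not Mixtral's actual implementation; top-2 routing over 8 experts is consistent with the 39B-active-of-141B figure.

```python
# Illustrative sparse MoE routing sketch (placeholder shapes, not Mixtral's real layer).
import numpy as np

def moe_layer(x, expert_weights, gate_weights, top_k=2):
    """Route a token vector x through the top_k highest-scoring experts.

    x:              (d,) token hidden state
    expert_weights: list of (d, d) matrices, stand-ins for per-expert FFNs
    gate_weights:   (d, n_experts) router matrix
    """
    logits = x @ gate_weights                      # one router score per expert
    top = np.argsort(logits)[-top_k:]              # indices of the top_k experts
    probs = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over selected experts
    # Only the selected experts' parameters are used for this token,
    # which is why the "active" parameter count is far below the total.
    return sum(p * (expert_weights[i] @ x) for p, i in zip(probs, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.normal(size=d)
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate = rng.normal(size=(d, n_experts))
print(moe_layer(x, experts, gate).shape)  # (16,)
```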
Notes: Assuming the model was trained on roughly 1-10 trillion tokens (the same order of magnitude as the models in the Figure 1 comparison; Llama 2 was trained on 2T tokens, and Mistral Small 3 was trained on 8T tokens), we can estimate training compute with "speculative" confidence: 6 FLOP/token/parameter * 39 * 10^9 active parameters * 10 * 10^12 tokens [speculatively] = 2.34e+24 FLOP
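A quick check of that arithmetic, using the common C ≈ 6·N·D approximation (N = active parameters, D = training tokens); the 10T token count is the speculative assumption from the note above, not a confirmed figure:

```python
# Training-compute estimate via C ≈ 6 * N * D, counting only active
# parameters for a sparse MoE model.
active_params = 39e9          # active parameters per token (of 141B total)
tokens = 10e12                # assumed training tokens (speculative, ~1-10T range)
flop_per_param_token = 6      # forward + backward FLOP per parameter per token

training_flop = flop_per_param_token * active_params * tokens
print(f"{training_flop:.2e} FLOP")  # 2.34e+24 FLOP
```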
Notes: 141B params, 39B active: https://mistral.ai/news/mixtral-8x22b/