Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
FLOPs: 1.22e+24
Notes: The Hugging Face model card says Llama 3.1-8B used 1.46M H100 hours and trained on over 15T tokens: https://huggingface.co/meta-llama/Llama-3.1-70B
The paper also reports that 3.1-405B achieved an MFU between 38% and 43%; 8B was presumably around the same or a bit higher, so I'll assume a utilization of 40%.
6ND: 6 * 15T tokens * 8B parameters = 7.2e23 FLOPs
Hardware: 1.46M hours * 3600 s/hour * 9.9e14 FLOP/s * 0.4 = 2.08e24 FLOPs
Geometric mean: sqrt(7.2e23 * 2.08e24) = 1.224e24 FLOPs
Note that Llama 3-8B also reported 15T training tokens but only 1.3M H100 hours, which suggests 3.1 may have used somewhat more than 15T tokens.
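The estimate above can be reproduced with a short script. This is a sketch of the calculation only; the H100 peak throughput (9.9e14 FLOP/s) and the 40% utilization are the assumptions stated in the notes, not reported figures.

```python
import math

# Reproduce the training-compute estimate for Llama 3.1-8B from the notes:
# two independent estimates, combined via their geometric mean.

tokens = 15e12          # ~15T training tokens (from the model card)
params = 8e9            # 8B parameters
gpu_hours = 1.46e6      # reported H100 hours
peak_flops = 9.9e14     # ASSUMED H100 peak throughput, FLOP/s
utilization = 0.40      # ASSUMED MFU, based on the paper's 38-43% range

# Method 1: the 6ND approximation (6 FLOPs per parameter per token)
flops_6nd = 6 * tokens * params                          # 7.2e23

# Method 2: hardware-time accounting
flops_hw = gpu_hours * 3600 * peak_flops * utilization   # ~2.08e24

# Combine the two estimates with a geometric mean
estimate = math.sqrt(flops_6nd * flops_hw)
print(f"{estimate:.3e}")  # ~1.22e24 FLOPs
```

The geometric mean is used because the two methods over- and under-estimate in different ways, and averaging in log space keeps a single outlier from dominating.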
Training Code Accessibility: Llama 3.1 license: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE
A separate license must be sought if over 700M monthly users; acceptable-use restrictions apply.
Code: https://github.com/meta-llama/llama3/tree/main
Parameters: 8,000,000,000
Notes: 8B