FLOPs: 7.861e+24
Notes: Arithmetic calculation: 6 * 15T tokens * 70B parameters = 6.3e24.
GPU calculation: https://huggingface.co/meta-llama/Meta-Llama-3-70B indicates training took 6.4M GPU-hours. The larger-scale 405B training runs achieved between 0.38 and 0.41 MFU; the 70B model presumably reaches at least 0.43 utilization, since the 405B model must be split across two nodes while the 70B should fit on one. 990 TFLOPS per GPU * 6.4 million GPU-hours * 3600 s * 0.43 = 9.808e24.
Geometric mean: sqrt(6.3e24 * 9.808e24) = 7.861e24 (see the sketch below).
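A minimal Python sketch of the estimate above, assuming the stated inputs (15T tokens, 70B parameters, 6.4M GPU-hours, 990 TFLOPS peak per H100, and the assumed 0.43 MFU):

    import math

    # Method 1: compute heuristic, FLOPs = 6 * tokens * parameters.
    tokens = 15e12   # 15T training tokens
    params = 70e9    # 70B parameters
    flops_arithmetic = 6 * tokens * params  # = 6.3e24

    # Method 2: hardware accounting from reported GPU-hours.
    peak_flops = 990e12  # H100 SXM5 peak dense BF16 throughput, FLOP/s
    gpu_hours = 6.4e6    # reported training time, GPU-hours
    mfu = 0.43           # assumed utilization (above the 0.38-0.41 observed at 405B scale)
    flops_hardware = peak_flops * gpu_hours * 3600 * mfu  # = 9.808e24

    # Final estimate: geometric mean of the two methods.
    print(f"{math.sqrt(flops_arithmetic * flops_hardware):.3e}")  # 7.861e+24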
Training Code Accessibility: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
License: A custom commercial license is available at: https://llama.meta.com/llama3/license
Training Dataset: Llama 3 dataset
Dataset Size: 15,000,000,000,000 tokens (15T)
Hardware: NVIDIA H100 SXM5 80GB
Dataset Notes: Llama 3 is pretrained on over 15T tokens, all collected from publicly available sources. Per Meta, the training dataset is seven times larger than that used for Llama 2 and includes four times more code. To prepare for upcoming multilingual use cases, over 5% of the Llama 3 pretraining dataset consists of high-quality non-English data covering over 30 languages.
Parameters: 70,000,000,000 (70B)