Notes:

Arithmetic calculation: using the 6ND approximation (6 FLOPs per parameter per training token), 6 * 15T tokens * 70B parameters = 6.3e24 FLOPs.

GPU calculation: https://huggingface.co/meta-llama/Meta-Llama-3-70B indicates training took 6.4M GPU-hours. We also know Meta's larger-scale 405B training runs achieved between 0.38 and 0.41 MFU; presumably the 70B model gets at least 0.43 utilization (405B has to be split across two nodes, while 70B should fit on one). 990 TFLOPS per GPU * 6.4 million GPU-hours * 3600 s/hour * 0.43 = 9.808e24 FLOPs.

Geometric mean of the two estimates: sqrt(6.3e24 * 9.808e24) = 7.861e24 FLOPs.
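
For reference, a minimal Python sketch of the same arithmetic; the 0.43 MFU for the 70B run is the assumption stated above, and 990 TFLOPS is the quoted per-GPU peak throughput:

```python
import math

# Estimate 1: 6ND approximation (6 FLOPs per parameter per training token)
tokens = 15e12          # 15T training tokens
params = 70e9           # 70B parameters
flops_arithmetic = 6 * tokens * params            # ~6.3e24 FLOPs

# Estimate 2: hardware-side estimate from reported GPU-hours
peak_flops = 990e12     # per-GPU peak throughput, FLOP/s
gpu_hours = 6.4e6       # reported training GPU-hours
mfu = 0.43              # assumed utilization for the 70B run
flops_hardware = peak_flops * gpu_hours * 3600 * mfu   # ~9.808e24 FLOPs

# Combine the two estimates with a geometric mean
flops_estimate = math.sqrt(flops_arithmetic * flops_hardware)

print(f"{flops_arithmetic:.3e}, {flops_hardware:.3e}, {flops_estimate:.3e}")
# -> 6.300e+24, 9.808e+24, 7.861e+24
```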