Notes: Estimations for the 82B model (marked as lower-bound estimations).

Token count justification: "For experiments in Section 4, the model trained with 150B is used for fair comparison, because not all models are finished training at the same iteration. However, experiments in Section 5.2 use the model trained with 300B tokens, as HyperCLOVA Studio provided the 39B and 82B models trained with 300B tokens."

Parameter-based estimate:
82e9 connections * 2 FLOP/connection * 300e9 tokens * 3 (backward pass multiplier) = 1.476e23 FLOP

A calculation using GPU time corroborates this (arithmetic for both estimates is reproduced in the sketch below):
- "Our model is based on megatron-LM (Shoeybi et al., 2019) and trained on the NVIDIA Superpod, which includes 128 strongly clustered DGX servers with 1,024 A100 GPUs."
- "It takes 13.4 days to train a model with 82B parameters with 150B tokens." Assume 300B tokens takes twice as long: 26.8 days.
- Assume the default 30% utilization rate for large language models.

1024 A100 GPUs * 312e12 FLOP/second * 0.3 utilization * 26.8 days * 24 * 60 * 60 seconds/day = 2.219e23 FLOP
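A minimal Python sketch that reproduces both estimates from the numbers quoted above; the 30% utilization rate and the "300B tokens takes twice as long" scaling are assumptions stated in the notes, not figures from the paper.

```python
# Reproduce the two compute estimates for the HyperCLOVA 82B model.

# --- Parameter-based estimate (6ND-style) ---
params = 82e9            # model parameters ("connections")
tokens = 300e9           # training tokens (300B-token checkpoint)
flop_per_connection = 2  # forward-pass FLOP per connection per token
backward_factor = 3      # multiplier to include the backward pass

param_based = params * flop_per_connection * tokens * backward_factor
print(f"Parameter-based estimate: {param_based:.3e} FLOP")  # ~1.476e+23

# --- GPU-time-based estimate ---
gpus = 1024
a100_peak = 312e12       # FLOP/s peak per A100
utilization = 0.3        # assumed utilization rate
days = 13.4 * 2          # 13.4 days for 150B tokens, doubled for 300B (assumption)
seconds = days * 24 * 60 * 60

gpu_time_based = gpus * a100_peak * utilization * seconds
print(f"GPU-time estimate:        {gpu_time_based:.3e} FLOP")  # ~2.219e+23
```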
Size Notes: https://twitter.com/arankomatsuzaki/status/1397583304610783238 https://venturebeat.com/ai/naver-trained-a-gpt-3-like-korean-language-model/
Notes: https://www.navercorp.com/navercorp_/ir/announce/2023/NAVER_CEO%20letter%20to%20shareholders_Aug%202023_Eng.pdf