GPT-3 demonstrates the remarkable in-context learning ability of large-scale language models (LMs) trained on hundreds of billions of tokens. Here we address some remaining issues less reported in the GPT-3 paper, such as a non-English LM, the performance of different-sized models, and the effect of recently introduced prompt optimization on in-context learning. To achieve this, we introduce HyperCLOVA, a Korean variant of the 82B GPT-3 trained on a Korean-centric corpus of 560B tokens. Enhanced by our Korean-specific tokenization, HyperCLOVA with our training configuration shows state-of-the-art in-context zero-shot and few-shot learning performance on various downstream tasks in Korean. We also show the performance benefits of prompt-based learning and demonstrate how it can be integrated into the prompt engineering pipeline. We then discuss the possibility of materializing the No Code AI paradigm by providing AI prototyping capabilities to non-experts of ML through HyperCLOVA Studio, an interactive prompt engineering interface. Lastly, we demonstrate the potential of our methods with three successful in-house applications.
Notes: "For experiments in Section 4, the model trained with 150B is used for fair comparison, because not all models are finished training at the same iteration. However, experiments in Section 5.2 use the model trained with 300B tokens, as HyperCLOVA Studio provided the 39B and 82B models trained with 300B tokens." 82e9 connections * 2 FLOP/connection * 300e9 tokens * 3 backward pass = 1.476e23 FLOP Calculation using GPU time corroborates this: - "Our model is based on megatron-LM (Shoeybi et al., 2019) and trained on the NVIDIA Superpod, which includes 128 strongly clustered DGX servers with 1,024 A100 GPUs." - "It takes 13.4 days to train a model with 82B parameters with 150B tokens." Assume 300B tokens takes twice as long, 26.8 days. - Assume the default of 30% utilization rate for large language models. 1024 A100 GPUs * 312e12 FLOP/second * 0.3 utilization * 26.8 days * 24 * 60 * 60 seconds/day = 2.219e+23 FLOP
Size Notes: "However, experiments in Section 5.2 use the model trained with 300B tokens, as HyperCLOVA Studio provided the 39B and 82B models trained with 300B tokens." "We introduce HyperCLOVA, a large-scale Korean in-context learning-based LM with nearly 100B parameters, by constructing a large Korean-centric corpus of 560B tokens." Based on tokenizing the Hyperclova article itself using OpenAI's tiktoken BPE tokenizer (https://github.com/openai/tiktoken), there are 3285 tokens for 1069 words - about 3 tokens per word. This work uses a special tokenizer, but based on Figure 5 in Appendix E, the number of tokens seems similar between different tokenization methods. Based on that, 5.6e11 Korean tokens ~= 1.9e11 words
Notes: "We introduce a Korean in-context large-scale LM with 82B parameters, i.e., HyperCLOVA. This is the first discovery on near 100B-scale non-English LM." According to media reports, HyperCLOVA has 204B parameters (i.e. a different version than in the paper) https://m.koreaherald.com/view.php?ud=20210525000824