Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
Notes: Akronomicon states 1.04e+22 FLOP. Archived source: https://github.com/lightonai/akronomicon/tree/main/akrodb
However, this seems dubiously high.
"We pre-train each model for 2^19 = 524,288 steps on C4 before fine-tuning."
"In total, this batch size and number of steps corresponds to pre-training on 2^35 ≈ 34B tokens."
"To compare these mixing strategies on equal footing with our baseline pre-train-then-fine-tune results, we train multi-task models for the same total number of steps: 2^19 + 2^18 = 786,432"
Using the 6ND approximation gives: 6 FLOP/token/param * 2^35 pretrain tokens * (1 + 1/2 fine-tune tokens per pretrain token) * 1 iteration of training data * 2.8 billion parameters = 8.659e20 FLOP
https://www.wolframalpha.com/input?i=6+*+2%5E35+*+2.8+billion+*+1.5
Update: 9.0E+21 per the FLAN paper from Google: https://arxiv.org/pdf/2210.11416.pdf
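A quick sanity check of the arithmetic above as a short Python sketch. The parameter and token counts are the ones quoted from the paper; the 1.5x factor follows the note's assumption that fine-tuning adds 2^18 steps (half as many tokens again) on top of the 2^19 pre-training steps.

```python
# Rough training-compute estimate for the T5 "3B" variant using the C ≈ 6 * N * D approximation.
params = 2.8e9              # "around 2.8 billion parameters"
pretrain_tokens = 2 ** 35   # "pre-training on 2^35 ≈ 34B tokens"
finetune_factor = 1.5       # pre-train tokens plus half again for fine-tuning (assumption from the note)

flop = 6 * params * pretrain_tokens * finetune_factor
print(f"{flop:.3e} FLOP")   # ≈ 8.659e+20 FLOP, matching the note
```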
Size Notes: "This produces a collection of text that is not only orders of magnitude larger than most data sets used for pre-training (about 750 GB) but also comprises reasonably clean and natural English text. We dub this data set the “Colossal Clean Crawled Corpus” (or C4 for short) and release it as part of TensorFlow Datasets"
750 GB * 200M words/GB = 1.5e11 ≈ 150 billion words.
"In total, this batch size and number of steps corresponds to pre-training on 2^35 ≈ 34B tokens."
"Note that 2^35 tokens only covers a fraction of the entire C4 data set, so we never repeat any data during pre-training."
At roughly 0.75 words per token, 34B tokens ≈ 25.5 billion words, so the fraction of C4 seen is 25.5 billion / 150 billion ≈ 0.17 epochs.
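The epoch estimate above, spelled out in Python. The ~200M words/GB and ~0.75 words/token conversion factors are the rules of thumb implied by the note, not figures from the paper.

```python
# Fraction of C4 covered during pre-training, under the note's conversion assumptions.
c4_gb = 750
words_per_gb = 200e6                     # assumed ~200M words per GB of text
c4_words = c4_gb * words_per_gb          # ≈ 1.5e11 = 150 billion words

pretrain_tokens = 2 ** 35                # ≈ 34 billion tokens
words_per_token = 0.75                   # assumed conversion from tokens to words
pretrain_words = pretrain_tokens * words_per_token   # ≈ 25.5–26 billion words

print(f"epochs ≈ {pretrain_words / c4_words:.2f}")   # ≈ 0.17
```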
Notes: page 37, "3B" and "11B" variants. "To further explore what kind of performance is possible when using larger models, we consider two additional variants. In both cases, we use d_model = 1024, a 24 layer encoder and decoder, and d_kv = 128. For the “3B” variant, we use d_ff = 16,384 with 32-headed attention, which results in around 2.8 billion parameters; for “11B” we use d_ff = 65,536 with 128-headed attention producing a model with about 11 billion parameters"
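A minimal sketch of where the "around 2.8 billion" and "about 11 billion" figures come from, given the quoted hyperparameters. It counts only attention projections, feed-forward layers, and the shared token embedding; layer norms and relative-position biases are ignored, and the 32,128-entry vocabulary is an assumption taken from the released T5 configurations.

```python
# Approximate parameter counts for the T5 "3B" and "11B" variants.
def t5_params(d_model, d_kv, n_heads, d_ff, n_layers, vocab=32128):
    inner = n_heads * d_kv           # width of the attention projections
    attn = 4 * d_model * inner       # Q, K, V and output projections
    ffn = 2 * d_model * d_ff         # wi and wo matrices
    enc_layer = attn + ffn           # self-attention + FFN
    dec_layer = 2 * attn + ffn       # self-attention + cross-attention + FFN
    return n_layers * (enc_layer + dec_layer) + vocab * d_model

print(f"3B:  {t5_params(1024, 128, 32, 16_384, 24) / 1e9:.2f}B params")   # ≈ 2.85B
print(f"11B: {t5_params(1024, 128, 128, 65_536, 24) / 1e9:.2f}B params")  # ≈ 11.31B
```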