Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
Notes: The architecture is sparse, so we cannot use the 6ND method. From section 3.1.1: "we simply replace the encoder self-attention operation in T5 with a sparse sliding-window local attention operation following the implementation in ETC". At the end of section 3.1.2, the complexity of the TGlobal attention is given as O(l(r + l/k)). From section 4.1.1: "We pre-train LongT5 models for 1M steps on 4096 input sequence length and 910 output sequence length." Batch size is 128 (from the configurations in section 4.1). With l = 4096, r = 127, k = 16, we get l(r + l/k) = 4096 * (127 + 256) = 1,568,768, but we are not sure about the constant factor. Standard full attention has complexity O(l^2), and l^2 = 16,777,216; 16,777,216 / 1,568,768 ≈ 10.7. We can therefore roughly estimate that LongT5's attention uses about 10 times less compute than the standard full-attention architecture.
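A minimal Python sketch of the ratio calculation above, assuming the comparison covers only the encoder self-attention cost (constant factors and the rest of the network are ignored):

# Rough ratio of full self-attention cost to TGlobal attention cost,
# using the complexity expressions quoted above (constants ignored).
l = 4096   # input sequence length (section 4.1.1)
r = 127    # local attention radius (section 4.1 configurations)
k = 16     # TGlobal block size (section 4.1 configurations)

full_attention = l * l                # O(l^2) = 16,777,216
tglobal_attention = l * (r + l / k)   # O(l(r + l/k)) = 1,568,768

print(full_attention / tglobal_attention)  # ~10.7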
Training Code Accessibility: Apache 2.0: https://github.com/google-research/longt5; train code: https://github.com/google-research/longt5/blob/master/longt5/tasks.py
Hardware: Google TPU v3
Hardware Quantity: 128
Size Notes: Size of C4: per https://huggingface.co/datasets/c4, the C4 dataset is a collection of about 750 GB of English-language text. 200M words/GB * 4/3 tokens/word * 750 GB = 200,000,000,000 tokens (200B). Actual tokens seen: 1M steps * (4096 input length + 910 output length) * 128 batch size ≈ 641B tokens, so around 3.2 epochs.
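A short Python sketch of the epoch estimate above, assuming the note's rule-of-thumb conversion factors (200M words per GB, 4/3 tokens per word) and the quoted training configuration:

# Rough C4 size and epochs-seen estimate, using the conversion factors from the note above.
words_per_gb = 200e6      # rule-of-thumb words per GB of English text
tokens_per_word = 4 / 3   # rule-of-thumb tokens per word
c4_size_gb = 750          # approximate size of C4 in GB

c4_tokens = words_per_gb * tokens_per_word * c4_size_gb   # ~200B tokens

steps = 1_000_000
seq_len = 4096 + 910      # input length + output length
batch_size = 128

tokens_seen = steps * seq_len * batch_size                # ~641B tokens
print(tokens_seen / c4_tokens)                            # ~3.2 epochs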
Parameters: 3,000,000,000
Notes: 3B parameter count from section 4.1.