Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call *Transient Global* (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
LongT5 is an extension of the T5 model that handles long sequence inputs more efficiently. We integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization and question answering tasks, as well as outperform the original T5 models on these tasks.
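For intuition, below is a minimal, illustrative sketch (not the Flaxformer implementation; the function name and NumPy mask representation are ours) of the attention pattern TGlobal induces: each input token attends to a local window of radius r around its position plus one transient global token per block of k inputs. In the model itself those global tokens are computed on the fly by aggregating the token embeddings within each block, so no extra side-inputs are needed.

```python
import numpy as np

def tglobal_attention_mask(seq_len: int, local_radius: int, block_size: int) -> np.ndarray:
    """Boolean mask of shape [seq_len, seq_len + num_blocks].

    Columns 0..seq_len-1 are the regular input tokens (local attention only);
    columns seq_len..seq_len+num_blocks-1 are the transient global tokens,
    one per block of `block_size` inputs, which every token may attend to.
    """
    num_blocks = -(-seq_len // block_size)  # ceil(seq_len / block_size)
    positions = np.arange(seq_len)
    mask = np.zeros((seq_len, seq_len + num_blocks), dtype=bool)
    # Local attention: token i attends to tokens within +/- local_radius of i.
    mask[:, :seq_len] = np.abs(positions[:, None] - positions[None, :]) <= local_radius
    # Transient global attention: every token also attends to every block summary.
    mask[:, seq_len:] = True
    return mask

# Paper settings: r = 127, k = 16; at l = 4096 each token attends to
# O(r + l/k) keys instead of O(l), which is where the savings come from.
m = tglobal_attention_mask(seq_len=4096, local_radius=127, block_size=16)
print(m.shape)          # (4096, 4352)
print(int(m[0].sum()))  # 384 keys for the first token (128 local + 256 global)
```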
LongT5 achieves state-of-the-art performance on several summarization benchmarks that require longer context or multi-document understanding. The table below shows ROUGE-1 scores. LongT5 base models are all reported with 4k input tokens; large and xl models are trained with 16k input tokens for arXiv, PubMed, and BigPatent, 8k for MultiNews, and 4k for MediaSum and CNN/Daily Mail.
| Model | arXiv | PubMed | BigPatent | MultiNews | MediaSum | CNN/Daily Mail |
|---|---|---|---|---|---|---|
| DANCER PEGASUS | 45.01 | 46.34 | - | - | - | - |
| BigBird-PEGASUS (large) | 46.63 | 46.32 | 60.64 | - | - | - |
| HAT-BART | 46.68 | 48.36 | - | - | - | 44.48 |
| LED (large) | 46.64 | - | - | - | - | - |
| PRIMER | 47.6 | - | - | 49.9 | - | - |
| TG-MultiSum | - | - | - | 47.10 | - | - |
| BART (large) | - | - | - | - | 35.09 | - |
| LongT5 base | 44.87 | 47.77 | 60.95 | 46.01 | 35.09 | 42.15 |
| LongT5 large | 48.28 | 49.98 | 70.38 | 47.18 | 35.53 | 42.49 |
| LongT5 xl | 48.35 | 50.23 | 76.87 | 48.17 | 36.15 | 43.94 |
For NQ, we compare T5.1.1 and LongT5 with TGlobal attention. We run T5.1.1 (1) with the default 512 input sequence length and (2) with the largest input sequence length that can fit into device memory, and use those as baselines. Since we are comparing against T5.1.1, for the LongT5 experiments we report results at 512 input length for base and large, and at the largest input length each model allows before running out of memory on the same hardware configuration used in our T5.1.1 experiments. For the base and large models, we used 4x8 TPUv3 and no model partitioning; for the xl model, we used 8x16 TPUv3 and 8 partitions.
| Model | EM | F1 |
|---|---|---|
| T5.1.1 base-512 | 50.93 | 52.54 |
| T5.1.1 base-6k | 56.73 | 56.73 |
| T5.1.1 large-512 | 57.29 | 60.68 |
| T5.1.1 large-3k | 60.09 | 64.17 |
| T5.1.1 xl-4k | 60.75 | 64.07 |
| LongT5 base-512 | 55.73 | 59.06 |
| LongT5 base-12k | 58.12 | 62.44 |
| LongT5 large-512 | 57.55 | 61.53 |
| LongT5 large-4k | 60.77 | 65.38 |
| LongT5 xl-8k | 62.66 | 66.61 |
Moreover, our Input Length vs. Speed and Input Length vs. Performance analyses on NQ show that (1) at shorter input lengths T5.1.1 and LongT5 variants run at similar speeds, but as the input length increases LongT5 becomes significantly faster, (2) T5.1.1 models reach their out-of-memory point much earlier than LongT5 models, and (3) performance increases significantly as the input length increases.
For TriviaQA, we compare LongT5 with various top approaches on the leaderboard. All LongT5 models are reported with 16k input tokens.
| Model | EM | F1 |
|---|---|---|
| BigBird-ETC (random attn) | 80.86 | 84.5 |
| Fusion-in-Decoder | 80.09 | 84.35 |
| ReadTwice | 76.86 | 80.85 |
| LongT5 base | 74.67 | 78.9 |
| LongT5 large | 78.38 | 82.45 |
| LongT5 xl | 81.00 | 84.83 |
Most of our tasks use TensorFlow Datasets, which works directly with SeqIO as used in the T5 library. For Natural Questions and MediaSum, however, we provide our own data preprocessing code. To run the tasks corresponding to these datasets, please point NQ_DATA_DIR and MEDIASUM_DATA_DIR in tasks.py to the output files produced by the preprocessing code.
Example command for running NQ data preprocessing:
```sh
# Data path where the NQ json files are downloaded to.
INPUT_PATH="..."
# Data path where the output files will be generated.
OUTPUT_PATH="..."
# Directory where the LongT5 repo is cloned.
LONGT5_DIR="..."

python3 ${LONGT5_DIR}/data/nq_preprocess.py \
  --input_path=${INPUT_PATH} \
  --output_path=${OUTPUT_PATH}
```
The experiments are defined in the tasks.py file. Our architecture, model, and training configuration setups can be found in the Flaxformer GitHub repository.
We have released the following checkpoints for LongT5 pre-trained models:
Additionally, we have released the following checkpoints for mLongT5 pre-trained models:
If you use LongT5 in your research, please cite LongT5: Efficient Text-To-Text Transformer for Long Sequences.
@inproceedings{guo2022longt5,
title = "{L}ong{T}5: {E}fficient Text-To-Text Transformer for Long Sequences",
author = "Mandy Guo and Joshua Ainslie and David Uthus and Santiago Onta{\~n}{\'o}n and Jianmo Ni and Yun-Hsuan Sung and Yinfei Yang",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
year = "2022",
url = "https://aclanthology.org/2022.findings-naacl.55",
pages = "724--736",
}
For mLongT5, please cite mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences.
@misc{uthus2023mlongt5,
title = "{mLongT5}: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences",
author = "David Uthus and Santiago Onta{\~n}{\'o}n and Joshua Ainslie and Mandy Guo",
year = "2023",
eprint = "2305.11129",
archivePrefix = "arXiv",
primaryClass = "cs.CL",
url = "https://arxiv.org/abs/2305.11129"
}
Notes: the architecture is sparse, so we cannot use the 6ND method. From section 3.1.1: "we simply replace the encoder self-attention operation in T5 with a sparse sliding-window local attention operation following the implementation in ETC". At the end of section 3.1.2, the complexity of TGlobal attention is given as O(l(r + l/k)). From section 4.1.1: "We pre-train LongT5 models for 1M steps on 4096 input sequence length and 910 output sequence length", with batch size 128 (from the configurations in section 4.1). With l = 4096, k = 16, and r = 127, l(r + l/k) = 1,568,768, though we are not sure about the constant factor. Standard attention has complexity O(l^2), with l^2 = 16,777,216, and 16,777,216 / 1,568,768 ≈ 10.7, so we can roughly estimate that LongT5's encoder self-attention uses about 10x less compute than a standard full-attention architecture at this input length.
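A small worked version of the arithmetic above (r = 127 and k = 16 from section 3.1.2, l = 4096 from section 4.1.1; constant factors are ignored, so this only compares the asymptotic terms):

```python
# Compare the encoder self-attention cost terms at the pre-training input length.
l, r, k = 4096, 127, 16          # input length, local radius, block size

tglobal_term = l * (r + l // k)  # O(l(r + l/k)) term for TGlobal attention
full_term = l * l                # O(l^2) term for standard full attention

print(tglobal_term)                         # 1568768
print(full_term)                            # 16777216
print(round(full_term / tglobal_term, 1))   # ~10.7x cheaper, ignoring constants
```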
Size notes: for the size of C4 (from https://huggingface.co/datasets/c4), the C4 dataset is a collection of about 750GB of English-language text. Estimating 200M words/GB * 4/3 tokens/word * 750GB gives roughly 200B tokens. Actual tokens seen: 1M steps * (4096 input length + 910 output length) * 128 batch size ≈ 641B tokens, so around 3.2 epochs.
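The same token-count arithmetic, spelled out (the 200M words/GB and 4/3 tokens/word figures are rough assumptions, as above):

```python
# Rough C4 size estimate (assumptions: ~200M words per GB, ~4/3 tokens per word).
c4_gb = 750
c4_tokens = 200e6 * (4 / 3) * c4_gb           # ~2.0e11 (~200B tokens)

# Tokens seen during pre-training (1M steps, batch size 128, 4096 + 910 tokens).
tokens_seen = 1_000_000 * 128 * (4096 + 910)  # ~6.4e11 (~641B tokens)

print(round(tokens_seen / c4_tokens, 1))      # ~3.2 epochs
```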
Notes: roughly 3B parameters for the xl model, per section 4.1.