Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call *Transient Global* (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
LongT5 is an extension of the T5 model that handles long sequence inputs more efficiently. We integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization and question answering tasks, as well as outperform the original T5 models on these tasks.
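For intuition, below is a minimal, illustrative sketch (not the Flaxformer implementation; the function name and NumPy mask representation are ours) of the attention pattern TGlobal induces: each input token attends to a local window of radius r around its position plus one transient global token per block of k inputs. In the model itself those global tokens are computed on the fly by aggregating the token embeddings within each block, so no extra side-inputs are needed.

```python
import numpy as np

def tglobal_attention_mask(seq_len: int, local_radius: int, block_size: int) -> np.ndarray:
    """Boolean mask of shape [seq_len, seq_len + num_blocks].

    Columns 0..seq_len-1 are the regular input tokens (local attention only);
    columns seq_len..seq_len+num_blocks-1 are the transient global tokens,
    one per block of `block_size` inputs, which every token may attend to.
    """
    num_blocks = -(-seq_len // block_size)  # ceil(seq_len / block_size)
    positions = np.arange(seq_len)
    mask = np.zeros((seq_len, seq_len + num_blocks), dtype=bool)
    # Local attention: token i attends to tokens within +/- local_radius of i.
    mask[:, :seq_len] = np.abs(positions[:, None] - positions[None, :]) <= local_radius
    # Transient global attention: every token also attends to every block summary.
    mask[:, seq_len:] = True
    return mask

# Paper settings: r = 127, k = 16; at l = 4096 each token attends to
# O(r + l/k) keys instead of O(l), which is where the savings come from.
m = tglobal_attention_mask(seq_len=4096, local_radius=127, block_size=16)
print(m.shape)          # (4096, 4352)
print(int(m[0].sum()))  # 384 keys for the first token (128 local + 256 global)
```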
LongT5 achieves state-of-the-art performance on several summarization benchmarks that require longer context or multi-document understanding. The table below shows ROUGE-1 scores. LongT5 base models are all reported with 4k input tokens; large and xl models are trained with 16k input tokens for arXiv, PubMed, and BigPatent, 8k for MultiNews, and 4k for MediaSum and CNN/Daily Mail.
| Model | arXiv | PubMed | BigPatent | MultiNews | MediaSum | CNN/Daily Mail |
|---|---|---|---|---|---|---|
| DANCER PEGASUS | 45.01 | 46.34 | - | - | - | - |
| BigBird-PEGASUS (large) | 46.63 | 46.32 | 60.64 | - | - | - |
| HAT-BART | 46.68 | 48.36 | - | - | - | 44.48 |
| LED (large) | 46.64 | - | - | - | - | - |
| PRIMER | 47.6 | - | - | 49.9 | - | - |
| TG-MultiSum | - | - | - | 47.10 | - | - |
| BART (large) | - | - | - | - | 35.09 | - |
| LongT5 base | 44.87 | 47.77 | 60.95 | 46.01 | 35.09 | 42.15 |
| LongT5 large | 48.28 | 49.98 | 70.38 | 47.18 | 35.53 | 42.49 |
| LongT5 xl | 48.35 | 50.23 | 76.87 | 48.17 | 36.15 | 43.94 |
For NQ, we compare T5.1.1 and LongT5 with TGlobal attention. We run T5.1.1 (1) with the default 512 input sequence length and (2) with the largest input sequence length that can fit into device memory, and use those as baselines. Since we are comparing against T5.1.1, for the LongT5 experiments we report results at 512 input length for base and large, and at the largest input length each model allows before running out of memory on the same hardware configuration used in our T5.1.1 experiments. For the base and large models, we used 4x8 TPUv3 and no model partitioning; for the xl model, we used 8x16 TPUv3 and 8 partitions.
| Model | EM | F1 |
|---|---|---|
| T5.1.1 base-512 | 50.93 | 52.54 |
| T5.1.1 base-6k | 56.73 | 56.73 |
| T5.1.1 large-512 | 57.29 | 60.68 |
| T5.1.1 large-3k | 60.09 | 64.17 |
| T5.1.1 xl-4k | 60.75 | 64.07 |
| LongT5 base-512 | 55.73 | 59.06 |
| LongT5 base-12k | 58.12 | 62.44 |
| LongT5 large-512 | 57.55 | 61.53 |
| LongT5 large-4k | 60.77 | 65.38 |
| LongT5 xl-8k | 62.66 | 66.61 |
Moreover, our Input Length vs. Speed and Input Length vs. Performance analyses on NQ show that (1) at shorter input lengths T5.1.1 and LongT5 variants run at similar speeds, but as the input length increases LongT5 becomes significantly faster, (2) T5.1.1 models reach their out-of-memory point much earlier than LongT5 models, and (3) performance increases significantly as the input length increases.
For TriviaQA, we compare LongT5 with various top approaches on the leaderboard. All LongT5 models are reported with 16k input tokens.
| Model | EM | F1 |
|---|---|---|
| BigBird-ETC (random attn) | 80.86 | 84.5 |
| Fusion-in-Decoder | 80.09 | 84.35 |
| ReadTwice | 76.86 | 80.85 |
| LongT5 base | 74.67 | 78.9 |
| LongT5 large | 78.38 | 82.45 |
| LongT5 xl | 81.00 | 84.83 |
Most of our tasks use TensorFlow Datasets, which works directly with SeqIO as used in the T5 library. For Natural Questions and MediaSum, however, we provide our own data preprocessing code. To run the tasks corresponding to these datasets, please point NQ_DATA_DIR and MEDIASUM_DATA_DIR in tasks.py to the output files produced by the preprocessing code.
Example command for running NQ data preprocessing:
```sh
# Data path where the NQ json files are downloaded to.
INPUT_PATH="..."
# Data path where the output files will be generated.
OUTPUT_PATH="..."
# Directory where the LongT5 repo is cloned.
LONGT5_DIR="..."

python3 ${LONGT5_DIR}/data/nq_preprocess.py \
  --input_path=${INPUT_PATH} \
  --output_path=${OUTPUT_PATH}
```
The experiments are defined in the tasks.py file. Our architecture, model, and training configuration setups can be found in the Flaxformer GitHub repository.
We have released the following checkpoints for LongT5 pre-trained models:
Additionally, we have released the following checkpoints for mLongT5 pre-trained models:
If you use LongT5 in your research, please cite LongT5: Efficient Text-To-Text Transformer for Long Sequences.
@inproceedings{guo2022longt5,
title = "{L}ong{T}5: {E}fficient Text-To-Text Transformer for Long Sequences",
author = "Mandy Guo and Joshua Ainslie and David Uthus and Santiago Onta{\~n}{\'o}n and Jianmo Ni and Yun-Hsuan Sung and Yinfei Yang",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
year = "2022",
url = "https://aclanthology.org/2022.findings-naacl.55",
pages = "724--736",
}
For mLongT5, please cite mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences.
@misc{uthus2023mlongt5,
title = "{mLongT5}: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences",
author = "David Uthus and Santiago Onta{\~n}{\'o}n and Joshua Ainslie and Mandy Guo",
year = "2023",
eprint = "2305.11129",
archivePrefix = "arXiv",
primaryClass = "cs.CL",
url = "https://arxiv.org/abs/2305.11129"
}
Notes: the architecture is sparse, so we cannot use the 6ND method. From section 3.1.1: "we simply replace the encoder self-attention operation in T5 with a sparse sliding-window local attention operation following the implementation in ETC". At the end of section 3.1.2, the complexity of TGlobal attention is given as O(l(r + l/k)). From section 4.1.1: "We pre-train LongT5 models for 1M steps on 4096 input sequence length and 910 output sequence length", with batch size 128 (from the configurations in section 4.1). With l = 4096, k = 16, and r = 127, l(r + l/k) = 1,568,768, though we are not sure about the constant factor. Standard attention has complexity O(l^2), with l^2 = 16,777,216, and 16,777,216 / 1,568,768 ≈ 10.7, so we can roughly estimate that LongT5's encoder self-attention uses about 10x less compute than a standard full-attention architecture at this input length.
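A small worked version of the arithmetic above (r = 127 and k = 16 from section 3.1.2, l = 4096 from section 4.1.1; constant factors are ignored, so this only compares the asymptotic terms):

```python
# Compare the encoder self-attention cost terms at the pre-training input length.
l, r, k = 4096, 127, 16          # input length, local radius, block size

tglobal_term = l * (r + l // k)  # O(l(r + l/k)) term for TGlobal attention
full_term = l * l                # O(l^2) term for standard full attention

print(tglobal_term)                         # 1568768
print(full_term)                            # 16777216
print(round(full_term / tglobal_term, 1))   # ~10.7x cheaper, ignoring constants
```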
Size notes: for the size of C4 (from https://huggingface.co/datasets/c4), the C4 dataset is a collection of about 750GB of English-language text. Estimating 200M words/GB * 4/3 tokens/word * 750GB gives roughly 200B tokens. Actual tokens seen: 1M steps * (4096 input length + 910 output length) * 128 batch size ≈ 641B tokens, so around 3.2 epochs.
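The same token-count arithmetic, spelled out (the 200M words/GB and 4/3 tokens/word figures are rough assumptions, as above):

```python
# Rough C4 size estimate (assumptions: ~200M words per GB, ~4/3 tokens per word).
c4_gb = 750
c4_tokens = 200e6 * (4 / 3) * c4_gb           # ~2.0e11 (~200B tokens)

# Tokens seen during pre-training (1M steps, batch size 128, 4096 + 910 tokens).
tokens_seen = 1_000_000 * 128 * (4096 + 910)  # ~6.4e11 (~641B tokens)

print(round(tokens_seen / c4_tokens, 1))      # ~3.2 epochs
```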
Notes: roughly 3B parameters for the xl model, per section 4.1.