Model Details

Domain:

Task:

Model Access:

Open weights (unrestricted)

AI Tools Usage

This model is commonly used behind the scenes in AI tools.

Introduction

Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention generally underperforms ordinary softmax attention. Moreover, current implementations of linear attention lack I/O-awareness and are thus slower than highly optimized implementations of softmax attention. This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability. The resulting implementation, dubbed FLASHLINEARATTENTION, is faster than FLASHATTENTION-2 (Dao, 2023) as a standalone layer even on short sequence lengths (e.g., 1K). We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates. When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention (GLA) Transformer is found to perform competitively against the LLaMA-architecture Transformer (Touvron et al., 2023) as well as recent linear-time-inference baselines such as RetNet (Sun et al., 2023a) and Mamba (Gu & Dao, 2023) on moderate-scale language modeling experiments. The GLA Transformer is especially effective at length generalization, enabling a model trained on 2K-token sequences to generalize to sequences longer than 20K tokens without significant perplexity degradation. For training speed, the GLA Transformer has higher throughput than a similarly sized Mamba model.
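To make the "parallel form versus RNN with a matrix-valued hidden state" statement concrete, here is a minimal pure-Python sketch of unnormalized causal linear attention computed both ways. It is an illustration only, not the FLASHLINEARATTENTION kernel: the function names, the toy inputs, and the optional `gates` argument are assumptions introduced here. The gate is modeled as a per-key-dimension decay applied to the state before each update, which is one plausible reading of "data-dependent gates"; the exact GLA parameterization is not given in this page.

```python
from typing import List, Optional

Matrix = List[List[float]]


def parallel_linear_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Attention-style form: o_t = sum over s <= t of (q_t . k_s) * v_s."""
    d_v = len(v[0])
    out = []
    for t in range(len(q)):
        o_t = [0.0] * d_v
        for s in range(t + 1):  # causal mask: only positions s <= t
            score = sum(qi * ki for qi, ki in zip(q[t], k[s]))
            for j in range(d_v):
                o_t[j] += score * v[s][j]
        out.append(o_t)
    return out


def recurrent_linear_attention(
    q: Matrix, k: Matrix, v: Matrix, gates: Optional[Matrix] = None
) -> Matrix:
    """RNN form with a d_k x d_v state: S_t = decay_t * S_{t-1} + k_t^T v_t, o_t = q_t S_t.

    `gates` is a hypothetical per-step, per-key-dimension decay; None means no decay.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # matrix-valued hidden state
    out = []
    for t in range(len(q)):
        for i in range(d_k):
            decay = 1.0 if gates is None else gates[t][i]
            for j in range(d_v):
                state[i][j] = decay * state[i][j] + k[t][i] * v[t][j]
        out.append([sum(q[t][i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


# Toy inputs (assumed values): 3 time steps, d_k = 2, d_v = 2.
q = [[1.0, 0.0], [0.5, 2.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
v = [[1.0, 0.0], [2.0, 1.0], [0.0, 4.0]]

parallel_out = parallel_linear_attention(q, k, v)
recurrent_out = recurrent_linear_attention(q, k, v)
```

Without gates, the two functions return the same outputs; the recurrent one touches each time step once and keeps only a fixed-size state, which is where the linear-time inference claim comes from.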

Benchmarking

FLOPs: 30600000000000000000 (3.06e+19)

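Notes: estimated as 6ND, where N is the parameter count (340 × 10^6) and D is the number of training tokens (15 × 10^9): 6 × 340 × 10^6 × 15 × 10^9 = 3.06e+19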
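The estimate in the note can be reproduced with a few lines of Python. This is only a check of the arithmetic under the 6ND rule of thumb quoted above; the function and constant names are made up for this example.

```python
def training_flops(n_params: int, n_tokens: int) -> int:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6 * n_params * n_tokens


N_PARAMS = 340 * 10**6  # 340M parameters, as listed on this page
N_TOKENS = 15 * 10**9   # 15B training tokens, as listed on this page

flops = training_flops(N_PARAMS, N_TOKENS)
print(f"{flops:.2e}")  # prints 3.06e+19
```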

Training

Training Code Accessibility:

https://github.com/sustcsonglin/flash-linear-attention/blob/main/fla/layers/gla.py (MIT license for code; seems like training code)

https://huggingface.co/fla-hub/gla-340M-15B (MIT license)

Size Notes: 15B Tokens

Parameters

Parameters: 340,000,000

Notes: 340M

Authors

Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim


MIT-IBM Watson AI Lab, Massachusetts Institute of Technology (MIT) | GLA Transformer 340M - Capabilities, Benchmarks and Use Cases
