We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
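As a rough illustration of the sequence-to-sequence framing described in the abstract, the sketch below trains a toy encoder-decoder Transformer to map caption tokens to image tokens. It is not the authors' code: the ViT-VQGAN tokenizer is replaced by random token ids, and the vocabulary sizes and model width are illustrative placeholders; only the text length of 128 and the image-token length of 1024 are taken from the paper.

```python
# Minimal sketch of text-to-image as sequence-to-sequence modeling (not the Parti code).
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 32000, 8192   # assumed text vocabulary / image codebook sizes
TEXT_LEN, IMAGE_LEN = 128, 1024         # from the paper: 128 text tokens, 1024 image tokens
D_MODEL = 512                           # toy width; the 20B Parti model uses d_model = 4096

text_embed = nn.Embedding(TEXT_VOCAB, D_MODEL)
image_embed = nn.Embedding(IMAGE_VOCAB, D_MODEL)
seq2seq = nn.Transformer(d_model=D_MODEL, nhead=8,
                         num_encoder_layers=2, num_decoder_layers=2,
                         batch_first=True)
to_logits = nn.Linear(D_MODEL, IMAGE_VOCAB)

# Stand-ins for a tokenized caption and the ViT-VQGAN tokens of its image.
text_ids = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))
image_ids = torch.randint(0, IMAGE_VOCAB, (1, IMAGE_LEN))

# Teacher forcing: predict each image token from the text and the preceding image tokens.
decoder_in, target = image_ids[:, :-1], image_ids[:, 1:]
causal_mask = torch.triu(torch.full((IMAGE_LEN - 1, IMAGE_LEN - 1), float("-inf")), diagonal=1)
hidden = seq2seq(text_embed(text_ids), image_embed(decoder_in), tgt_mask=causal_mask)
loss = F.cross_entropy(to_logits(hidden).transpose(1, 2), target)
print(f"toy training loss: {loss.item():.2f}")
```

At generation time, image tokens would instead be sampled autoregressively from the decoder and passed back through the ViT-VQGAN decoder to reconstruct an image; that stage is omitted here.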
FLOPs: 5.1e+23
Notes: Calculated from the architecture. The estimate covers only the Transformer stack and does not account for the encoding and decoding of text and images. Table 1 gives, for the 20B model: 16 encoder layers, 64 decoder layers, d_model = 4096, d_hidden = 16384, 64 attention heads. Just below Table 1: "We use a maximum length of text tokens of 128, and the length of image tokens are fixed to 1024", so the sequence length is taken as 128 for the encoder stack and 1024 for the decoder stack. Section 3, Training: "a total of 450,000 steps and final ratio of 0.025. We use a global batch size of 8192 during training." Estimate: 6 × 20B parameters × (1024 + 128) tokens per sequence × 450,000 steps × 8192 batch size ≈ 5.096e+23.
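The arithmetic above follows the standard 6ND approximation (training FLOP ≈ 6 × parameters × tokens processed). The short script below simply reproduces that calculation with the values quoted from the paper.

```python
# Sanity check of the compute estimate above using the 6ND approximation.
params = 20e9                       # 20B parameters
tokens_per_example = 128 + 1024     # max text tokens + fixed image-token length
batch_size = 8192                   # global batch size
steps = 450_000                     # training steps

training_tokens = tokens_per_example * batch_size * steps
flops = 6 * params * training_tokens
print(f"{flops:.4g} FLOP")          # ≈ 5.096e+23
```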
Training Code Accessibility: "For these reasons, we have decided not to release our Parti models, code, or data for public use without further safeguards in place" (https://sites.research.google/parti/)
Hardware: Google TPU v4
Parameters: 20,000,000,000 (20B)
Notes: Abstract: "we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters"