We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
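As a rough illustration of the sequence-to-sequence framing described in the abstract, the sketch below trains a toy encoder-decoder Transformer to map caption tokens to image tokens. It is not the authors' code: the ViT-VQGAN tokenizer is replaced by random token ids, and the vocabulary sizes and model width are illustrative placeholders; only the text length of 128 and the image-token length of 1024 are taken from the paper.

```python
# Minimal sketch of text-to-image as sequence-to-sequence modeling (not the Parti code).
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 32000, 8192   # assumed text vocabulary / image codebook sizes
TEXT_LEN, IMAGE_LEN = 128, 1024         # from the paper: 128 text tokens, 1024 image tokens
D_MODEL = 512                           # toy width; the 20B Parti model uses d_model = 4096

text_embed = nn.Embedding(TEXT_VOCAB, D_MODEL)
image_embed = nn.Embedding(IMAGE_VOCAB, D_MODEL)
seq2seq = nn.Transformer(d_model=D_MODEL, nhead=8,
                         num_encoder_layers=2, num_decoder_layers=2,
                         batch_first=True)
to_logits = nn.Linear(D_MODEL, IMAGE_VOCAB)

# Stand-ins for a tokenized caption and the ViT-VQGAN tokens of its image.
text_ids = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))
image_ids = torch.randint(0, IMAGE_VOCAB, (1, IMAGE_LEN))

# Teacher forcing: predict each image token from the text and the preceding image tokens.
decoder_in, target = image_ids[:, :-1], image_ids[:, 1:]
causal_mask = torch.triu(torch.full((IMAGE_LEN - 1, IMAGE_LEN - 1), float("-inf")), diagonal=1)
hidden = seq2seq(text_embed(text_ids), image_embed(decoder_in), tgt_mask=causal_mask)
loss = F.cross_entropy(to_logits(hidden).transpose(1, 2), target)
print(f"toy training loss: {loss.item():.2f}")
```

At generation time, image tokens would instead be sampled autoregressively from the decoder and passed back through the ViT-VQGAN decoder to reconstruct an image; that stage is omitted here.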
FLOPs: 5.1e+23
Notes: Calculated from the architecture. The estimate covers only the Transformer stack and does not account for the encoding and decoding of text and images. Table 1 gives, for the 20B model: 16 encoder layers, 64 decoder layers, d_model = 4096, d_hidden = 16384, 64 attention heads. Just below Table 1: "We use a maximum length of text tokens of 128, and the length of image tokens are fixed to 1024", so the sequence length is taken as 128 for the encoder stack and 1024 for the decoder stack. Section 3, Training: "a total of 450,000 steps and final ratio of 0.025. We use a global batch size of 8192 during training." Estimate: 6 × 20B parameters × (1024 + 128) tokens per sequence × 450,000 steps × 8192 batch size ≈ 5.096e+23.
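The arithmetic above follows the standard 6ND approximation (training FLOP ≈ 6 × parameters × tokens processed). The short script below simply reproduces that calculation with the values quoted from the paper.

```python
# Sanity check of the compute estimate above using the 6ND approximation.
params = 20e9                       # 20B parameters
tokens_per_example = 128 + 1024     # max text tokens + fixed image-token length
batch_size = 8192                   # global batch size
steps = 450_000                     # training steps

training_tokens = tokens_per_example * batch_size * steps
flops = 6 * params * training_tokens
print(f"{flops:.4g} FLOP")          # ≈ 5.096e+23
```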
Training Code Accessibility: "For these reasons, we have decided not to release our Parti models, code, or data for public use without further safeguards in place" (https://sites.research.google/parti/)
Hardware: Google TPU v4
Parameters: 20,000,000,000 (20B)
Notes: Abstract: "we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters"