Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
FLOPs: 4.7e+22
Notes:
Source: https://lair.lighton.ai/akronomicon/ (archived: https://github.com/lightonai/akronomicon/tree/main/akrodb)
VAE training: "on 64 16 GB NVIDIA V100 GPUs, with a per-GPU batch size of 8, resulting in a total batch size of 512. It is trained for a total of 3,000,000 updates."
Transformer training: "We trained the model using 1024, 16 GB NVIDIA V100 GPUs and a total batch size of 1024, for a total of 430,000 updates."; "We concatenate up to 256 BPE-encoded text tokens with the 32 × 32 = 1024 image tokens"
Total tokens: 430,000 steps × 1,024 batch size × 1,280 sequence length = 563,609,600,000
Transformer FLOPs: 6 × 12B parameters × 563,609,600,000 tokens ≈ 4.058e+22
Estimating the VAE's additional compute at ~15% of the transformer's seems reasonable, giving ≈ 4.7e+22 in total.
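For reference, a minimal sketch of the arithmetic behind the 4.7e+22 figure. It uses the standard 6 × parameters × tokens approximation for transformer training compute; the ~15% dVAE overhead is the rough estimate stated in the notes, not a number reported in the paper.

```python
# Back-of-the-envelope training-compute estimate for DALL·E.
# Assumptions: 6*N*D approximation for transformer training FLOPs,
# and ~15% extra compute for dVAE training (estimate, not reported).

def transformer_training_flops(n_params: float, n_tokens: float) -> float:
    """Standard 6 * parameters * tokens approximation for training compute."""
    return 6 * n_params * n_tokens

steps = 430_000             # transformer updates
batch_size = 1_024          # sequences per update
seq_len = 256 + 32 * 32     # text tokens + image tokens = 1280
n_params = 12e9             # 12B-parameter transformer

tokens = steps * batch_size * seq_len                              # 563,609,600,000
transformer_flops = transformer_training_flops(n_params, tokens)   # ~4.06e+22
total_flops = transformer_flops * 1.15                             # add ~15% for the dVAE

print(f"tokens seen:       {tokens:.4e}")
print(f"transformer FLOPs: {transformer_flops:.3e}")
print(f"total FLOPs:       {total_flops:.2e}")                     # ~4.7e+22
```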
Hardware: NVIDIA Tesla V100 DGXS 16 GB
Hardware Quantity: 1024
Size Notes: "To scale up to 12-billion parameters, we created a dataset of a similar scale to JFT-300M (Sun et al., 2017) by collecting 250 million text-image pairs from the internet." Number of epochs: 1,024 batch size × 430,000 updates / 250,000,000 ≈ 1.76
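A small sketch of the epoch arithmetic, assuming every update consumes batch_size distinct training pairs:

```python
# Rough epoch count implied by the figures above: total training examples
# consumed divided by the dataset size (250M text-image pairs).

batch_size = 1_024
updates = 430_000
dataset_size = 250_000_000

examples_seen = batch_size * updates      # 440,320,000 sequences
epochs = examples_seen / dataset_size     # ~1.76

print(f"approximate epochs: {epochs:.2f}")
```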
Parameters: 12,000,000,000
Notes: DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions.