Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
[Blog] [Paper] [Model Card] [Usage]
This is the official PyTorch package for the discrete VAE used for DALL·E. The transformer used to generate the images from the text is not part of this code release.
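Although the transformer itself is not part of this release, the single-stream formulation described in the abstract can be illustrated with a minimal sketch. The vocabulary sizes and token counts below follow the paper; the tensors are random placeholders and `transformer` stands in for the unreleased model.

```python
import torch

# Illustrative sketch only: text BPE tokens and dVAE image tokens are
# concatenated into a single 1280-token stream that a decoder-only
# transformer models autoregressively. Vocabulary sizes (16384 text BPE
# codes, 8192 image codes) and lengths (up to 256 text tokens,
# 32 x 32 = 1024 image tokens) follow the paper; the tensors are random
# placeholders and `transformer` is a stand-in for the unreleased model.
text_tokens = torch.randint(0, 16384, (1, 256))
image_tokens = torch.randint(0, 8192, (1, 1024))

stream = torch.cat([text_tokens, image_tokens], dim=1)  # shape (1, 1280)
# logits = transformer(stream)  # next-token prediction over the combined stream
```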
Before running the example notebook, you will need to install the package using
```
pip install DALL-E
```
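A minimal encode/decode round trip with the released dVAE might look like the sketch below. It follows the repository's usage notebook; `example.png` is a placeholder input, and the CDN URLs are the ones used in that notebook.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T
from dall_e import load_model, map_pixels, unmap_pixels
from PIL import Image

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Pretrained dVAE weights, as referenced in the usage notebook.
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", dev)
dec = load_model("https://cdn.openai.com/dall-e/decoder.pkl", dev)

# Load a 256x256 RGB image and map pixel values into the range the dVAE expects.
img = Image.open("example.png").convert("RGB").resize((256, 256))
x = map_pixels(torch.unsqueeze(T.ToTensor()(img), 0).to(dev))

# Encode to a 32x32 grid of discrete codes, then decode back to pixels.
z_logits = enc(x)
z = torch.argmax(z_logits, dim=1)
z = F.one_hot(z, num_classes=enc.vocab_size).permute(0, 3, 1, 2).float()

x_stats = dec(z).float()
x_rec = unmap_pixels(torch.sigmoid(x_stats[:, :3]))
rec = T.ToPILImage(mode="RGB")(x_rec[0])
```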
Notes:
- Source: https://lair.lighton.ai/akronomicon/ (archived: https://github.com/lightonai/akronomicon/tree/main/akrodb)
- VAE training: "on 64 16 GB NVIDIA V100 GPUs, with a per-GPU batch size of 8, resulting in a total batch size of 512. It is trained for a total of 3,000,000 updates."
- Transformer training: "We trained the model using 1024, 16 GB NVIDIA V100 GPUs and a total batch size of 1024, for a total of 430,000 updates."; "We concatenate up to 256 BPE-encoded text tokens with the 32 × 32 = 1024 image tokens"
- Total tokens: 430,000 updates × 1024 batch size × 1280 sequence length = 563,609,600,000 tokens
- Transformer FLOP: 6 × 12e9 parameters × 563,609,600,000 tokens ≈ 4.057989e+22 (reproduced in the snippet below)
- Estimating the VAE's training compute at ~15% of the transformer's seems reasonable.
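As a quick check, the token and FLOP arithmetic in these notes can be reproduced in a few lines; the ~15% dVAE share is the rough estimate from the notes, not a measured figure.

```python
# Reproduce the compute estimate from the notes above.
params = 12e9              # transformer parameters
batch_size = 1024          # sequences per update
updates = 430_000          # transformer training updates
seq_len = 256 + 32 * 32    # 256 text tokens + 1024 image tokens = 1280

total_tokens = batch_size * updates * seq_len    # 563,609,600,000 tokens
transformer_flop = 6 * params * total_tokens     # ~4.058e+22 FLOP

# Rough adjustment for the dVAE, estimated in the notes at ~15% of the
# transformer's training compute.
total_flop = 1.15 * transformer_flop

print(f"tokens={total_tokens:.6e}  transformer={transformer_flop:.6e}  total={total_flop:.6e}")
```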
Size notes:
- "To scale up to 12-billion parameters, we created a dataset of a similar scale to JFT-300M (Sun et al., 2017) by collecting 250 million text-image pairs from the internet."
- Number of epochs: 1024 batch size × 430,000 updates / 250,000,000 pairs ≈ 1.76 (see the snippet below)
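The epoch estimate follows directly from the figures quoted above:

```python
# Approximate number of passes over the 250M text-image pair dataset.
batch_size = 1024
updates = 430_000
dataset_pairs = 250_000_000

epochs = batch_size * updates / dataset_pairs
print(f"{epochs:.2f} epochs")  # ~1.76
```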
Notes: DALL·E is a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions.