Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
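To make the two-stage factorization concrete, the following is a minimal, illustrative sketch in Python. The Prior and Decoder classes are placeholder stubs with assumed dimensions (EMBED_DIM, IMAGE_SHAPE), not the actual DALL-E 2 models; they only show how generation is split into text -> CLIP image embedding -> image.

# Conceptual sketch of the two-stage unCLIP pipeline (stubs, not the real models).
import numpy as np

EMBED_DIM = 768            # assumed CLIP image-embedding width (illustrative)
IMAGE_SHAPE = (64, 64, 3)  # assumed base decoder resolution before upsampling

class Prior:
    """Stub for p(z_i | y): maps a text caption to a CLIP image embedding."""
    def sample(self, caption: str) -> np.ndarray:
        rng = np.random.default_rng(abs(hash(caption)) % (2**32))
        return rng.standard_normal(EMBED_DIM)

class Decoder:
    """Stub for p(x | z_i): maps a CLIP image embedding to an image."""
    def sample(self, image_embedding: np.ndarray) -> np.ndarray:
        rng = np.random.default_rng()
        return rng.uniform(0.0, 1.0, size=IMAGE_SHAPE)

def generate(caption: str) -> np.ndarray:
    prior, decoder = Prior(), Decoder()
    z_i = prior.sample(caption)      # stage 1: caption -> image embedding
    return decoder.sample(z_i)       # stage 2: image embedding -> image

if __name__ == "__main__":
    img = generate("a corgi playing a flame-throwing trumpet")
    print(img.shape)  # (64, 64, 3)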
Notes: The decoder architecture is similar to Imagen (estimated at 1.46e22 FLOP), but it was trained on 1.6e9 datapoints (Table 3) rather than Imagen's 5.1e9 datapoints. DALL-E 2 experiments with two prior models (autoregressive and diffusion). I estimate the prior model's training compute as 6*N*D = 6 * 1e9 * 4096 * 1e6 ≈ 2.5e19 FLOP. However, this seems low compared to CLIP, so it may be possible to estimate DALL-E 2's compute by analogy to Imagen, but there is a lot of uncertainty and more research would be needed. According to https://arxiv.org/pdf/2407.15811, the DALL-E 2 model was trained for the equivalent of 5208.3 days on 8 A100 GPUs: 3.12e14 FLOP/sec/GPU * 8 GPUs * 5208.3 days * 24 hours/day * 3600 sec/hour * 0.3 (assumed utilization) = 3.3695784e23 FLOP.
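The short Python script below reproduces the two back-of-the-envelope estimates above. All inputs mirror the assumptions stated in the notes (prior parameter count, token count, A100 peak throughput, 30% utilization); none are reported values from OpenAI.

# Reproduction of the two compute estimates in the notes above.

# Estimate 1: prior training compute via C ~= 6*N*D
N_prior = 1e9            # assumed prior parameter count
D_prior = 4096 * 1e6     # assumed training data: 4096 tokens x 1e6 datapoints
prior_flop = 6 * N_prior * D_prior
print(f"Prior estimate (6*N*D): {prior_flop:.2e} FLOP")   # ~2.5e19 FLOP

# Estimate 2: total training compute from the GPU-time figure in arxiv.org/pdf/2407.15811
A100_PEAK_FLOP_PER_SEC = 3.12e14   # A100 peak FP16/BF16 tensor throughput
gpus = 8
gpu_days = 5208.3
utilization = 0.3                  # assumed utilization
total_flop = A100_PEAK_FLOP_PER_SEC * gpus * gpu_days * 24 * 3600 * utilization
print(f"GPU-time estimate: {total_flop:.2e} FLOP")        # ~3.37e23 FLOP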
Size Notes: "When training the encoder, we sample from the CLIP [39] and DALL-E [40] datasets (approximately 650M images in total) with equal probability"
Notes: "Our decoder architecture is the 3.5 billion parameter GLIDE model"