Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
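To make the two-stage factorization concrete, the following is a minimal, illustrative sketch in Python. The Prior and Decoder classes are placeholder stubs with assumed dimensions (EMBED_DIM, IMAGE_SHAPE), not the actual DALL-E 2 models; they only show how generation is split into text -> CLIP image embedding -> image.

# Conceptual sketch of the two-stage unCLIP pipeline (stubs, not the real models).
import numpy as np

EMBED_DIM = 768            # assumed CLIP image-embedding width (illustrative)
IMAGE_SHAPE = (64, 64, 3)  # assumed base decoder resolution before upsampling

class Prior:
    """Stub for p(z_i | y): maps a text caption to a CLIP image embedding."""
    def sample(self, caption: str) -> np.ndarray:
        rng = np.random.default_rng(abs(hash(caption)) % (2**32))
        return rng.standard_normal(EMBED_DIM)

class Decoder:
    """Stub for p(x | z_i): maps a CLIP image embedding to an image."""
    def sample(self, image_embedding: np.ndarray) -> np.ndarray:
        rng = np.random.default_rng()
        return rng.uniform(0.0, 1.0, size=IMAGE_SHAPE)

def generate(caption: str) -> np.ndarray:
    prior, decoder = Prior(), Decoder()
    z_i = prior.sample(caption)      # stage 1: caption -> image embedding
    return decoder.sample(z_i)       # stage 2: image embedding -> image

if __name__ == "__main__":
    img = generate("a corgi playing a flame-throwing trumpet")
    print(img.shape)  # (64, 64, 3)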
Notes: The decoder architecture is similar to Imagen (estimated at 1.46e22 FLOP), but it was trained on 1.6e9 datapoints (Table 3) rather than Imagen's 5.1e9 datapoints. DALL-E 2 experiments with two prior models (autoregressive and diffusion). I estimate the prior model's training compute as 6*N*D = 6 * 1e9 * 4096 * 1e6 ≈ 2.5e19 FLOP. However, this seems low compared to CLIP, so it may be possible to estimate DALL-E 2's compute by analogy to Imagen, but there is a lot of uncertainty and more research would be needed. According to https://arxiv.org/pdf/2407.15811, the DALL-E 2 model was trained for the equivalent of 5208.3 days on 8 A100 GPUs: 3.12e14 FLOP/sec/GPU * 8 GPUs * 5208.3 days * 24 hours/day * 3600 sec/hour * 0.3 (assumed utilization) = 3.3695784e23 FLOP.
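The short Python script below reproduces the two back-of-the-envelope estimates above. All inputs mirror the assumptions stated in the notes (prior parameter count, token count, A100 peak throughput, 30% utilization); none are reported values from OpenAI.

# Reproduction of the two compute estimates in the notes above.

# Estimate 1: prior training compute via C ~= 6*N*D
N_prior = 1e9            # assumed prior parameter count
D_prior = 4096 * 1e6     # assumed training data: 4096 tokens x 1e6 datapoints
prior_flop = 6 * N_prior * D_prior
print(f"Prior estimate (6*N*D): {prior_flop:.2e} FLOP")   # ~2.5e19 FLOP

# Estimate 2: total training compute from the GPU-time figure in arxiv.org/pdf/2407.15811
A100_PEAK_FLOP_PER_SEC = 3.12e14   # A100 peak FP16/BF16 tensor throughput
gpus = 8
gpu_days = 5208.3
utilization = 0.3                  # assumed utilization
total_flop = A100_PEAK_FLOP_PER_SEC * gpus * gpu_days * 24 * 3600 * utilization
print(f"GPU-time estimate: {total_flop:.2e} FLOP")        # ~3.37e23 FLOP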
Size Notes: "When training the encoder, we sample from the CLIP [39] and DALL-E [40] datasets (approximately 650M images in total) with equal probability"
Notes: "Our decoder architecture is the 3.5 billion parameter GLIDE model"