Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
FLOPs: 4.7e+22
Notes:
Source: https://lair.lighton.ai/akronomicon/ (archived: https://github.com/lightonai/akronomicon/tree/main/akrodb)
VAE training: "on 64 16 GB NVIDIA V100 GPUs, with a per-GPU batch size of 8, resulting in a total batch size of 512. It is trained for a total of 3,000,000 updates."
Transformer training: "We trained the model using 1024, 16 GB NVIDIA V100 GPUs and a total batch size of 1024, for a total of 430,000 updates."; "We concatenate up to 256 BPE-encoded text tokens with the 32 × 32 = 1024 image tokens"
Total tokens: 430,000 steps × 1,024 batch size × 1,280 sequence length = 563,609,600,000
Transformer FLOPs: 6 × 12B parameters × 563,609,600,000 tokens ≈ 4.058e+22
Estimating the VAE's additional compute at ~15% of the transformer's seems reasonable, giving ≈ 4.7e+22 in total.
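For reference, a minimal sketch of the arithmetic behind the 4.7e+22 figure. It uses the standard 6 × parameters × tokens approximation for transformer training compute; the ~15% dVAE overhead is the rough estimate stated in the notes, not a number reported in the paper.

```python
# Back-of-the-envelope training-compute estimate for DALL·E.
# Assumptions: 6*N*D approximation for transformer training FLOPs,
# and ~15% extra compute for dVAE training (estimate, not reported).

def transformer_training_flops(n_params: float, n_tokens: float) -> float:
    """Standard 6 * parameters * tokens approximation for training compute."""
    return 6 * n_params * n_tokens

steps = 430_000             # transformer updates
batch_size = 1_024          # sequences per update
seq_len = 256 + 32 * 32     # text tokens + image tokens = 1280
n_params = 12e9             # 12B-parameter transformer

tokens = steps * batch_size * seq_len                              # 563,609,600,000
transformer_flops = transformer_training_flops(n_params, tokens)   # ~4.06e+22
total_flops = transformer_flops * 1.15                             # add ~15% for the dVAE

print(f"tokens seen:       {tokens:.4e}")
print(f"transformer FLOPs: {transformer_flops:.3e}")
print(f"total FLOPs:       {total_flops:.2e}")                     # ~4.7e+22
```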
Hardware: NVIDIA Tesla V100 DGXS 16 GB
Hardware Quantity: 1024
Size Notes: "To scale up to 12-billion parameters, we created a dataset of a similar scale to JFT-300M (Sun et al., 2017) by collecting 250 million text-image pairs from the internet." Number of epochs: 1,024 batch size × 430,000 updates / 250,000,000 ≈ 1.76
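A small sketch of the epoch arithmetic, assuming every update consumes batch_size distinct training pairs:

```python
# Rough epoch count implied by the figures above: total training examples
# consumed divided by the dataset size (250M text-image pairs).

batch_size = 1_024
updates = 430_000
dataset_size = 250_000_000

examples_seen = batch_size * updates      # 440,320,000 sequences
epochs = examples_seen / dataset_size     # ~1.76

print(f"approximate epochs: {epochs:.2f}")
```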
Parameters: 12,000,000,000
Notes: DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions.