Text-to-image generation in the general domain has long been an open problem that requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g., style learning, super-resolution, text-image ranking, and fashion design, as well as methods to stabilize pretraining, e.g., eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work, DALL-E.
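For intuition, here is a minimal sketch of the two-stage recipe the abstract names: a VQ-VAE tokenizer maps an image to a grid of discrete tokens, which are concatenated with text tokens and modeled autoregressively by a single Transformer. All sizes, names, and the vocabulary-offset trick below are illustrative assumptions, not CogView's actual hyperparameters.

```python
import torch

def quantize(z, codebook):
    """VQ-VAE-style lookup: map continuous latents z [N, D] to the ids
    of their nearest codebook entries, returning discrete tokens [N]."""
    dists = torch.cdist(z, codebook)   # [N, K] pairwise L2 distances
    return dists.argmin(dim=-1)        # [N] token ids

# Stage 1: tokenize an image into discrete ids (sizes are illustrative).
codebook = torch.randn(8192, 256)            # K=8192 codes, D=256
image_latents = torch.randn(32 * 32, 256)    # e.g. a 32x32 latent grid
image_tokens = quantize(image_latents, codebook)   # [1024] ids

# Stage 2: concatenate text and image tokens into one sequence and train
# a decoder-only Transformer on next-token prediction over the whole thing.
text_tokens = torch.randint(0, 50000, (64,))        # hypothetical text ids
sequence = torch.cat([text_tokens, image_tokens + 50000])  # shared-vocab offset
```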
Notes: source: https://lair.lighton.ai/akronomicon/ archived: https://github.com/lightonai/akronomicon/tree/main/akrodb
Size Notes: "We collected about 30 million text-image pairs from multiple channels, and built a 2.5TB new dataset (after tokenization, the size becomes about 250GB)." At roughly 5 bytes per word, 250 GB ≈ 50 billion words, or about 67 billion tokens at ~0.75 words per token. So: ~30M text-image pairs and ~50 billion words.
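A quick sanity check of that arithmetic, with the assumptions spelled out (~5 bytes per word is the note's own heuristic; ~0.75 words per token is a common rough conversion, not a figure from the paper):

```python
tokenized_bytes = 250e9          # "about 250GB" after tokenization
bytes_per_word = 5               # rough heuristic from the note
words_per_token = 0.75           # assumed conversion, not from the paper

words = tokenized_bytes / bytes_per_word   # 5.0e10 -> 50 billion words
tokens = words / words_per_token           # ~6.7e10 -> ~67 billion tokens
print(f"{words:.1e} words, {tokens:.1e} tokens")
# 5.0e+10 words, 6.7e+10 tokens
```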
Notes: "We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem."