Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full finetuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
Notes: "iGPT-L was trained for roughly 2500 V100-days" [1]. I assume this refers to the NVIDIA Tesla V100 GPU. According to its datasheet, the V100 delivers 7 to 8.2 TFLOPS of peak double-precision performance, 14 to 16.4 TFLOPS of peak single-precision performance, and 112 to 130 TFLOPS of peak tensor performance [2]. The figure that makes sense to use is peak tensor performance, roughly 125 TFLOPS. Following OpenAI's "AI and Compute" methodology, we apply a 0.33 utilization factor [3]. In total: 2500 V100-days * (24*60*60) seconds/day * 125 TFLOPS * 0.33 = 8.91e+21 FLOP ≈ 89.1 PF-days (using the "AI and Compute" approximation of 1 petaflop/s-day ≈ 1e20 FLOP). [1] https://openai.com/blog/image-gpt/ [2] https://images.nvidia.com/content/technologies/volta/pdf/volta-v100-datasheet-update-us-1165301-r5.pdf [3] https://openai.com/blog/ai-and-compute/
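The arithmetic above can be reproduced with a short script. The constants (2500 V100-days, ~125 TFLOPS peak tensor throughput, 0.33 utilization, 1 PF-day ≈ 1e20 FLOP) are taken directly from the notes, not from any independent measurement:

```python
# Rough reproduction of the iGPT-L training-compute estimate.
# All inputs are the assumptions stated in the notes above.
SECONDS_PER_DAY = 24 * 60 * 60
v100_days = 2500              # "roughly 2500 V100-days"
peak_tensor_flops = 125e12    # ~125 TFLOPS peak tensor performance
utilization = 0.33            # per OpenAI's "AI and Compute" methodology

total_flop = v100_days * SECONDS_PER_DAY * peak_tensor_flops * utilization
print(f"{total_flop:.3g} FLOP")  # ≈ 8.91e+21

# "AI and Compute" treats 1 petaflop/s-day as about 1e20 operations.
pf_days = total_flop / 1e20
print(f"{pf_days:.1f} PF-days")  # ≈ 89.1
```

Note that a strict petaflop/s-day is 1e15 * 86400 ≈ 8.64e19 FLOP, which would give ~103 PF-days; the 89.1 figure follows the rounded 1e20 convention used in the "AI and Compute" post.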
Size Notes: "We use the ImageNet ILSVRC 2012 training dataset, splitting off 4% as our experimental validation set and report results on the ILSVRC 2012 validation set as our test set."
From the challenge page (https://image-net.org/challenges/LSVRC/2012/): "The goal of this competition is to estimate the content of photographs for the purpose of retrieval and automatic annotation using a subset of the large hand-labeled ImageNet dataset (10,000,000 labeled images depicting 10,000+ object categories) as training."
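For concreteness, the 4% split can be worked out from the standard ILSVRC 2012 training-set size of 1,281,167 images (a well-known figure, not stated in the quotes above):

```python
# Size of the 4% experimental validation split described in the notes.
# Assumes the standard ILSVRC 2012 training-set count of 1,281,167 images.
train_total = 1_281_167
val_split = train_total * 4 // 100   # 4% held out (integer arithmetic)
train_split = train_total - val_split
print(train_split, val_split)
```

This gives roughly 51k images held out for experimental validation and ~1.23M used for training.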
Notes: source: https://openai.com/blog/image-gpt/#rfref53