State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards "small". Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.
Notes: Table 6: 153e9 mult-adds. Section 2.4: "minibatches of 8,064 images". Compute = 2 * 3 * mult-adds * dataset size = 2 * 3 * 153e9 * 9525e6 = 8.74e21 FLOP (2 FLOP per mult-add, and a factor of 3 for the forward plus backward pass). Likely trained on V100s, since Facebook had just upgraded their Big Basin GPU cluster to V100s as of March 2018. The previous iteration of Big Basin had 32 clusters of 8xP100s, while Big Basin v2 had 42 clusters of 8xV100s (42 * 8 = 336), which matches the 336 GPUs used in this paper.
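The arithmetic in this note can be reproduced with a short script. This is only a sketch of the estimate as written here, not a measurement: the function and variable names are made up for illustration, the factor of 3 assumes the backward pass costs about twice the forward pass, the image total assumes each Table 3 image is processed once, and the per-GPU batch assumes the 8,064-image minibatch is split evenly across the 336 GPUs.

```python
# Sketch of the training-compute estimate; all names are illustrative.

MULT_ADDS_PER_IMAGE = 153e9                      # Table 6 (forward pass)
TABLE3_IMAGES_MILLIONS = [300, 1925, 300, 7000]  # Table 3 image counts
MINIBATCH_SIZE = 8064                            # Section 2.4

FLOP_PER_MULT_ADD = 2         # one multiply plus one add
FORWARD_BACKWARD_FACTOR = 3   # assumption: backward pass ~ 2x forward pass


def training_flop(mult_adds_per_image: float, images_seen: int) -> float:
    """Estimate total training FLOP from per-image mult-adds."""
    return (FLOP_PER_MULT_ADD * FORWARD_BACKWARD_FACTOR
            * mult_adds_per_image * images_seen)


# Assumption: every image counted in Table 3 is seen exactly once.
images_seen = sum(TABLE3_IMAGES_MILLIONS) * 1_000_000
total_flop = training_flop(MULT_ADDS_PER_IMAGE, images_seen)

# Hardware consistency check: 42 groups of 8 V100s.
gpu_count = 42 * 8
images_per_gpu = MINIBATCH_SIZE / gpu_count  # assumes an even split

print(f"images seen:    {images_seen:.4g}")
print(f"training FLOP:  {total_flop:.3g}")
print(f"GPUs:           {gpu_count}")
print(f"images per GPU: {images_per_gpu:g}")
```

If the number of passes over the data differs from the single pass assumed here, the estimate scales linearly with it.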
Size Notes: Table 3: (300+1925+300+7000) million images = 9,525 million images
Notes: Table 6