PaliGemma is an open Vision-Language Model (VLM) based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that transfers effectively, and it achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks, including standard VLM benchmarks as well as more specialized tasks such as remote sensing and segmentation.
Notes: 197e12 FLOP/s * 256 chips * 102 hours * 3600 s/hour * 0.55 [reported utilization] + 123e12 FLOP/s * 32 chips * 60 hours [for transfer] * 3600 s/hour * 0.55 [reported utilization] ≈ 1.0653e+22 FLOP
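As a sanity check on the arithmetic, here is a minimal Python sketch of the same estimate; the per-chip FLOP/s rates, chip counts, durations, and utilization factor are simply the figures quoted above, not re-derived values.

```python
# Reproduces the training-compute estimate above (figures as quoted in the notes).
def stage_flop(peak_flop_per_s, chips, hours, utilization):
    """Peak FLOP/s per chip x chips x wall-clock seconds x reported utilization."""
    return peak_flop_per_s * chips * hours * 3600 * utilization

pretraining = stage_flop(197e12, 256, 102, 0.55)  # main pretraining run
transfer = stage_flop(123e12, 32, 60, 0.55)       # transfer runs
print(f"{pretraining + transfer:.4e} FLOP")       # ~1.0653e+22 FLOP
```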
Size Notes: "We train Stage1 at resolution 224px (hence, 𝑁img = 256 image tokens) and sequence length 𝑁txt = 128 for a total of 1 billion examples." "For resolution 448, we train for an additional 50 M examples, and for resolution 896, we add another 10 M examples. <..> we also increase the text sequence length to 𝑁txt = 512 tokens." "Stage1 sees slightly less than 350 B tokens, and both Stage2 combined about 90 B tokens." 5189 tokens/second/device (reported) * 256 devices * 102 hours * 3600 s/hour = 487,782,604,800 tokens ≈ 488 B tokens
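The token count works out the same way; a small sketch, assuming the reported per-device throughput held for the full 102-hour, 256-device run:

```python
# Token-throughput estimate from the reported figures above.
tokens_per_second_per_device = 5189  # reported throughput
devices = 256
hours = 102

tokens = tokens_per_second_per_device * devices * hours * 3600
print(f"{tokens:,} tokens (~{tokens / 1e9:.0f} B)")  # 487,782,604,800 tokens (~488 B)
```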
Notes: Model components. Vision encoder: SigLIP-So400m, a 400M-parameter model pretrained with a contrastive objective. Language model: Gemma-2B, a 2B-parameter decoder-only model from the Gemma family of LLMs.
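For a concrete view of how the two components are used together, a minimal inference sketch via the Hugging Face transformers integration of PaliGemma; the checkpoint name, prompt prefix, and image path are illustrative assumptions rather than details from the entry above.

```python
# Minimal PaliGemma inference sketch (assumes the Hugging Face transformers
# integration, transformers >= 4.41, and local access to the checkpoint weights).
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"  # assumed name of the 224px pretrained checkpoint
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)  # SigLIP-So400m encoder + Gemma-2B decoder
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder path: any local RGB image
inputs = processor(text="caption en", images=image, return_tensors="pt")  # pretrained checkpoints expect task prefixes

output = model.generate(**inputs, max_new_tokens=20)
generated = output[0][inputs["input_ids"].shape[-1]:]  # drop the echoed prompt tokens
print(processor.decode(generated, skip_special_tokens=True))
```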