PaliGemma is an open Vision-Language Model (VLM) based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that transfers effectively, and it achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks, including standard VLM benchmarks as well as more specialized tasks such as remote sensing and segmentation.
Notes: 197e12 FLOP/s * 256 chips * 102 hours * 3600 s/hour * 0.55 [reported utilization] + 123e12 FLOP/s * 32 chips * 60 hours [for transfer] * 3600 s/hour * 0.55 [reported utilization] ≈ 1.0653e+22 FLOP
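As a sanity check on the arithmetic, here is a minimal Python sketch of the same estimate; the per-chip FLOP/s rates, chip counts, durations, and utilization factor are simply the figures quoted above, not re-derived values.

```python
# Reproduces the training-compute estimate above (figures as quoted in the notes).
def stage_flop(peak_flop_per_s, chips, hours, utilization):
    """Peak FLOP/s per chip x chips x wall-clock seconds x reported utilization."""
    return peak_flop_per_s * chips * hours * 3600 * utilization

pretraining = stage_flop(197e12, 256, 102, 0.55)  # main pretraining run
transfer = stage_flop(123e12, 32, 60, 0.55)       # transfer runs
print(f"{pretraining + transfer:.4e} FLOP")       # ~1.0653e+22 FLOP
```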
Size Notes: "We train Stage1 at resolution 224px (hence, 𝑁img = 256 image tokens) and sequence length 𝑁txt = 128 for a total of 1 billion examples." "For resolution 448, we train for an additional 50 M examples, and for resolution 896, we add another 10 M examples. <..> we also increase the text sequence length to 𝑁txt = 512 tokens." "Stage1 sees slightly less than 350 B tokens, and both Stage2 combined about 90 B tokens." 5189 tokens/second/device (reported) * 256 devices * 102 hours * 3600 s/hour = 487,782,604,800 tokens ≈ 488 B tokens
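The token count works out the same way; a small sketch, assuming the reported per-device throughput held for the full 102-hour, 256-device run:

```python
# Token-throughput estimate from the reported figures above.
tokens_per_second_per_device = 5189  # reported throughput
devices = 256
hours = 102

tokens = tokens_per_second_per_device * devices * hours * 3600
print(f"{tokens:,} tokens (~{tokens / 1e9:.0f} B)")  # 487,782,604,800 tokens (~488 B)
```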
Notes: Model components. Vision encoder: SigLIP-So400m, a 400M-parameter model pretrained with a contrastive objective. Language model: Gemma-2B, a 2B-parameter decoder-only model from the Gemma family of LLMs.
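For a concrete view of how the two components are used together, a minimal inference sketch via the Hugging Face transformers integration of PaliGemma; the checkpoint name, prompt prefix, and image path are illustrative assumptions rather than details from the entry above.

```python
# Minimal PaliGemma inference sketch (assumes the Hugging Face transformers
# integration, transformers >= 4.41, and local access to the checkpoint weights).
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"  # assumed name of the 224px pretrained checkpoint
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)  # SigLIP-So400m encoder + Gemma-2B decoder
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder path: any local RGB image
inputs = processor(text="caption en", images=image, return_tensors="pt")  # pretrained checkpoints expect task prefixes

output = model.generate(**inputs, max_new_tokens=20)
generated = output[0][inputs["input_ids"].shape[-1]:]  # drop the echoed prompt tokens
print(processor.decode(generated, skip_special_tokens=True))
```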