We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
Notes: 6 FLOP/parameter/token * 1.14e9 parameters * 1.2e13 tokens [see dataset size notes] = 8.208e+22 FLOP
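As a quick sanity check, the estimate above can be reproduced with the standard 6*N*D approximation; this is a minimal sketch, with N and D taken directly from these notes rather than from any external source:

```python
# 6 * N * D approximation for training FLOP.
N = 1.14e9   # parameters (ViT-g scale, see parameter notes below)
D = 1.2e13   # training tokens (see dataset size notes below)

flop = 6 * N * D
print(f"{flop:.3e}")  # 8.208e+22
```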
Size Notes:
"We use the WebLI dataset [10] containing 10 billion images and 12 billion alt-texts covering 109 languages."
"We set the batch size to 32k and use a cosine schedule with 20k warmup steps, training for a total of 40B examples."
"For all model sizes, we set the vision encoder patch size to 16 and the image resolution to 256 (resulting in an image representation sequence length of 256)."
Image tokens: 40 * 10^9 examples * 256 image tokens per example = 1.024e+13 image tokens (~10T)
"We set the text length to 64"
Text tokens: 40 * 10^9 examples * 64 text tokens per example = 2.56e+12 text tokens (~2.6T)
Total: ~1.28e+13 tokens (~12.8T), rounded to 12T in the compute estimate above.
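The token counts can be reproduced from the quoted configuration; this is a minimal sketch of the same arithmetic, assuming 40B examples with 256 image tokens and 64 text tokens each:

```python
examples = 40e9      # "training for a total of 40B examples"
image_seq_len = 256  # 256px images, patch size 16 -> (256/16)^2 = 256 tokens
text_len = 64        # "We set the text length to 64"

image_tokens = examples * image_seq_len    # 1.024e+13 (~10T)
text_tokens = examples * text_len          # 2.56e+12  (~2.6T)
total_tokens = image_tokens + text_tokens  # 1.28e+13  (~12.8T, rounded to 12T above)
print(f"{image_tokens:.4g} {text_tokens:.3g} {total_tokens:.3g}")
```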
Notes: 1B parameters (ViT-g, per the abstract; the compute estimate above uses 1.14e9).