SigLIP SO400M
| Property | Value |
|---|---|
| Parameter Count | 877M |
| License | Apache 2.0 |
| Training Data | WebLI dataset |
| Architecture | SoViT-400m with 14x14 patches |
| Resolution | 224x224 |
What is siglip-so400m-patch14-224?
SigLIP SO400M is a shape-optimized vision transformer that implements an improved variant of the CLIP architecture. Developed by Google, it replaces CLIP's softmax-based contrastive objective with a sigmoid loss over image-text pairs, enabling better scaling and performance than traditional CLIP models. The model was trained on the WebLI dataset using 16 TPU-v4 chips over three days.
Implementation Details
The model processes images at 224x224 resolution, normalizing each RGB channel with a mean and standard deviation of 0.5. Text inputs are tokenized and padded to a fixed length of 64 tokens. The architecture follows the SoViT-400m design, derived through compute-optimal model scaling; a usage sketch follows the feature list below.
- Shape-optimized Vision Transformer architecture
- Sigmoid loss for improved batch processing
- Pre-trained on the large-scale WebLI dataset
- Efficient 14x14 patch processing
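These preprocessing choices map onto the standard Hugging Face `transformers` interface. The snippet below is a minimal sketch, assuming the hub checkpoint id `google/siglip-so400m-patch14-224` and a placeholder local image file:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "google/siglip-so400m-patch14-224"  # assumed hub id for this checkpoint
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

image = Image.open("example.jpg")  # placeholder image path
texts = ["a photo of a cat", "a photo of a dog"]

# The processor resizes to 224x224, normalizes each RGB channel with
# mean 0.5 / std 0.5, and pads text to the 64-token context length
# when padding="max_length" is requested.
inputs = processor(text=texts, images=image,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# SigLIP scores each image-text pair with a sigmoid rather than a softmax.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)
```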
Core Capabilities
- Zero-shot image classification (see the pipeline sketch after this list)
- Image-text retrieval
- Multimodal understanding
- Flexible batch size processing
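For the zero-shot classification capability listed above, the `transformers` pipeline API offers a quick path. A short sketch, under the same checkpoint-id assumption:

```python
from transformers import pipeline

# Assumes the hub id google/siglip-so400m-patch14-224 for this checkpoint.
classifier = pipeline("zero-shot-image-classification",
                      model="google/siglip-so400m-patch14-224")

# Labels are scored independently by the sigmoid head, so the returned
# scores need not sum to 1.
results = classifier("example.jpg",  # placeholder image path
                     candidate_labels=["a cat", "a dog", "a bird"])
print(results)
```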
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its sigmoid loss function, which processes image-text pairs without requiring global similarity normalization, enabling better scaling and performance at various batch sizes.
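As a rough illustration of that idea (not the reference implementation), a pairwise sigmoid loss over a batch of matched image-text pairs can be sketched as follows; the function name is illustrative, and the batch normalization and temperature/bias initialization follow the SigLIP paper's description in simplified form:

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, temperature, bias):
    """Pairwise sigmoid loss for a batch of matched image-text pairs.

    img_emb, txt_emb: (batch, dim) L2-normalized embeddings where row i of
    each tensor forms a positive pair; every other combination is a negative.
    temperature and bias are learnable scalars.
    """
    logits = temperature * img_emb @ txt_emb.t() + bias
    batch = img_emb.size(0)
    # +1 on the diagonal (matched pairs), -1 everywhere else.
    labels = 2.0 * torch.eye(batch, device=logits.device) - 1.0
    # Each pair is scored independently: no softmax over the batch, hence
    # no global normalization of similarities.
    return -F.logsigmoid(labels * logits).sum(dim=1).mean()

# Toy usage with random embeddings (illustrative only); the paper initializes
# the temperature to 10 and the bias to -10.
img = F.normalize(torch.randn(8, 16), dim=-1)
txt = F.normalize(torch.randn(8, 16), dim=-1)
loss = sigmoid_contrastive_loss(img, txt,
                                temperature=torch.tensor(10.0),
                                bias=torch.tensor(-10.0))
print(loss)
```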
Q: What are the recommended use cases?
The model excels in zero-shot image classification and image-text retrieval tasks. It's particularly suitable for applications requiring flexible batch processing and robust multimodal understanding without extensive task-specific training.
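For image-text retrieval, a common pattern is to embed images and text separately and rank by cosine similarity. A minimal sketch, assuming `SiglipModel` exposes `get_image_features`/`get_text_features` (as in recent `transformers` releases) and using placeholder file names:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "google/siglip-so400m-patch14-224"  # assumed hub id
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

# Embed a small image gallery and a text query, then rank by cosine
# similarity of the L2-normalized embeddings.
images = [Image.open(p) for p in ["a.jpg", "b.jpg", "c.jpg"]]  # placeholders
query = ["a photo of a red bicycle"]

with torch.no_grad():
    img_emb = model.get_image_features(
        **processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(
        **processor(text=query, padding="max_length", return_tensors="pt"))

img_emb = F.normalize(img_emb, dim=-1)
txt_emb = F.normalize(txt_emb, dim=-1)
scores = (txt_emb @ img_emb.t()).squeeze(0)
print(scores.argsort(descending=True))  # gallery indices, best match first
```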