SigLIP SO400M
| Property | Value |
|---|---|
| Parameter Count | 877M |
| License | Apache 2.0 |
| Training Data | WebLI dataset |
| Architecture | SoViT-400m with 14x14 patches |
| Resolution | 224x224 |
What is siglip-so400m-patch14-224?
SigLIP SO400M is a shape-optimized vision transformer that implements an improved variant of the CLIP architecture. Developed by Google, it replaces CLIP's softmax-based contrastive objective with a sigmoid loss over image-text pairs, enabling better scaling and performance than traditional CLIP models. The model was trained on the WebLI dataset using 16 TPU-v4 chips over three days.
Implementation Details
The model processes images at 224x224 resolution, normalizing each RGB channel with a mean and standard deviation of 0.5. Text inputs are tokenized and padded to a fixed length of 64 tokens. The architecture follows the SoViT-400m design, derived through compute-optimal model scaling; a usage sketch follows the feature list below.
- Shape-optimized Vision Transformer architecture
- Sigmoid loss for improved batch processing
- Pre-trained on the large-scale WebLI dataset
- Efficient 14x14 patch processing
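These preprocessing choices map onto the standard Hugging Face `transformers` interface. The snippet below is a minimal sketch, assuming the hub checkpoint id `google/siglip-so400m-patch14-224` and a placeholder local image file:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "google/siglip-so400m-patch14-224"  # assumed hub id for this checkpoint
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

image = Image.open("example.jpg")  # placeholder image path
texts = ["a photo of a cat", "a photo of a dog"]

# The processor resizes to 224x224, normalizes each RGB channel with
# mean 0.5 / std 0.5, and pads text to the 64-token context length
# when padding="max_length" is requested.
inputs = processor(text=texts, images=image,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# SigLIP scores each image-text pair with a sigmoid rather than a softmax.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)
```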
Core Capabilities
- Zero-shot image classification (see the pipeline sketch after this list)
- Image-text retrieval
- Multimodal understanding
- Flexible batch size processing
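For the zero-shot classification capability listed above, the `transformers` pipeline API offers a quick path. A short sketch, under the same checkpoint-id assumption:

```python
from transformers import pipeline

# Assumes the hub id google/siglip-so400m-patch14-224 for this checkpoint.
classifier = pipeline("zero-shot-image-classification",
                      model="google/siglip-so400m-patch14-224")

# Labels are scored independently by the sigmoid head, so the returned
# scores need not sum to 1.
results = classifier("example.jpg",  # placeholder image path
                     candidate_labels=["a cat", "a dog", "a bird"])
print(results)
```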
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its sigmoid loss function, which processes image-text pairs without requiring global similarity normalization, enabling better scaling and performance at various batch sizes.
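As a rough illustration of that idea (not the reference implementation), a pairwise sigmoid loss over a batch of matched image-text pairs can be sketched as follows; the function name is illustrative, and the batch normalization and temperature/bias initialization follow the SigLIP paper's description in simplified form:

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, temperature, bias):
    """Pairwise sigmoid loss for a batch of matched image-text pairs.

    img_emb, txt_emb: (batch, dim) L2-normalized embeddings where row i of
    each tensor forms a positive pair; every other combination is a negative.
    temperature and bias are learnable scalars.
    """
    logits = temperature * img_emb @ txt_emb.t() + bias
    batch = img_emb.size(0)
    # +1 on the diagonal (matched pairs), -1 everywhere else.
    labels = 2.0 * torch.eye(batch, device=logits.device) - 1.0
    # Each pair is scored independently: no softmax over the batch, hence
    # no global normalization of similarities.
    return -F.logsigmoid(labels * logits).sum(dim=1).mean()

# Toy usage with random embeddings (illustrative only); the paper initializes
# the temperature to 10 and the bias to -10.
img = F.normalize(torch.randn(8, 16), dim=-1)
txt = F.normalize(torch.randn(8, 16), dim=-1)
loss = sigmoid_contrastive_loss(img, txt,
                                temperature=torch.tensor(10.0),
                                bias=torch.tensor(-10.0))
print(loss)
```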
Q: What are the recommended use cases?
The model excels in zero-shot image classification and image-text retrieval tasks. It's particularly suitable for applications requiring flexible batch processing and robust multimodal understanding without extensive task-specific training.
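For image-text retrieval, a common pattern is to embed images and text separately and rank by cosine similarity. A minimal sketch, assuming `SiglipModel` exposes `get_image_features`/`get_text_features` (as in recent `transformers` releases) and using placeholder file names:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "google/siglip-so400m-patch14-224"  # assumed hub id
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

# Embed a small image gallery and a text query, then rank by cosine
# similarity of the L2-normalized embeddings.
images = [Image.open(p) for p in ["a.jpg", "b.jpg", "c.jpg"]]  # placeholders
query = ["a photo of a red bicycle"]

with torch.no_grad():
    img_emb = model.get_image_features(
        **processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(
        **processor(text=query, padding="max_length", return_tensors="pt"))

img_emb = F.normalize(img_emb, dim=-1)
txt_emb = F.normalize(txt_emb, dim=-1)
scores = (txt_emb @ img_emb.t()).squeeze(0)
print(scores.argsort(descending=True))  # gallery indices, best match first
```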