SigLIP Large Patch16-256
| Property | Value |
|---|---|
| Parameter Count | 652M |
| License | Apache 2.0 |
| Paper | Sigmoid Loss for Language Image Pre-Training |
| Training Data | WebLI Dataset |
| Resolution | 256x256 |
What is siglip-large-patch16-256?
SigLIP (Sigmoid Loss for Language Image Pre-Training) is a vision-language model that keeps CLIP's dual-encoder architecture but replaces the softmax-based contrastive loss with a pairwise sigmoid loss. This large variant has 652M parameters, operates on 256x256 resolution images, and delivers strong performance in zero-shot image classification and image-text retrieval.
Implementation Details
The model was trained on the WebLI dataset using 16 TPU-v4 chips over three days. It processes images by resizing them to 256x256 resolution and normalizing them across RGB channels with mean and standard deviation of 0.5. Text inputs are tokenized and padded to 64 tokens.
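For concreteness, here is a minimal sketch of that preprocessing and scoring flow using the Hugging Face transformers library and the google/siglip-large-patch16-256 checkpoint; the image URL and text prompts are placeholders.

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-large-patch16-256")
processor = AutoProcessor.from_pretrained("google/siglip-large-patch16-256")

# Placeholder image; any RGB image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
# The processor resizes the image to 256x256, normalizes with mean/std 0.5,
# and pads the text to 64 tokens (padding="max_length"), as described above.
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Pairwise image-text logits; SigLIP applies a sigmoid rather than a softmax.
probs = torch.sigmoid(outputs.logits_per_image)
print(f"{probs[0][0]:.1%} probability that the image matches '{texts[0]}'")
```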
- Utilizes a transformer-based architecture
- Implements sigmoid loss for improved scaling and performance
- Supports F32 tensor operations
- Optimized for 256x256 image resolution
Core Capabilities
- Zero-shot image classification
- Image-text similarity scoring
- Multimodal understanding
- Efficient batch processing
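As a sketch of the zero-shot classification capability, the transformers zero-shot-image-classification pipeline can wrap the same checkpoint; the labels below are illustrative.

```python
from transformers import pipeline

# Zero-shot image classification with candidate labels supplied at inference time.
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip-large-patch16-256",
)
results = classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",  # placeholder image URL
    candidate_labels=["2 cats", "a plane", "a remote control"],
)
print(results)  # list of {"score": ..., "label": ...} entries
```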
Frequently Asked Questions
Q: What makes this model unique?
SigLIP's key innovation lies in its sigmoid loss function, which operates directly on image-text pairs without requiring global similarity normalization. This enables better scaling with batch sizes while maintaining strong performance even at smaller batch sizes.
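To illustrate, here is a hedged sketch of the pairwise sigmoid loss, following the pseudocode in the SigLIP paper; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def sigmoid_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                 t: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise sigmoid loss over a batch of n image-text pairs.

    img_emb, txt_emb: (n, d) L2-normalized embeddings.
    t, b: learnable temperature and bias scalars.
    """
    logits = img_emb @ txt_emb.t() * t + b                # (n, n) pairwise logits
    n = logits.size(0)
    labels = 2 * torch.eye(n, device=logits.device) - 1   # +1 on the diagonal, -1 elsewhere
    # Each pair is scored independently with a sigmoid; no global
    # softmax normalization over the batch is required.
    return -F.logsigmoid(labels * logits).sum() / n
```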
Q: What are the recommended use cases?
The model excels in zero-shot image classification, image-text retrieval, and general visual understanding tasks. It's particularly useful for applications requiring robust image-text matching without task-specific training data.
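For image-text retrieval specifically, embeddings can also be computed separately and ranked by similarity; a brief sketch, again assuming the same checkpoint and with illustrative captions:

```python
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-large-patch16-256")
processor = AutoProcessor.from_pretrained("google/siglip-large-patch16-256")

captions = ["a cat sleeping on a couch", "a dog catching a frisbee"]
text_inputs = processor(text=captions, padding="max_length", return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
    # image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))

# Normalize, then rank candidates by cosine similarity (image_emb @ text_emb.T).
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
```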