SigLIP Large Patch16-384
| Property | Value |
|---|---|
| Parameter Count | 652M |
| License | Apache 2.0 |
| Training Data | WebLI dataset |
| Resolution | 384x384 |
| Paper | Sigmoid Loss for Language Image Pre-Training |
What is siglip-large-patch16-384?
SigLIP is a vision-language model that keeps CLIP's dual-encoder design but replaces the softmax-based contrastive loss with a pairwise sigmoid loss. This large variant, trained at 384x384 resolution, is particularly strong at zero-shot image classification and image-text retrieval.
Implementation Details
The model was trained on the WebLI dataset using 16 TPU-v4 chips over three days. Images are resized to 384x384 resolution and normalized across the RGB channels (mean 0.5, std 0.5), and text inputs are tokenized and padded to 64 tokens; a preprocessing sketch follows the feature list below.
- Improved loss function that doesn't require global similarity normalization
- Supports larger batch sizes while maintaining performance
- Processes both image and text inputs for multimodal understanding
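A minimal sketch of this input pipeline, assuming the transformers library and the google/siglip-large-patch16-384 checkpoint on the Hugging Face Hub (the image path and prompts are placeholders):

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/siglip-large-patch16-384")

image = Image.open("example.jpg").convert("RGB")   # placeholder local image
texts = ["a photo of a cat", "a photo of a dog"]

# Images are resized to 384x384 and normalized (mean 0.5, std 0.5);
# texts are tokenized and padded to 64 tokens with padding="max_length".
inputs = processor(text=texts, images=image,
                   padding="max_length", return_tensors="pt")
print(inputs["pixel_values"].shape)   # torch.Size([1, 3, 384, 384])
print(inputs["input_ids"].shape)      # torch.Size([2, 64])
```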
Core Capabilities
- Zero-shot image classification (see the example after this list)
- Image-text retrieval
- Multimodal understanding with high accuracy
- Efficient processing of high-resolution images
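Zero-shot classification, for instance, can be run end to end through the transformers pipeline API. A short sketch, assuming the google/siglip-large-patch16-384 checkpoint and placeholder image path and labels:

```python
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip-large-patch16-384",
)

# Placeholder image path and candidate labels
results = classifier(
    "example.jpg",
    candidate_labels=["a photo of a cat", "a photo of a dog"],
)
print(results)   # list of {"score": ..., "label": ...} entries, highest score first
```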
Frequently Asked Questions
Q: What makes this model unique?
SigLIP's key innovation lies in its sigmoid loss function, which operates directly on image-text pairs without requiring global normalization. This allows for better scaling and improved performance even with smaller batch sizes compared to traditional CLIP models.
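For intuition, here is a toy PyTorch sketch of that pairwise sigmoid loss, based on the paper's formulation rather than the actual training code; img_emb and txt_emb are assumed to be L2-normalized image and text embeddings of shape (batch, dim), and t and b are the learnable temperature and bias.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, t, b):
    # Pairwise similarities for every image-text combination in the batch
    logits = img_emb @ txt_emb.T * t + b
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # Each pair is scored independently with a log-sigmoid, so no softmax
    # over the whole batch (i.e. no global normalization) is required
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```

Because every pair is treated as an independent binary decision, the loss can be accumulated over chunks of the batch, which is what makes very large batch sizes tractable.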
Q: What are the recommended use cases?
The model excels at zero-shot image classification and image-text retrieval tasks. It's particularly useful for applications requiring high-resolution image understanding (384x384) and flexible deployment scenarios where batch size optimization is crucial.
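As a closing illustration, the sketch below ranks a few candidate captions against a single image, assuming the google/siglip-large-patch16-384 checkpoint and placeholder image path and captions. Because each image-text pair is scored with an independent sigmoid, the resulting probabilities are not forced to sum to 1.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-large-patch16-384")
processor = AutoProcessor.from_pretrained("google/siglip-large-patch16-384")

image = Image.open("example.jpg").convert("RGB")   # placeholder image
captions = ["two cats on a couch", "a plane on a runway", "a bowl of soup"]

inputs = processor(text=captions, images=image,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_texts); sigmoid turns each
# pair score into an independent match probability
scores = torch.sigmoid(outputs.logits_per_image)[0]
for caption, score in sorted(zip(captions, scores.tolist()),
                             key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {caption}")
```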