paligemma-siglip-so400m-patch14-448
| Property | Value |
|---|---|
| Architecture | SoViT-400m |
| Training Data | WebLI Dataset |
| Resolution | 448x448 pixels |
| Paper | Sigmoid Loss for Language Image Pre-Training |
| Training Infrastructure | 16 TPU-v4 chips (3 days) |
What is paligemma-siglip-so400m-patch14-448?
This model is an implementation of SigLIP (Sigmoid Loss for Language Image Pre-Training) with a shape-optimized backbone. It departs from traditional CLIP models by replacing the batch-wide softmax contrastive loss with a sigmoid loss that scores each image-text pair independently, which is both simpler and more memory-efficient. The model uses the SoViT-400m architecture, a compute-optimal shape derived in the companion paper "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design".
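The SigLIP paper publishes pseudocode for this loss; the sketch below is a minimal PyTorch rendering of it. The function name is ours, but the formulation (learnable log-temperature t' and bias b, labels of +1 on the diagonal and -1 elsewhere) follows the paper:

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                t_prime: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise sigmoid loss, following the pseudocode in the SigLIP paper.

    img_emb, txt_emb: L2-normalized embeddings, shape [n, d].
    t_prime: learnable log-temperature (scalar tensor).
    b: learnable bias (scalar tensor).
    """
    n = img_emb.shape[0]
    logits = img_emb @ txt_emb.T * t_prime.exp() + b
    # +1 on the diagonal (matching pairs), -1 everywhere else.
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0
    # Each of the n*n pairs is an independent binary classification,
    # so no batch-wide softmax normalization is needed.
    return -F.logsigmoid(labels * logits).sum() / n
```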
Implementation Details
The model processes images at 448x448 resolution with RGB normalization (mean 0.5, std 0.5 per channel). Text inputs are tokenized and padded to a fixed length of 64 tokens. The sigmoid loss operates directly on individual image-text pairs, so no batch-wide similarity normalization (as in CLIP's softmax contrastive loss) is required.
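As a concrete illustration, here is a usage sketch with Hugging Face transformers. The checkpoint id and image path are placeholders (the exact hub id is an assumption guessed from the model name); `padding="max_length"` matters because the model was trained with text padded to 64 tokens:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Placeholder hub id -- substitute the actual checkpoint id if it differs.
ckpt = "google/siglip-so400m-patch14-448"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")  # placeholder path
texts = ["a photo of a cat", "a photo of a dog"]

# The processor resizes to 448x448, normalizes with mean/std 0.5, and
# (with padding="max_length") pads text to the 64-token length used in training.
inputs = processor(text=texts, images=image,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid instead of softmax: each image-text pair is scored independently.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)  # shape [1, 2]
```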
- Shape-optimized architecture (SoViT-400m)
- Efficient sigmoid loss function
- Pre-trained on WebLI dataset
- Patch size of 14x14 pixels
Core Capabilities
- Zero-shot image classification
- Image-text retrieval (see the embedding sketch after this list)
- Efficient batch processing, since the sigmoid loss needs no batch-wide normalization
- Matches or exceeds softmax-based contrastive training at both small and large batch sizes, with the largest gains at small batches
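For retrieval, the image and text towers can be used independently to embed a gallery once and rank it by similarity. A minimal sketch, again assuming the same placeholder checkpoint id and file paths:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "google/siglip-so400m-patch14-448"  # placeholder hub id (assumption)
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

captions = ["a red bicycle", "a bowl of ramen", "a snowy mountain"]
images = [Image.open(p) for p in ["a.jpg", "b.jpg", "c.jpg"]]  # placeholder paths

with torch.no_grad():
    txt = model.get_text_features(
        **processor(text=captions, padding="max_length", return_tensors="pt"))
    img = model.get_image_features(
        **processor(images=images, return_tensors="pt"))

# L2-normalize, then rank images for each caption by cosine similarity.
txt = txt / txt.norm(dim=-1, keepdim=True)
img = img / img.norm(dim=-1, keepdim=True)
ranking = (txt @ img.T).argsort(dim=-1, descending=True)
print(ranking)  # row i: image indices best-matching caption i
```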
Frequently Asked Questions
Q: What makes this model unique?
Its distinguishing features are the sigmoid loss and the shape-optimized SoViT-400m architecture, which together enable better scaling and performance than traditional CLIP models while remaining efficient across batch sizes.
Q: What are the recommended use cases?
The model excels at zero-shot image classification and image-text retrieval. It is particularly suited to applications that must process image-text pairs efficiently without task-specific fine-tuning.