paligemma-siglip-so400m-patch14-448
| Property | Value |
|---|---|
| Architecture | SoViT-400m |
| Training Data | WebLI Dataset |
| Resolution | 448x448 pixels |
| Paper | Sigmoid Loss for Language Image Pre-Training |
| Training Infrastructure | 16 TPU-v4 chips (3 days) |
What is paligemma-siglip-so400m-patch14-448?
This model is an implementation of SigLIP (Sigmoid Loss for Language Image Pre-Training) with a shape-optimized backbone. It departs from traditional CLIP models by replacing the batch-wide softmax contrastive loss with a sigmoid loss that scores each image-text pair independently, which is both simpler and more memory-efficient. The model uses the SoViT-400m architecture, a compute-optimal shape derived in the companion paper "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design".
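The SigLIP paper publishes pseudocode for this loss; the sketch below is a minimal PyTorch rendering of it. The function name is ours, but the formulation (learnable log-temperature t' and bias b, labels of +1 on the diagonal and -1 elsewhere) follows the paper:

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                t_prime: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise sigmoid loss, following the pseudocode in the SigLIP paper.

    img_emb, txt_emb: L2-normalized embeddings, shape [n, d].
    t_prime: learnable log-temperature (scalar tensor).
    b: learnable bias (scalar tensor).
    """
    n = img_emb.shape[0]
    logits = img_emb @ txt_emb.T * t_prime.exp() + b
    # +1 on the diagonal (matching pairs), -1 everywhere else.
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0
    # Each of the n*n pairs is an independent binary classification,
    # so no batch-wide softmax normalization is needed.
    return -F.logsigmoid(labels * logits).sum() / n
```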
Implementation Details
The model processes images at 448x448 resolution with RGB normalization (mean 0.5, std 0.5 per channel). Text inputs are tokenized and padded to a fixed length of 64 tokens. The sigmoid loss operates directly on individual image-text pairs, so no batch-wide similarity normalization (as in CLIP's softmax contrastive loss) is required.
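As a concrete illustration, here is a usage sketch with Hugging Face transformers. The checkpoint id and image path are placeholders (the exact hub id is an assumption guessed from the model name); `padding="max_length"` matters because the model was trained with text padded to 64 tokens:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Placeholder hub id -- substitute the actual checkpoint id if it differs.
ckpt = "google/siglip-so400m-patch14-448"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")  # placeholder path
texts = ["a photo of a cat", "a photo of a dog"]

# The processor resizes to 448x448, normalizes with mean/std 0.5, and
# (with padding="max_length") pads text to the 64-token length used in training.
inputs = processor(text=texts, images=image,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid instead of softmax: each image-text pair is scored independently.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)  # shape [1, 2]
```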
- Shape-optimized architecture (SoViT-400m)
- Efficient sigmoid loss function
- Pre-trained on WebLI dataset
- Patch size of 14x14 pixels
Core Capabilities
- Zero-shot image classification
- Image-text retrieval (see the embedding sketch after this list)
- Efficient batch processing, since the sigmoid loss needs no batch-wide normalization
- Matches or exceeds softmax-based contrastive training at both small and large batch sizes, with the largest gains at small batches
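For retrieval, the image and text towers can be used independently to embed a gallery once and rank it by similarity. A minimal sketch, again assuming the same placeholder checkpoint id and file paths:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "google/siglip-so400m-patch14-448"  # placeholder hub id (assumption)
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

captions = ["a red bicycle", "a bowl of ramen", "a snowy mountain"]
images = [Image.open(p) for p in ["a.jpg", "b.jpg", "c.jpg"]]  # placeholder paths

with torch.no_grad():
    txt = model.get_text_features(
        **processor(text=captions, padding="max_length", return_tensors="pt"))
    img = model.get_image_features(
        **processor(images=images, return_tensors="pt"))

# L2-normalize, then rank images for each caption by cosine similarity.
txt = txt / txt.norm(dim=-1, keepdim=True)
img = img / img.norm(dim=-1, keepdim=True)
ranking = (txt @ img.T).argsort(dim=-1, descending=True)
print(ranking)  # row i: image indices best-matching caption i
```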
Frequently Asked Questions
Q: What makes this model unique?
Its distinguishing features are the sigmoid loss and the shape-optimized SoViT-400m architecture, which together enable better scaling and performance than traditional CLIP models while remaining efficient across batch sizes.
Q: What are the recommended use cases?
The model excels at zero-shot image classification and image-text retrieval. It is particularly suited to applications that must process image-text pairs efficiently without task-specific fine-tuning.