siglip-base-patch16-256-multilingual

Maintained By
google

SigLIP Base Multilingual Vision-Language Model

Parameter Count: 371M
License: Apache 2.0
Resolution: 256x256
Paper: Sigmoid Loss for Language Image Pre-Training
Training Data: WebLI Dataset

What is siglip-base-patch16-256-multilingual?

SigLIP is a multimodal image-text model that keeps CLIP's dual-encoder architecture but replaces the softmax contrastive objective with a pairwise sigmoid loss. This multilingual version supports cross-language image classification and retrieval tasks and was trained on the WebLI dataset without filtering out non-English text.

Implementation Details

The model processes images at 256x256 resolution with RGB normalization (mean 0.5, std 0.5 per channel). Text inputs are tokenized and padded to a maximum length of 64 tokens, keeping inference lightweight for real-world applications (see the preprocessing sketch after this list). Training was conducted on 16 TPU-v4 chips over three days, resulting in a model that excels at zero-shot classification tasks.

  • Optimized sigmoid loss function for better scaling
  • Multilingual support for diverse applications
  • Efficient 256x256 resolution processing
  • 64-token text processing capacity
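A minimal sketch of this preprocessing, assuming the Hugging Face transformers AutoProcessor for this checkpoint; the example texts and image URL are illustrative only:

```python
from PIL import Image
import requests
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Images are resized to 256x256 and normalized with mean 0.5, std 0.5 per channel;
# text is tokenized and padded to the 64-token maximum used during training.
inputs = processor(
    text=["una foto de un gato", "a photo of a dog"],
    images=image,
    padding="max_length",
    return_tensors="pt",
)
print(inputs["pixel_values"].shape)  # (1, 3, 256, 256)
print(inputs["input_ids"].shape)     # (2, 64)
```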

Core Capabilities

  • Zero-shot image classification across languages (see the example after this list)
  • Image-text retrieval and matching
  • Batch processing optimization
  • Multilingual understanding and classification
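As a concrete example of the zero-shot classification and image-text matching listed above, the following sketch scores one image against candidate captions and converts the pairwise logits into probabilities with a sigmoid rather than a softmax; it assumes the transformers AutoModel API, and the captions and image URL are illustrative:

```python
import torch
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-base-patch16-256-multilingual")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate captions can be written in different languages.
texts = ["a photo of 2 cats", "ein Foto von einem Flugzeug"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Each logit is an independent image-text match score; apply a sigmoid,
# not a softmax, to turn it into a probability.
probs = torch.sigmoid(outputs.logits_per_image)
for text, p in zip(texts, probs[0]):
    print(f"{p.item():.1%} that the image matches '{text}'")
```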

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its sigmoid loss function that operates directly on image-text pairs without requiring global similarity normalization, enabling better scaling and improved performance even with smaller batch sizes.
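A schematic PyTorch re-implementation of that pairwise sigmoid loss (a sketch, not the authors' code): every image-text pair is treated as an independent binary classification, so no batch-wide normalization is needed.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(image_embeds, text_embeds, logit_scale, logit_bias):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings."""
    # Similarity of every image with every text in the batch.
    logits = logit_scale * image_embeds @ text_embeds.t() + logit_bias
    n = logits.size(0)
    # +1 on the diagonal (matched pairs), -1 elsewhere (mismatched pairs).
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0
    # Independent binary loss per pair; no softmax across the batch.
    return -F.logsigmoid(labels * logits).sum() / n
```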

Q: What are the recommended use cases?

This model excels in zero-shot image classification, multilingual image-text retrieval, and general visual understanding tasks. It's particularly useful for applications requiring cross-lingual image classification without additional training.
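For the cross-lingual classification use case, a convenient route (assuming a recent transformers release in which the zero-shot-image-classification pipeline supports SigLIP checkpoints) is the pipeline API; the labels and image URL below are illustrative:

```python
from PIL import Image
import requests
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip-base-patch16-256-multilingual",
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate labels may mix languages; no additional training is required.
results = classifier(image, candidate_labels=["deux chats", "a dog", "ein Flugzeug"])
print(results)
```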
