siglip-large-patch16-256

Maintained by: google

SigLIP Large Patch16-256

Parameter Count: 652M
License: Apache 2.0
Paper: Sigmoid Loss for Language Image Pre-Training
Training Data: WebLI Dataset
Resolution: 256x256

What is siglip-large-patch16-256?

SigLIP (Sigmoid Loss for Language Image Pre-Training) is a vision-language model that keeps CLIP's dual-encoder architecture while replacing the softmax contrastive objective with a pairwise sigmoid loss. This large variant, with 652M parameters, is trained for 256x256 input resolution and is intended for zero-shot image classification and image-text retrieval, where the paper reports stronger results than comparable CLIP baselines.
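A minimal zero-shot classification sketch using the Hugging Face transformers pipeline; the checkpoint name, image URL, and candidate labels below are placeholders, not values from this card:

```python
from transformers import pipeline
from PIL import Image
import requests

# Zero-shot image classification pipeline; assumes the checkpoint is published
# on the Hub as "google/siglip-large-patch16-256".
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip-large-patch16-256",
)

# Placeholder example image (a COCO sample URL).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate labels are free-form text; no task-specific training is needed.
outputs = classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
print(outputs)
```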

Implementation Details

The model was trained on the WebLI dataset using 16 TPU-v4 chips over three days. It processes images by resizing them to 256x256 resolution and normalizing them across RGB channels with mean and standard deviation of 0.5. Text inputs are tokenized and padded to 64 tokens.

  • Utilizes a transformer-based architecture
  • Implements sigmoid loss for improved scaling and performance
  • Supports F32 tensor operations
  • Optimized for 256x256 image resolution
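The lower-level path behind the preprocessing described above can be sketched with the transformers AutoProcessor and AutoModel classes; the checkpoint name, image URL, and example captions are assumptions for illustration:

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Assumed Hub checkpoint name for this model.
ckpt = "google/siglip-large-patch16-256"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

# Placeholder image; any PIL image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor resizes images to 256x256, normalizes with mean/std 0.5,
# and pads text to 64 tokens when padding="max_length" is used.
texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# SigLIP scores each image-text pair independently: apply a sigmoid, not a softmax.
probs = torch.sigmoid(outputs.logits_per_image)
print(f"{probs[0][0]:.1%} that the image matches '{texts[0]}'")
```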

Core Capabilities

  • Zero-shot image classification
  • Image-text similarity scoring
  • Multimodal understanding
  • Efficient batch processing

Frequently Asked Questions

Q: What makes this model unique?

SigLIP's key innovation lies in its sigmoid loss function, which operates directly on image-text pairs without requiring global similarity normalization. This enables better scaling with batch sizes while maintaining strong performance even at smaller batch sizes.
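To make the contrast with a softmax objective concrete, here is a toy sketch of a pairwise sigmoid loss. It is an illustration under stated assumptions, not the reference implementation: the temperature and bias values are placeholders, and the averaging is simplified.

```python
import torch
import torch.nn.functional as F

def pairwise_sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Toy pairwise sigmoid loss over N image/text embeddings.

    img_emb, txt_emb: (N, D) L2-normalized embeddings.
    t, b: scalar temperature and bias (placeholder values; learnable in practice).
    """
    logits = img_emb @ txt_emb.t() * t + b        # (N, N) pairwise similarities
    labels = 2 * torch.eye(logits.size(0)) - 1    # +1 on matching pairs, -1 elsewhere
    # Log-sigmoid of signed logits: each pair is scored independently,
    # with no normalization across the rest of the batch (unlike softmax).
    # Averaged over all pairs here for simplicity.
    return -F.logsigmoid(labels * logits).mean()

# Toy usage with random embeddings
img = F.normalize(torch.randn(8, 16), dim=-1)
txt = F.normalize(torch.randn(8, 16), dim=-1)
print(pairwise_sigmoid_loss(img, txt))
```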

Q: What are the recommended use cases?

The model excels in zero-shot image classification, image-text retrieval, and general visual understanding tasks. It is particularly useful for applications that need robust image-text matching without task-specific training data or fine-tuning.
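For retrieval-style use, the two towers can be run separately and compared with dot products. A hedged sketch follows; the caption corpus and local image path are made-up placeholders, and the checkpoint name is assumed:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "google/siglip-large-patch16-256"  # assumed Hub checkpoint name
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

# Hypothetical caption corpus to rank against a query image.
captions = ["a dog on the beach", "a city skyline at night", "two cats on a couch"]
image = Image.open("query.jpg")  # placeholder local file

with torch.no_grad():
    txt_inputs = processor(text=captions, padding="max_length", return_tensors="pt")
    txt_emb = F.normalize(model.get_text_features(**txt_inputs), dim=-1)

    img_inputs = processor(images=image, return_tensors="pt")
    img_emb = F.normalize(model.get_image_features(**img_inputs), dim=-1)

# Cosine-similarity ranking: higher score means a better caption match.
scores = (img_emb @ txt_emb.t()).squeeze(0)
best = scores.argmax().item()
print(f"Best match: '{captions[best]}' (score {scores[best]:.3f})")
```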
