SigLIP Large Patch16-384 (siglip-large-patch16-384)

Maintained by: google

  • Parameter Count: 652M
  • License: Apache 2.0
  • Training Data: WebLI dataset
  • Resolution: 384x384
  • Paper: Sigmoid Loss for Language Image Pre-Training

What is siglip-large-patch16-384?

SigLIP is a vision-language model that keeps CLIP's dual-encoder architecture but replaces the softmax-based contrastive loss with a pairwise sigmoid loss. This large variant, trained on 384x384 images, is particularly strong at zero-shot image classification.

Implementation Details

The model was trained on the WebLI dataset using 16 TPU-v4 chips over three days. Images are resized to 384x384 and normalized per RGB channel (mean 0.5, std 0.5); text inputs are tokenized and padded to 64 tokens. A minimal usage sketch follows the list below.

  • Improved loss function that doesn't require global similarity normalization
  • Supports larger batch sizes while maintaining performance
  • Processes both image and text inputs for multimodal understanding
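This preprocessing maps directly onto the Hugging Face transformers API. The sketch below is a minimal example assuming the checkpoint is published as google/siglip-large-patch16-384 and that the installed transformers version includes SigLIP support; the image URL and candidate labels are placeholders.

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Load the checkpoint (name assumed from this card)
model = AutoModel.from_pretrained("google/siglip-large-patch16-384")
processor = AutoProcessor.from_pretrained("google/siglip-large-patch16-384")

# Placeholder image and candidate labels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of 2 cats", "a photo of a dog"]

# The processor resizes to 384x384, normalizes with mean/std 0.5 per channel,
# and pads text to 64 tokens (padding="max_length" matches training).
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# SigLIP scores each image-text pair independently: apply a sigmoid, not a softmax.
probs = torch.sigmoid(outputs.logits_per_image)
print([f"{p:.1%}" for p in probs[0].tolist()])
```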

Core Capabilities

  • Zero-shot image classification
  • Image-text retrieval
  • Multimodal understanding with high accuracy
  • Efficient processing of high-resolution images

Frequently Asked Questions

Q: What makes this model unique?

SigLIP's key innovation lies in its sigmoid loss function, which operates directly on image-text pairs without requiring global normalization. This allows for better scaling and improved performance even with smaller batch sizes compared to traditional CLIP models.
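For intuition, here is a rough PyTorch sketch of the pairwise sigmoid loss described in the paper. The function name and the exact parameterization (a learnable log-temperature and bias, both scalars) follow the paper's description; treat it as an illustration, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, log_t, b):
    """Pairwise sigmoid loss: every image-text pair is an independent
    binary decision (match vs. non-match), so no batch-wide softmax
    normalization is required."""
    # img_emb, txt_emb: L2-normalized embeddings of shape (batch, dim)
    logits = img_emb @ txt_emb.t() * log_t.exp() + b
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```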

Q: What are the recommended use cases?

The model excels at zero-shot image classification and image-text retrieval tasks. It's particularly useful for applications requiring high-resolution image understanding (384x384) and flexible deployment scenarios where batch size optimization is crucial.
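For quick experiments, the zero-shot classification use case can also be run through the transformers pipeline API. The sketch below assumes the same checkpoint name as above and uses placeholder labels and image URL.

```python
from transformers import pipeline

# Zero-shot image classification with placeholder candidate labels
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip-large-patch16-384",
)
results = classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    candidate_labels=["2 cats", "a plane", "a remote"],
)
print(results)
```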
