siglip-so400m-patch14-224

Maintained By
google

SigLIP SO400M

Property           Value
Parameter Count    877M
License            Apache 2.0
Training Data      WebLI dataset
Architecture       SoViT-400m with 14x14 patches
Resolution         224x224

What is siglip-so400m-patch14-224?

SigLIP SO400M is a shape-optimized vision transformer that implements an improved variant of the CLIP architecture. Developed by Google Research, it replaces CLIP's softmax-based contrastive loss with a sigmoid loss over image-text pairs, enabling better scaling and stronger performance than comparable CLIP models. The model was trained on the WebLI dataset using 16 TPU-v4 chips over three days.
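
As a rough illustration, the sketch below shows how the checkpoint might be loaded and used to score image-text pairs with Hugging Face Transformers. The checkpoint id and the sample image URL are assumptions, and torch.sigmoid is applied to the raw logits because SigLIP scores each pair independently rather than with a softmax over the batch.

    # Minimal sketch (assumes transformers, torch, Pillow, and requests are installed)
    import requests
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    model_id = "google/siglip-so400m-patch14-224"  # assumed Hugging Face checkpoint id
    model = AutoModel.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)

    # Example image URL is a placeholder; any RGB image works.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    texts = ["a photo of 2 cats", "a photo of a dog"]

    # SigLIP expects texts padded to the fixed 64-token length.
    inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Each image-text pair gets an independent sigmoid probability (no softmax over candidates).
    probs = torch.sigmoid(outputs.logits_per_image)
    print(probs)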

Implementation Details

The model processes images at 224x224 resolution, normalizing them across the RGB channels with a mean and standard deviation of 0.5. Text inputs are tokenized and padded to a fixed length of 64 tokens. The architecture follows the SoViT-400m design, a shape-optimized vision transformer derived through compute-optimal model scaling (see the preprocessing sketch after the list below).

  • Shape-optimized Vision Transformer architecture
  • Sigmoid loss for improved batch processing
  • Pre-trained on extensive WebLI dataset
  • Efficient 14x14 patch processing
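
The preprocessing values above can be checked directly on the processor object. The snippet below is a small sketch; attribute names follow the Transformers SiglipProcessor conventions and may differ slightly across library versions.

    # Sketch: inspect the preprocessing configuration (checkpoint id assumed as above).
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-224")

    # Image side: 224x224 resize, per-channel mean/std of 0.5.
    print(processor.image_processor.size)        # e.g. {"height": 224, "width": 224}
    print(processor.image_processor.image_mean)  # e.g. [0.5, 0.5, 0.5]
    print(processor.image_processor.image_std)   # e.g. [0.5, 0.5, 0.5]

    # Text side: tokenize and pad to the fixed 64-token context.
    batch = processor(text=["a photo of a cat"], padding="max_length",
                      max_length=64, return_tensors="pt")
    print(batch["input_ids"].shape)              # expected: torch.Size([1, 64])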

Core Capabilities

  • Zero-shot image classification
  • Image-text retrieval (see the retrieval sketch after this list)
  • Multimodal understanding
  • Flexible batch size processing
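
To illustrate the retrieval capability, the sketch below embeds one image and several candidate captions with the model's get_image_features / get_text_features heads and ranks the captions by cosine similarity. The checkpoint id, image path, and captions are placeholders.

    # Sketch: rank candidate captions for one image by cosine similarity.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    model_id = "google/siglip-so400m-patch14-224"  # assumed checkpoint id
    model = AutoModel.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.open("example.jpg")  # placeholder local image
    captions = ["a dog on a beach", "a plate of pasta", "a city skyline at night"]

    with torch.no_grad():
        img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
        txt_emb = model.get_text_features(
            **processor(text=captions, padding="max_length", return_tensors="pt")
        )

    # L2-normalize, then cosine similarity is a plain dot product.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ txt_emb.T).squeeze(0)

    for caption, score in sorted(zip(captions, scores.tolist()), key=lambda x: -x[1]):
        print(f"{score:.3f}  {caption}")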

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its sigmoid loss, which scores each image-text pair independently instead of normalizing similarities across the whole batch as CLIP's softmax-based contrastive loss does. Removing this global similarity normalization lets the model scale to larger batch sizes while still performing well at smaller ones.
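
For intuition, here is a compact PyTorch sketch of a SigLIP-style pairwise sigmoid loss: every image-text pair in the batch is scored independently, with matching pairs labeled +1 and all others -1. The temperature and bias values are assumptions for illustration, not the paper's reference implementation.

    # Sketch of a SigLIP-style pairwise sigmoid loss (illustrative only).
    import torch
    import torch.nn.functional as F

    def sigmoid_contrastive_loss(img_emb, txt_emb, t, b):
        """img_emb, txt_emb: (N, D) L2-normalized embeddings; t, b: learnable scalars."""
        logits = img_emb @ txt_emb.T * t + b            # (N, N) pairwise scores
        labels = 2 * torch.eye(logits.size(0)) - 1      # +1 on the diagonal, -1 elsewhere
        # -log sigmoid(label * logit), averaged over images; no softmax normalization needed.
        return -F.logsigmoid(labels * logits).sum() / logits.size(0)

    # Toy usage with random embeddings.
    n, d = 4, 16
    img = F.normalize(torch.randn(n, d), dim=-1)
    txt = F.normalize(torch.randn(n, d), dim=-1)
    t = torch.tensor(10.0)   # assumed temperature value for illustration
    b = torch.tensor(-10.0)  # assumed bias value for illustration
    print(sigmoid_contrastive_loss(img, txt, t, b))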

Q: What are the recommended use cases?

The model excels in zero-shot image classification and image-text retrieval tasks. It's particularly suitable for applications requiring flexible batch processing and robust multimodal understanding without extensive task-specific training.
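
For zero-shot classification specifically, the high-level pipeline API is a convenient entry point. The snippet below is a sketch assuming the Transformers zero-shot-image-classification pipeline supports this checkpoint; the image path and candidate labels are placeholders.

    # Sketch: zero-shot image classification via the high-level pipeline API.
    from transformers import pipeline

    classifier = pipeline(
        task="zero-shot-image-classification",
        model="google/siglip-so400m-patch14-224",  # assumed checkpoint id
    )

    # "photo.jpg" is a placeholder path; candidate labels are free-form text.
    result = classifier(
        "photo.jpg",
        candidate_labels=["a photo of a cat", "a photo of a dog", "a photo of a car"],
    )
    print(result)  # list of {"label": ..., "score": ...} dicts, highest score first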