# ViT-B-32__openai
| Property | Value |
|---|---|
| Source Model | openai/clip-vit-base-patch32 |
| Format | ONNX |
| Author | immich-app |
| Model Hub | Hugging Face |
## What is ViT-B-32__openai?
ViT-B-32__openai is an ONNX-optimized version of OpenAI's CLIP Vision Transformer model, adapted for the Immich self-hosted photo library system. The implementation splits the original CLIP model into distinct visual and textual encoders, so image and text embeddings can be generated independently and efficiently.
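The separated encoders can be driven directly with ONNX Runtime. The following is a minimal, unofficial sketch of generating an image embedding with the visual encoder; the file path `visual/model.onnx`, the single-input feed, and the preprocessing constants are assumptions based on the standard CLIP ViT-B/32 pipeline rather than details confirmed by this model card.

```python
# Minimal sketch: image embedding from the separated visual encoder (assumed layout).
import numpy as np
import onnxruntime as ort
from PIL import Image

# Standard CLIP normalization constants (assumed to match this export).
MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(path: str) -> np.ndarray:
    """Resize to 224x224, normalize, and return an NCHW float32 batch of size 1."""
    img = Image.open(path).convert("RGB").resize((224, 224), Image.BICUBIC)
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = ((x - MEAN) / STD).astype(np.float32)
    return x.transpose(2, 0, 1)[None, :]  # HWC -> NCHW, add batch dimension

# Hypothetical file path; the actual repository layout may differ.
visual = ort.InferenceSession("visual/model.onnx")
input_name = visual.get_inputs()[0].name
image_embedding = visual.run(None, {input_name: preprocess("photo.jpg")})[0]
print(image_embedding.shape)  # e.g. (1, 512) for ViT-B/32
```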
## Implementation Details
The model is based on the Vision Transformer (ViT) architecture with a patch size of 32x32 pixels. It has been converted to ONNX format for improved deployment efficiency and cross-platform compatibility. The separation of encoders allows for more flexible usage in photo management applications.
- Separated visual and text encoders for independent embedding generation
- ONNX format optimization for deployment efficiency
- Based on OpenAI's CLIP ViT-Base architecture
- Specifically tailored for photo library applications
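A matching sketch for the text side is shown below. It assumes the source model's `CLIPTokenizer` (from `openai/clip-vit-base-patch32`) and a single `input_ids` feed; the exported textual encoder may expect a different input layout, so treat the path `textual/model.onnx` and the feed handling as placeholders.

```python
# Minimal sketch: text embedding from the separated textual encoder (assumed layout).
import numpy as np
import onnxruntime as ort
from transformers import CLIPTokenizer

# Tokenizer of the source model; assumed to match the exported textual encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
textual = ort.InferenceSession("textual/model.onnx")  # hypothetical file path

tokens = tokenizer(
    ["a photo of a beach at sunset"],
    padding="max_length",
    max_length=77,          # CLIP's standard context length
    truncation=True,
    return_tensors="np",
)

# Assumes the first (and only) graph input takes the token ids.
input_name = textual.get_inputs()[0].name
text_embedding = textual.run(None, {input_name: tokens["input_ids"].astype(np.int64)})[0]
print(text_embedding.shape)  # e.g. (1, 512)
```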
## Core Capabilities
- Generate image embeddings for visual similarity search
- Create text embeddings for semantic matching
- Enable cross-modal image-text matching
- Support efficient photo organization and search
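Cross-modal matching typically reduces to cosine similarity between L2-normalized embeddings. The sketch below illustrates ranking a set of image embeddings against a single text embedding; the random arrays are placeholders standing in for the encoder outputs shown above.

```python
# Sketch: rank images against a text query by cosine similarity of normalized embeddings.
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so the dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Placeholder data: image_embeddings (N, D) from the visual encoder,
# text_embedding (1, D) from the textual encoder.
image_embeddings = l2_normalize(np.random.rand(5, 512).astype(np.float32))
text_embedding = l2_normalize(np.random.rand(1, 512).astype(np.float32))

scores = image_embeddings @ text_embedding.T   # (N, 1) cosine similarities
ranking = np.argsort(-scores.ravel())          # indices of best matches first
print(ranking, scores.ravel()[ranking])
```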
## Frequently Asked Questions
**Q: What makes this model unique?**
Its distinguishing features are the ONNX conversion and the separation of the CLIP encoders into standalone visual and textual models, which make it efficient to deploy in self-hosted photo management systems such as Immich.
**Q: What are the recommended use cases?**
The model is specifically designed for photo library applications, particularly within the Immich ecosystem. It excels at tasks such as image similarity search, photo organization, and text-based image retrieval in self-hosted environments.