# ViT-B-32__openai
| Property | Value |
|---|---|
| Source Model | openai/clip-vit-base-patch32 |
| Format | ONNX |
| Author | immich-app |
| Model Hub | Hugging Face |
## What is ViT-B-32__openai?
ViT-B-32__openai is an ONNX-optimized version of OpenAI's CLIP Vision Transformer model, adapted for the Immich self-hosted photo library system. The implementation splits the original CLIP model into distinct visual and textual encoders, so image and text embeddings can be generated independently and efficiently.
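The separated encoders can be driven directly with ONNX Runtime. The following is a minimal, unofficial sketch of generating an image embedding with the visual encoder; the file path `visual/model.onnx`, the single-input feed, and the preprocessing constants are assumptions based on the standard CLIP ViT-B/32 pipeline rather than details confirmed by this model card.

```python
# Minimal sketch: image embedding from the separated visual encoder (assumed layout).
import numpy as np
import onnxruntime as ort
from PIL import Image

# Standard CLIP normalization constants (assumed to match this export).
MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(path: str) -> np.ndarray:
    """Resize to 224x224, normalize, and return an NCHW float32 batch of size 1."""
    img = Image.open(path).convert("RGB").resize((224, 224), Image.BICUBIC)
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = ((x - MEAN) / STD).astype(np.float32)
    return x.transpose(2, 0, 1)[None, :]  # HWC -> NCHW, add batch dimension

# Hypothetical file path; the actual repository layout may differ.
visual = ort.InferenceSession("visual/model.onnx")
input_name = visual.get_inputs()[0].name
image_embedding = visual.run(None, {input_name: preprocess("photo.jpg")})[0]
print(image_embedding.shape)  # e.g. (1, 512) for ViT-B/32
```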
## Implementation Details
The model is based on the Vision Transformer (ViT) architecture with a patch size of 32x32 pixels. It has been converted to ONNX format for improved deployment efficiency and cross-platform compatibility. The separation of encoders allows for more flexible usage in photo management applications.
- Separated visual and text encoders for independent embedding generation
- ONNX format optimization for deployment efficiency
- Based on OpenAI's CLIP ViT-Base architecture
- Specifically tailored for photo library applications
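A matching sketch for the text side is shown below. It assumes the source model's `CLIPTokenizer` (from `openai/clip-vit-base-patch32`) and a single `input_ids` feed; the exported textual encoder may expect a different input layout, so treat the path `textual/model.onnx` and the feed handling as placeholders.

```python
# Minimal sketch: text embedding from the separated textual encoder (assumed layout).
import numpy as np
import onnxruntime as ort
from transformers import CLIPTokenizer

# Tokenizer of the source model; assumed to match the exported textual encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
textual = ort.InferenceSession("textual/model.onnx")  # hypothetical file path

tokens = tokenizer(
    ["a photo of a beach at sunset"],
    padding="max_length",
    max_length=77,          # CLIP's standard context length
    truncation=True,
    return_tensors="np",
)

# Assumes the first (and only) graph input takes the token ids.
input_name = textual.get_inputs()[0].name
text_embedding = textual.run(None, {input_name: tokens["input_ids"].astype(np.int64)})[0]
print(text_embedding.shape)  # e.g. (1, 512)
```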
## Core Capabilities
- Generate image embeddings for visual similarity search
- Create text embeddings for semantic matching
- Enable cross-modal image-text matching
- Support efficient photo organization and search
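Cross-modal matching typically reduces to cosine similarity between L2-normalized embeddings. The sketch below illustrates ranking a set of image embeddings against a single text embedding; the random arrays are placeholders standing in for the encoder outputs shown above.

```python
# Sketch: rank images against a text query by cosine similarity of normalized embeddings.
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so the dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Placeholder data: image_embeddings (N, D) from the visual encoder,
# text_embedding (1, D) from the textual encoder.
image_embeddings = l2_normalize(np.random.rand(5, 512).astype(np.float32))
text_embedding = l2_normalize(np.random.rand(1, 512).astype(np.float32))

scores = image_embeddings @ text_embedding.T   # (N, 1) cosine similarities
ranking = np.argsort(-scores.ravel())          # indices of best matches first
print(ranking, scores.ravel()[ranking])
```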
## Frequently Asked Questions
**Q: What makes this model unique?**
Its distinguishing features are the ONNX conversion and the separation of the CLIP encoders into standalone visual and textual models, which make it efficient to deploy in self-hosted photo management systems such as Immich.
**Q: What are the recommended use cases?**
The model is specifically designed for photo library applications, particularly within the Immich ecosystem. It excels at tasks such as image similarity search, photo organization, and text-based image retrieval in self-hosted environments.