CLIP Variants
| Property | Value |
|---|---|
| License | MIT |
| Format | ONNX |
| Supported Architectures | ResNet-50/101, ViT-B/16, ViT-B/32, ViT-L/14 |
| Precision Types | float32, float16, qint8, quint8 |
What is clip-variants?
CLIP-variants is a comprehensive collection of OpenAI's CLIP models converted to ONNX format, offering multiple architecture variants and precision types. Each variant ships with both a visual and a textual encoder, making the collection suitable for multimodal tasks.
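As a minimal sketch, the two encoders of a variant can be loaded as separate onnxruntime sessions and inspected before use. The file names below are illustrative assumptions; the repository's actual naming may differ.

```python
import onnxruntime as ort

# Illustrative file names; check the repository for the actual exports.
visual_session = ort.InferenceSession("clip-vit-base-patch32-visual-float32.onnx")
textual_session = ort.InferenceSession("clip-vit-base-patch32-textual-float32.onnx")

# Inspect expected input names, shapes, and dtypes before feeding data.
for part, session in [("visual", visual_session), ("textual", textual_session)]:
    inp = session.get_inputs()[0]
    print(part, inp.name, inp.shape, inp.type)
```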
Implementation Details
The repository contains converted versions of all available OpenAI CLIP models, each split into two separate models: a visual encoder and a textual encoder. Every variant is available in multiple precision types to accommodate different performance and size requirements.
- Supports both ResNet and Vision Transformer (ViT) architectures
- Includes multiple model sizes from compact to large-scale
- Offers various precision types for flexibility in deployment (a precision-selection sketch follows this list)
- Provides complete ONNX compatibility
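In practice, the chosen precision mainly affects on-disk size and the input dtype you feed. The sketch below assumes a `<architecture>-<part>-<precision>.onnx` naming pattern, which is an assumption about the repository layout rather than a documented convention:

```python
import numpy as np
import onnxruntime as ort

precision = "float16"  # one of: float32, float16, qint8, quint8
visual = ort.InferenceSession(f"clip-vit-base-patch32-visual-{precision}.onnx")

# Assumption: the input dtype must match the export. A float16 model expects
# float16 pixels, while quantized (qint8/quint8) exports usually still take float32.
dtype = np.float16 if precision == "float16" else np.float32
pixels = np.random.rand(1, 3, 224, 224).astype(dtype)  # dummy preprocessed image

features = visual.run(None, {visual.get_inputs()[0].name: pixels})[0]
print(features.shape)  # e.g. (1, 512) for a ViT-B/32 visual encoder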
Core Capabilities
- Zero-shot image classification (an end-to-end sketch follows this list)
- Visual-textual alignment
- Multimodal feature extraction
- Flexible deployment options with different precision types
- Support for both CNN and Transformer architectures
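Putting the pieces together, here is a hedged end-to-end zero-shot classification sketch. It uses Hugging Face's `CLIPProcessor` for the standard CLIP preprocessing and assumes each ONNX encoder takes a single input and returns a single embedding tensor; the input names, expected integer dtype, and any additional text inputs (such as an attention mask) should be verified against the actual exports.

```python
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
visual = ort.InferenceSession("clip-vit-base-patch32-visual-float32.onnx")
textual = ort.InferenceSession("clip-vit-base-patch32-textual-float32.onnx")

image = Image.open("example.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Standard CLIP preprocessing: resize/normalize the image, BPE-tokenize the text.
pixel_values = processor(images=image, return_tensors="np")["pixel_values"]
input_ids = processor(text=labels, return_tensors="np", padding="max_length")["input_ids"]

# Run both encoders; assumed: one input per model, one embedding tensor out.
# The int64 cast is an assumption -- some exports expect int32 token ids.
image_emb = visual.run(None, {visual.get_inputs()[0].name: pixel_values})[0]
text_emb = textual.run(None, {textual.get_inputs()[0].name: input_ids.astype(np.int64)})[0]

# L2-normalize, then scaled cosine similarity + softmax gives zero-shot probabilities.
image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
logits = 100.0 * image_emb @ text_emb.T
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(dict(zip(labels, probs[0].tolist())))
```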
Frequently Asked Questions
Q: What makes this model unique?
A: This model collection provides ONNX-converted variants of CLIP, making it easier to deploy in various environments while offering multiple precision options for balancing performance and resource usage.
Q: What are the recommended use cases?
A: The models are suitable for zero-shot image classification, visual-textual alignment tasks, and general multimodal applications where image and text understanding is required. However, careful evaluation is recommended for specific deployment contexts.