clip-variants

Maintained by: mlunar

CLIP Variants

| Property | Value |
|----------|-------|
| License | MIT |
| Format | ONNX |
| Supported Architectures | ResNet-50/101, ViT-B/16, ViT-B/32, ViT-L/14 |
| Precision Types | float32, float16, qint8, quint8 |

What is clip-variants?

clip-variants is a collection of OpenAI's CLIP models converted to ONNX format, offering multiple architecture variants and precision types. Each variant provides both a visual and a textual encoder, making it suitable for multimodal tasks.

Implementation Details

The repository contains converted versions of all available OpenAI CLIP models, each split into two separate ONNX models: a visual encoder and a textual encoder. Every variant is available in multiple precision types to accommodate different performance and size requirements; a minimal loading sketch follows the list below.

  • Supports both ResNet and Vision Transformer (ViT) architectures
  • Includes multiple model sizes from compact to large-scale
  • Offers various precision types for flexibility in deployment
  • Provides complete ONNX compatibility
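As a rough illustration of how a variant might be loaded with ONNX Runtime, the sketch below opens the two graphs of one variant and prints their expected inputs. The file names are placeholders and the input signatures vary per variant and precision, so treat this as an assumption-laden starting point rather than the repository's exact layout.

```python
# Minimal sketch, assuming onnxruntime is installed and the two ONNX files
# below exist locally. File names are illustrative, not the repository's
# exact naming scheme.
import onnxruntime as ort

# Each CLIP variant ships as two ONNX graphs: one for images, one for text.
visual_session = ort.InferenceSession("clip-vit-base-patch32-visual-float16.onnx")
textual_session = ort.InferenceSession("clip-vit-base-patch32-textual-float16.onnx")

# Inspect the expected inputs (names, shapes, and dtypes differ per variant).
for name, session in [("visual", visual_session), ("textual", textual_session)]:
    for inp in session.get_inputs():
        print(name, inp.name, inp.shape, inp.type)
```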

Core Capabilities

  • Zero-shot image classification (see the sketch after this list)
  • Visual-textual alignment
  • Multi-modal feature extraction
  • Flexible deployment options with different precision types
  • Support for both CNN and Transformer architectures
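To make the zero-shot classification flow concrete, here is a hedged sketch that pairs the ONNX encoders with the Hugging Face CLIPProcessor for preprocessing and scores labels by cosine similarity. The file names and the graph input names ("pixel_values", "input_ids") are assumptions; check session.get_inputs() for the real signatures before relying on them.

```python
# Hedged sketch of zero-shot image classification with the ONNX CLIP encoders.
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import CLIPProcessor

# Preprocessing (resize/normalize images, tokenize text) from the original CLIP release.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative file names; substitute the variant and precision you actually use.
visual = ort.InferenceSession("clip-vit-base-patch32-visual-float32.onnx")
textual = ort.InferenceSession("clip-vit-base-patch32-textual-float32.onnx")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg")
inputs = processor(text=labels, images=image, return_tensors="np", padding=True)

# Assumed input names; verify with visual.get_inputs() / textual.get_inputs().
image_emb = visual.run(None, {"pixel_values": inputs["pixel_values"]})[0]
text_emb = textual.run(None, {"input_ids": inputs["input_ids"].astype(np.int64)})[0]

# Cosine similarity between the image embedding and each label embedding.
image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
scores = (image_emb @ text_emb.T)[0]
print(dict(zip(labels, scores.round(3))))
```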

Frequently Asked Questions

Q: What makes this model unique?

This model collection provides ONNX-converted variants of CLIP, making it easier to deploy in various environments while offering multiple precision options for balancing performance and resource usage.
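Because the lower-precision variants trade some accuracy for smaller size and faster inference, one practical check before deployment is to compare their embeddings against the float32 baseline on the same input. The sketch below does this with a dummy image tensor; the file names, the "pixel_values" input name, and the 224x224 resolution are assumptions chosen for illustration.

```python
# Rough sketch: sanity-check a quantized variant against the float32 baseline.
import numpy as np
import onnxruntime as ort

baseline = ort.InferenceSession("clip-vit-base-patch32-visual-float32.onnx")
quantized = ort.InferenceSession("clip-vit-base-patch32-visual-quint8.onnx")

# Dummy input; in practice use properly preprocessed images.
pixel_values = np.random.rand(1, 3, 224, 224).astype(np.float32)

ref = baseline.run(None, {"pixel_values": pixel_values})[0]
out = quantized.run(None, {"pixel_values": pixel_values})[0]

# High cosine similarity suggests the quantized embeddings remain usable.
cos = np.dot(ref.ravel(), out.ravel()) / (np.linalg.norm(ref) * np.linalg.norm(out))
print(f"cosine similarity between float32 and quint8 embeddings: {cos:.4f}")
```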

Q: What are the recommended use cases?

The models are suitable for zero-shot image classification, visual-textual alignment tasks, and general multimodal applications where image and text understanding is required. However, careful evaluation is recommended for specific deployment contexts.
