TinyCLIP-ViT-8M-16-Text-3M-YFCC15M
| Property | Value |
|---|---|
| Parameter Count | 23.4M parameters |
| Model Type | Zero-Shot Image Classification |
| License | MIT |
| Training Data | YFCC15M |
| ImageNet Top-1 Accuracy | 41.1% |
What is TinyCLIP-ViT-8M-16-Text-3M-YFCC15M?
TinyCLIP-ViT-8M-16-Text-3M-YFCC15M is a compact version of CLIP obtained through a cross-modal distillation approach. Introduced at ICCV 2023, the model targets efficient vision-language inference, requiring only 2.0 GMACs while maintaining strong zero-shot performance for its size.
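Because the checkpoint follows the standard CLIP interface, it can typically be loaded with Hugging Face's CLIPModel and CLIPProcessor classes. The snippet below is a minimal zero-shot classification sketch; the repository ID, image path, and candidate labels are placeholders and assumptions, not values confirmed by this card.

```python
# Minimal zero-shot classification sketch (assumes the checkpoint is
# compatible with Hugging Face's CLIPModel / CLIPProcessor API).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"  # assumed repository ID
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")                  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog"]  # placeholder candidate classes

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_labels); softmax gives class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```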
Implementation Details
The model implements two core techniques, affinity mimicking and weight inheritance, which enable effective knowledge distillation from larger CLIP models. It uses a ViT image encoder with a 16×16 patch size and has been optimized for both speed and accuracy; a simplified sketch of the affinity-mimicking loss follows the list below.
- Efficient architecture with only 23.4M parameters
- Trained on YFCC15M dataset
- Achieves 41.1% accuracy on ImageNet
- Processes 4,150 image-text pairs per second
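To make the affinity-mimicking idea concrete, the sketch below shows one way a student can be trained to match the teacher's image-to-text and text-to-image affinity distributions with a KL-divergence loss. This is an illustrative reconstruction based on the technique's description, not the authors' implementation; the temperature value and function signature are assumptions. Weight inheritance, the second technique, initializes the small student from a selected subset of teacher weights before distillation and is not shown here.

```python
# Illustrative affinity-mimicking loss (not the authors' code); `tau` is an assumed temperature.
import torch
import torch.nn.functional as F

def affinity_mimicking_loss(student_img, student_txt, teacher_img, teacher_txt, tau=0.07):
    # L2-normalize embeddings so dot products become cosine similarities.
    s_img, s_txt = F.normalize(student_img, dim=-1), F.normalize(student_txt, dim=-1)
    t_img, t_txt = F.normalize(teacher_img, dim=-1), F.normalize(teacher_txt, dim=-1)

    # Batch-level affinity matrices (batch x batch), scaled by temperature.
    s_affinity = s_img @ s_txt.t() / tau
    t_affinity = t_img @ t_txt.t() / tau

    # The student mimics the teacher's affinity distributions in both directions.
    loss_i2t = F.kl_div(F.log_softmax(s_affinity, dim=-1),
                        F.softmax(t_affinity, dim=-1), reduction="batchmean")
    loss_t2i = F.kl_div(F.log_softmax(s_affinity.t(), dim=-1),
                        F.softmax(t_affinity.t(), dim=-1), reduction="batchmean")
    return 0.5 * (loss_i2t + loss_t2i)
```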
Core Capabilities
- Zero-shot image classification
- Cross-modal understanding between images and text (see the retrieval sketch below)
- Efficient inference with minimal computational requirements
- Suitable for resource-constrained environments
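The cross-modal capability can also be used for retrieval-style scoring by extracting image and text embeddings separately and comparing them with cosine similarity. The sketch below assumes the same Hugging Face-compatible checkpoint and uses placeholder image paths and captions.

```python
# Retrieval-style image-text similarity sketch (repository ID and image paths are placeholders).
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"  # assumed repository ID
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

images = [Image.open("a.jpg"), Image.open("b.jpg")]
captions = ["a dog running on the beach", "a bowl of fresh fruit"]

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=captions, return_tensors="pt", padding=True))

# Cosine-similarity matrix: rows are images, columns are captions.
similarity = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t()
print(similarity)
```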
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its extremely efficient architecture: through affinity mimicking and weight inheritance it reaches 41.1% ImageNet accuracy with just 23.4M parameters. It is also notable for its high throughput of 4,150 image-text pairs per second.
Q: What are the recommended use cases?
The model is ideal for applications requiring zero-shot image classification in resource-constrained environments, such as mobile devices or edge computing scenarios. It's particularly suitable for tasks needing real-time image-text understanding without extensive computational resources.
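For mobile or edge deployments, it is worth measuring throughput on the target hardware rather than relying on the published 4,150 pairs/s figure, which depends on the original benchmarking setup. The sketch below times batched forward passes with dummy inputs; the batch size, iteration count, and assumed 224×224 input resolution are arbitrary choices for illustration.

```python
# Rough throughput check with dummy inputs (batch size, iterations, and 224x224 resolution are assumptions).
import time
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"  # assumed repository ID
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

batch_size, iters = 8, 20                            # arbitrary benchmark settings
pixel_values = torch.randn(batch_size, 3, 224, 224)  # dummy image batch
text_inputs = processor(text=["a photo"] * batch_size, return_tensors="pt", padding=True)

with torch.no_grad():
    model(pixel_values=pixel_values, **text_inputs)  # warm-up pass
    start = time.perf_counter()
    for _ in range(iters):
        model(pixel_values=pixel_values, **text_inputs)
    elapsed = time.perf_counter() - start

print(f"~{batch_size * iters / elapsed:.1f} image-text pairs/s on this machine")
```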