CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup
| Property | Value |
|---|---|
| License | MIT |
| Base Architecture | ConvNeXt-Large |
| Training Data | LAION-2B |
| Resolution | 320x320 |
| ImageNet Zero-Shot Accuracy | 76.9% |
What is CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup?
This is a CLIP model that pairs a ConvNeXt-Large image tower with a 16-layer, 768-dimension text tower. It is a "model soup" created by averaging the weights of three fine-tuned versions of the base model, each trained with a different learning rate and number of samples. The model operates at 320x320 resolution and is markedly more efficient than the OpenAI L/14 model at 336x336, requiring significantly less compute while maintaining high performance.
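A minimal zero-shot classification sketch using the open_clip library is shown below. The model and pretrained tags are the ones registered in recent open_clip releases (verify against your installed version), and the image path and prompts are placeholders:

```python
import torch
import open_clip
from PIL import Image

# Model/pretrained tags as registered in open_clip; check your installed version.
model, _, preprocess = open_clip.create_model_and_transforms(
    "convnext_large_d_320", pretrained="laion2b_s29b_b131k_ft_soup"
)
tokenizer = open_clip.get_tokenizer("convnext_large_d_320")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image path
text = tokenizer(["a photo of a cat", "a photo of a dog"])  # placeholder prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarities scaled to logits, then softmax over the candidate labels.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # one probability per prompt
```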
Implementation Details
The architecture pairs a ConvNeXt-Large image tower, topped with an MLP head (fc-gelu-drop-fc) rather than a single linear projection, with a text tower of 16 layers and 768 embedding dimensions. It was trained on the LAION-2B dataset with augmentation including Random Resize Crop, Random Erasing, and Stochastic Depth. Key figures (with a verification sketch after the list):
- Vision tower: ConvNeXt-Large with MLP head (fc-gelu-drop-fc)
- Text tower: 16 layers depth, 768 embedding dimensions
- Training batch size: 131,072
- Fine-tuning samples: ~2-3B
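To sanity-check those figures, you can instantiate the model with open_clip and inspect the text tower directly. The attribute paths below assume open_clip's standard CLIP class and may differ across versions:

```python
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "convnext_large_d_320", pretrained="laion2b_s29b_b131k_ft_soup"
)

# Text tower depth and width (expected per the figures above: 16 and 768).
print(len(model.transformer.resblocks))
print(model.token_embedding.embedding_dim)

# Total parameter count across both towers.
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.1f}M")
```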
Core Capabilities
- Zero-shot image classification (76.9% ImageNet accuracy)
- Image and text retrieval
- Transfer learning for downstream tasks (a linear-probe sketch follows this list)
- Image classification fine-tuning
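A common transfer-learning pattern is a linear probe: freeze the CLIP image tower and train only a small classifier on its embeddings. The sketch below assumes open_clip and a generic batch of preprocessed images and labels; `NUM_CLASSES` and the training-loop wiring are hypothetical:

```python
import torch
import torch.nn as nn
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "convnext_large_d_320", pretrained="laion2b_s29b_b131k_ft_soup"
)
for p in model.parameters():  # freeze the whole CLIP model
    p.requires_grad = False
model.eval()

EMBED_DIM = 768     # CLIP embedding width for this model
NUM_CLASSES = 10    # hypothetical downstream label count
probe = nn.Linear(EMBED_DIM, NUM_CLASSES)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One probe update on a preprocessed (B, 3, 320, 320) image batch."""
    with torch.no_grad():
        feats = model.encode_image(images)
        feats = feats / feats.norm(dim=-1, keepdim=True)
    loss = criterion(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```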
Frequently Asked Questions
Q: What makes this model unique?
Its distinguishing features are the "soup" approach, which averages the weights of three fine-tunes trained with different learning rates, and an efficient convolutional architecture that outperforms larger models while using fewer resources.
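Conceptually, a uniform model soup is just an element-wise average of checkpoint weights. The sketch below is illustrative only: the file names are hypothetical, and the actual recipe used for this release (which fine-tunes were included and how they were combined) may differ.

```python
import torch

def uniform_soup(checkpoint_paths):
    """Element-wise average of several fine-tuned checkpoints' weights."""
    avg = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(checkpoint_paths) for k, v in avg.items()}

# Hypothetical file names for three 320x320 fine-tunes:
# souped = uniform_soup(["ft_a.pt", "ft_b.pt", "ft_c.pt"])
# model.load_state_dict(souped)
```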
Q: What are the recommended use cases?
The model is best suited for research use: zero-shot image classification, image-text retrieval, and as a starting point for transfer learning. It is not recommended for production deployment without thorough testing and evaluation.