CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup

Maintained by: laion


Property                       Value
License                        MIT
Base Architecture              ConvNeXt-Large
Training Data                  LAION-2B
Resolution                     320x320
ImageNet Zero-Shot Accuracy    76.9%

What is CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup?

This is a CLIP model that pairs a ConvNeXt-Large image tower with an enhanced text tower (16 layers, 768 embedding dimensions). It is a "model soup": its weights are the average of three fine-tuned versions of the base model, each trained with a different learning rate and sample count. The model operates at 320x320 resolution and is more efficient than the OpenAI L/14 model at 336x336, requiring significantly fewer computations while maintaining high performance.
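The weight averaging behind a model soup can be illustrated with a minimal sketch. It assumes a uniform (unweighted) average, which the description above suggests but does not spell out, and it uses plain Python lists as a toy stand-in for real tensors. The parameter names and values are made up for illustration.

```python
from typing import Dict, List

# Toy stand-in for a framework state dict: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def average_state_dicts(state_dicts: List[StateDict]) -> StateDict:
    """Uniformly average same-named parameters across checkpoints."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints must share the same parameter names")

    n = len(state_dicts)
    soup: StateDict = {}
    for name in keys:
        # Every checkpoint must store this parameter with the same size.
        if len({len(sd[name]) for sd in state_dicts}) != 1:
            raise ValueError(f"size mismatch for parameter {name!r}")
        # zip(*) groups the i-th value of each checkpoint together.
        columns = zip(*(sd[name] for sd in state_dicts))
        soup[name] = [sum(col) / n for col in columns]
    return soup


# Three hypothetical fine-tunes of the same architecture (values are made up).
ft_a = {"head.weight": [1.0, 2.0], "head.bias": [0.0]}
ft_b = {"head.weight": [3.0, 4.0], "head.bias": [3.0]}
ft_c = {"head.weight": [5.0, 6.0], "head.bias": [6.0]}

soup = average_state_dicts([ft_a, ft_b, ft_c])
print(soup)
```

Averaging only makes sense when all checkpoints share one architecture and one starting point, which is why the sketch rejects mismatched parameter names or sizes.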

Implementation Details

The architecture uses a ConvNeXt-Large image tower with an MLP head and an enhanced text tower with 16 layers and 768 embedding dimensions. It is trained on the LAION-2B dataset with augmentation and regularization techniques including Random Resize Crop, Random Erasing, and Stochastic Depth.

  • Vision tower: ConvNeXt-Large with MLP head (fc-gelu-drop-fc)
  • Text tower: 16 layers depth, 768 embedding dimensions
  • Training batch size: 131,072
  • Fine-tuning samples: ~2-3B
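To make the "fc-gelu-drop-fc" head concrete, here is a small pure-Python sketch of that layer order: a linear layer, a GELU activation, dropout, and a second linear layer. It is only an illustration under stated assumptions: the exact (erf-based) GELU variant, the tiny layer sizes, and all weights are hypothetical and not taken from the released model.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # one row of weights per output unit


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: weight @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x: Vector) -> Vector:
    """Exact GELU, x * Phi(x), using the error function."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def dropout(x: Vector, p: float, training: bool) -> Vector:
    """Inverted dropout; acts as the identity at inference time."""
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [v / keep if random.random() < keep else 0.0 for v in x]


def mlp_head(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector,
             drop_p: float = 0.0, training: bool = False) -> Vector:
    """fc -> gelu -> drop -> fc, in that order."""
    hidden = dropout(gelu(linear(x, w1, b1)), drop_p, training)
    return linear(hidden, w2, b2)


# Toy sizes: 2 pooled features -> 3 hidden units -> 2 embedding dimensions.
features = [1.0, -1.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
b2 = [0.0, 0.5]

embedding = mlp_head(features, w1, b1, w2, b2)
print(embedding)
```

In a real model the head would project the pooled ConvNeXt features into the embedding space shared with the text tower.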

Core Capabilities

  • Zero-shot image classification (76.9% ImageNet accuracy)
  • Image and text retrieval
  • Transfer learning for downstream tasks
  • Image classification fine-tuning
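Zero-shot classification and retrieval with a CLIP-style model both come down to comparing embeddings. The sketch below shows only that scoring step: cosine similarity between one image embedding and several text-prompt embeddings, followed by a softmax. The embeddings are made-up 3-dimensional vectors, and the `logit_scale` of 100 is an assumed temperature; in practice the vectors would come from the model's image and text towers.

```python
import math
from typing import Dict, List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def zero_shot_probs(image_emb: List[float],
                    label_embs: Dict[str, List[float]],
                    logit_scale: float = 100.0) -> Dict[str, float]:
    """Softmax over scaled cosine similarities between image and label prompts."""
    img = l2_normalize(image_emb)
    logits = {
        label: logit_scale * sum(a * b for a, b in zip(img, l2_normalize(emb)))
        for label, emb in label_embs.items()
    }
    # Subtract the max logit for numerical stability before exponentiating.
    top = max(logits.values())
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical embeddings; real ones have hundreds of dimensions.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}

probs = zero_shot_probs(image_embedding, text_embeddings)
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

For retrieval, the same similarity scores would be used to rank candidates instead of being passed through a softmax over class prompts.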

Frequently Asked Questions

Q: What makes this model unique?

Two things set this model apart. First, it is a "soup": its weights are the average of three fine-tuned models trained with different learning rates. Second, at 320x320 it requires significantly fewer computations than the OpenAI L/14 model at 336x336 while maintaining high performance.

Q: What are the recommended use cases?

The model is best suited for research applications in zero-shot image classification, image-text retrieval, and as a foundation for transfer learning tasks. It's not recommended for production deployment without thorough testing and evaluation.
