CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup

laion

A high-performance CLIP model using the ConvNeXt-Large architecture, trained on the LAION-2B dataset. It is a weight-averaged "soup" of fine-tuned checkpoints and reaches 76.9% zero-shot accuracy on ImageNet.

Property | Value
License | MIT
Base Architecture | ConvNeXt-Large
Training Data | LAION-2B
Resolution | 320x320
ImageNet Zero-Shot Accuracy | 76.9%

What is CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup?

This is a CLIP model that pairs a ConvNeXt-Large image tower with a 16-layer text tower. It is a "model soup": its weights are the average of three fine-tuned versions of the base model, each trained with a different learning rate and a different number of samples. The model operates at 320x320 resolution and is more efficient than the OpenAI L/14 model at 336x336, requiring significantly fewer computations while maintaining high zero-shot accuracy.
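The weight averaging behind a model soup can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' tooling: the function name `average_checkpoints`, the parameter names, and the toy values are all hypothetical, and real checkpoints would hold tensors rather than Python lists.

```python
def average_checkpoints(checkpoints):
    """Return the element-wise mean of several checkpoints (a uniform soup).

    Each checkpoint is a dict mapping a parameter name to a flat list of floats.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")
    soup = {}
    for name in names:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        soup[name] = [sum(col) / len(checkpoints) for col in columns]
    return soup


# Three toy "fine-tunes" of the same tiny model (values are made up).
fine_tunes = [
    {"head.weight": [1.0, 2.0], "head.bias": [0.0]},
    {"head.weight": [2.0, 4.0], "head.bias": [3.0]},
    {"head.weight": [3.0, 6.0], "head.bias": [6.0]},
]
soup = average_checkpoints(fine_tunes)
print(soup)
```

Because all fine-tunes start from the same base model, their parameters share names and shapes, which is what makes a plain element-wise mean well defined.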

Implementation Details

The architecture consists of a ConvNeXt-Large image tower topped with an MLP head, and a text tower with 16 layers and an embedding dimension of 768. It is trained on the LAION-2B dataset, with augmentation and regularization that include Random Resize Crop, Random Erasing, and Stochastic Depth.

  • Vision tower: ConvNeXt-Large with MLP head (fc-gelu-drop-fc)
  • Text tower: 16 layers depth, 768 embedding dimensions
  • Training batch size: 131,072
  • Fine-tuning samples: ~2-3B per fine-tuned model

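The fc-gelu-drop-fc head listed above can be read as a small feed-forward block. The sketch below shows only the order of operations in inference mode. The layer sizes and weights are hypothetical toy values, and the exact (erf-based) GELU is an assumption, since the text does not say which GELU variant is used.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Fully connected layer; weight holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_head(x, w1, b1, w2, b2):
    """fc -> gelu -> drop -> fc, evaluated in inference mode."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    # Dropout is the identity at inference time, so nothing is applied here.
    return linear(hidden, w2, b2)


# Toy sizes (hypothetical): 2 pooled features -> 3 hidden units -> 2 outputs.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
b2 = [0.0, 0.0]
embedding = mlp_head([0.0, 0.0], w1, b1, w2, b2)
print(embedding)
```

During training, the dropout step would randomly zero hidden activations between the two fully connected layers.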
Core Capabilities

  • Zero-shot image classification (76.9% ImageNet accuracy)
  • Image and text retrieval
  • Transfer learning for downstream tasks
  • Image classification fine-tuning
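Zero-shot classification with a CLIP-style model comes down to comparing one image embedding against the text embeddings of candidate labels. The sketch below assumes the embeddings have already been produced by the two towers. The two-dimensional vectors, the prompts, and the `logit_scale` value of 100 are illustrative assumptions, not values taken from this model.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts."""
    image_emb = normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(image_emb, normalize(t)))
        for t in text_embs
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical prompts and toy embeddings standing in for real tower outputs.
labels = ["a photo of a cat", "a photo of a dog"]
text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [0.9, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)
```

The same similarity scores serve retrieval: ranking many images against one text embedding (or many texts against one image) orders the candidates by relevance.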

Frequently Asked Questions

Q: What makes this model unique?

This model's distinguishing feature is its "soup" approach, which averages the weights of three models fine-tuned with different learning rates. At 320x320 resolution it is also more efficient than the OpenAI L/14 model at 336x336, requiring significantly fewer computations.

Q: What are the recommended use cases?

The model is best suited for research applications in zero-shot image classification, image-text retrieval, and as a foundation for transfer learning tasks. It's not recommended for production deployment without thorough testing and evaluation.
