CLIP-convnext_base_w-laion2B-s13B-b82K-augreg

Maintained By
laion


License: MIT
Training Dataset: LAION-2B
Base Architecture: ConvNeXt-Base
Resolution: 256x256
Paper: ConvNeXt Paper

What is CLIP-convnext_base_w-laion2B-s13B-b82K-augreg?

This model is a CLIP model that uses ConvNeXt-Base as its image tower in place of the more common ViT or ResNet designs. It was trained on the LAION-2B dataset with increased augmentation and regularization (the "augreg" in its name), reaching 71.5% ImageNet zero-shot accuracy. It stands out for its training efficiency, achieving this performance with only 13B samples seen, fewer than comparable CLIP models.

Implementation Details

The model applies several regularization and augmentation techniques: Random Resize Crop with a scale range of (0.33, 1.0), Random Erasing with probability 0.35, and Stochastic Depth of 0.1. It was trained with a global batch size of 81920 using amp_bfloat16 precision.

  • Trained at 256x256 resolution
  • Uses timm ConvNeXt-Base architecture
  • Implements text tower similar to RN50x4 (depth 12, embed dim 640)
  • Features enhanced augmentation techniques for better generalization

Core Capabilities

  • Zero-shot image classification
  • Image and text retrieval
  • Fine-tuning for specific image tasks
  • Linear probe image classification
  • Image generation guidance
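The zero-shot classification capability above works by comparing one image embedding against one text embedding per candidate class prompt. A minimal sketch of that principle, using toy vectors in place of the model's real embeddings and a fixed stand-in for CLIP's learned logit scale (real usage goes through the open_clip library):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled image-text cosine similarities.

    logit_scale is a stand-in for CLIP's learned temperature.
    """
    logits = [logit_scale * cosine(image_emb, t) for t in text_embs]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0],   # e.g. "a photo of a dog"
             [0.0, 1.0, 0.0]]   # e.g. "a photo of a cat"
probs = zero_shot_probs(image_emb, text_embs)
# probs[0] > probs[1]: the image embedding is closest to the first prompt
```

Retrieval uses the same similarity computation, ranking candidates by cosine score instead of softmaxing over a closed class set.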

Frequently Asked Questions

Q: What makes this model unique?

This is the first known ConvNeXt CLIP model trained at scale, matching the performance of CLIP ViT-B/16 and RN50x4 models. It is also the first to explore increased augmentation and regularization for the image tower at this scale.

Q: What are the recommended use cases?

The model is best suited for research purposes, particularly in zero-shot image classification and text-image retrieval tasks. It's not recommended for deployed production use without thorough testing and evaluation.
