CLIP-convnext_base_w-laion2B-s13B-b82K-augreg

Maintained By
laion


License: MIT
Training Dataset: LAION-2B
Base Architecture: ConvNeXt-Base
Resolution: 256x256
Paper: ConvNeXt Paper

What is CLIP-convnext_base_w-laion2B-s13B-b82K-augreg?

This model is a CLIP model that uses ConvNeXt-Base as its image tower in place of the more common ViT or ResNet designs. It was trained on the LAION-2B dataset with increased augmentation and regularization (the "augreg" in its name), reaching 71.5% ImageNet zero-shot accuracy. It stands out for its training efficiency, achieving this performance with only 13B samples seen, fewer than comparable CLIP models.

Implementation Details

The model applies several regularization and augmentation techniques: Random Resize Crop with a scale range of (0.33, 1.0), Random Erasing with probability 0.35, and Stochastic Depth of 0.1. It was trained with a global batch size of 81920 using amp_bfloat16 precision.

  • Trained at 256x256 resolution
  • Uses timm ConvNeXt-Base architecture
  • Implements text tower similar to RN50x4 (depth 12, embed dim 640)
  • Features enhanced augmentation techniques for better generalization

Core Capabilities

  • Zero-shot image classification
  • Image and text retrieval
  • Fine-tuning for specific image tasks
  • Linear probe image classification
  • Image generation guidance
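The zero-shot classification capability above works by comparing one image embedding against one text embedding per candidate class prompt. A minimal sketch of that principle, using toy vectors in place of the model's real embeddings and a fixed stand-in for CLIP's learned logit scale (real usage goes through the open_clip library):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled image-text cosine similarities.

    logit_scale is a stand-in for CLIP's learned temperature.
    """
    logits = [logit_scale * cosine(image_emb, t) for t in text_embs]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0],   # e.g. "a photo of a dog"
             [0.0, 1.0, 0.0]]   # e.g. "a photo of a cat"
probs = zero_shot_probs(image_emb, text_embs)
# probs[0] > probs[1]: the image embedding is closest to the first prompt
```

Retrieval uses the same similarity computation, ranking candidates by cosine score instead of softmaxing over a closed class set.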

Frequently Asked Questions

Q: What makes this model unique?

This is the first known ConvNeXt CLIP model trained at scale, matching the performance of CLIP ViT-B/16 and RN50x4 models. It is also the first to explore increased augmentation and regularization for the image tower at this scale.

Q: What are the recommended use cases?

The model is best suited for research purposes, particularly in zero-shot image classification and text-image retrieval tasks. It's not recommended for deployed production use without thorough testing and evaluation.
