CLIP-convnext_large_d.laion2B-s26B-b102K-augreg

Property	Value
License	MIT
Training Dataset	LAION-2B
Resolution	256x256
Zero-shot ImageNet Accuracy	75.9%

What is CLIP-convnext_large_d.laion2B-s26B-b102K-augreg?

This is a CLIP model that uses the ConvNeXt-Large architecture and was trained on the LAION-2B dataset. It is the first known ConvNeXt CLIP model trained at scale, and it uses about half the training FLOPs of previous L/14 models while outperforming them, reaching 75.9% zero-shot accuracy on ImageNet.
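The model name itself carries part of the training setup. The following is a minimal sketch of decoding it, assuming the name follows the OpenCLIP-style tag convention (dataset, then `s` for samples seen, then `b` for global batch size). The helper `parse_tag` is hypothetical, and reading `102K` as exactly 102,000 is an assumption, so the step count is only a rough estimate.

```python
import re

def parse_tag(name: str) -> dict:
    """Decode an OpenCLIP-style name (assumed convention, not an official parser)."""
    arch, _, tag = name.partition(".")
    m = re.fullmatch(
        r"(?P<data>[^-]+)-s(?P<samples>\d+)B-b(?P<batch>\d+)K(?:-(?P<extra>.+))?",
        tag,
    )
    if m is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    samples_seen = int(m["samples"]) * 10**9   # "s26B" -> 26 billion samples seen
    batch_size = int(m["batch"]) * 1000        # "b102K" -> ~102,000 (assumed K = 1000)
    return {
        "architecture": arch,
        "dataset": m["data"],
        "samples_seen": samples_seen,
        "batch_size": batch_size,
        "extra": m["extra"],
        # Rough number of optimizer steps implied by the two figures above.
        "approx_steps": samples_seen // batch_size,
    }

info = parse_tag("CLIP-convnext_large_d.laion2B-s26B-b102K-augreg")
print(info)
```

Under these assumptions the run corresponds to roughly a quarter of a million optimizer steps.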

Implementation Details

The model combines a ConvNeXt-Large image tower, topped with an MLP head, with a deeper text tower (16 layers, 768 embedding dimension). It is trained at 256x256 resolution with image augmentation and regularization: Random Resize Crop, Random Erasing, and Stochastic Depth.

Trained on the LAION-2B dataset with 26B samples seen
Uses the timm ConvNeXt-Large model as the image tower
Uses an MLP head in the vision tower (fc-gelu-drop-fc)
Text tower deepened by 4 layers (16 layers in total)
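The fc-gelu-drop-fc head listed above can be sketched in plain Python. This is an illustrative sketch only, not the model's actual implementation: the function names, the tiny layer sizes, and the dropout rate of 0.1 are all assumptions chosen for readability.

```python
import math
import random

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def dropout(x, p, training, rng):
    """Inverted dropout: identity at inference, rescaled random mask in training."""
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

def mlp_head(x, fc1, fc2, drop_p=0.1, training=False, rng=None):
    """fc -> gelu -> drop -> fc, applied to a pooled image feature vector."""
    rng = rng or random.Random(0)
    hidden = [gelu(h) for h in linear(x, *fc1)]
    hidden = dropout(hidden, drop_p, training, rng)
    return linear(hidden, *fc2)

# Toy weights (hypothetical): 2 -> 2 -> 1
fc1 = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
fc2 = ([[1.0, 1.0]], [0.0])
out = mlp_head([1.0, -1.0], fc1, fc2)
print(out)
```

Compared with a single linear projection, the extra nonlinearity and dropout give the head more capacity and some regularization before features reach the shared embedding space.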

Core Capabilities

Zero-shot image classification with 75.9% ImageNet accuracy
Image and text retrieval
Efficient architecture, trained with about half the FLOPs of previous L/14 models
Suitable for research and experimental applications
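Zero-shot classification with a CLIP model reduces to comparing one image embedding against one text embedding per candidate label. The sketch below shows that scoring step on toy vectors. The embeddings are hypothetical stand-ins for the outputs of the image and text towers, and the `logit_scale` of 100 is an assumed value, not one read from this checkpoint.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N label prompts."""
    img = normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, normalize(t)))
        for t in text_embs
    ]
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy text embeddings
image_emb = [0.9, 0.3, 0.1]                                       # toy image embedding

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best, probs)
```

Image-text retrieval uses the same similarity scores, ranking a collection of images for one text query (or the reverse) instead of applying a softmax over labels.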

Frequently Asked Questions

Q: What makes this model unique?

This is the first known ConvNeXt CLIP model trained at scale, and it outperforms previous L/14 models while using about half the training FLOPs. Its image tower is trained with additional augmentation and regularization: Random Resize Crop, Random Erasing, and Stochastic Depth.
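Stochastic Depth, one of the regularizers listed under Implementation Details, randomly skips a residual branch during training. The following is a minimal sketch of the common per-sample formulation with rescaling. The function name and the drop probability of 0.1 are assumptions for illustration, not values confirmed for this model.

```python
import random

def stochastic_depth_residual(x, branch, drop_prob, training, rng):
    """Residual block output x + branch(x), with the branch randomly dropped in training."""
    if not training or drop_prob == 0.0:
        # Inference: always keep the branch, no rescaling needed.
        return [xi + bi for xi, bi in zip(x, branch(x))]
    if rng.random() < drop_prob:
        # Branch dropped: the block acts as the identity for this sample.
        return list(x)
    keep = 1.0 - drop_prob
    # Branch kept: divide by the keep probability so the expected output matches inference.
    return [xi + bi / keep for xi, bi in zip(x, branch(x))]

def double(v):
    """Toy residual branch (hypothetical): doubles every element."""
    return [2.0 * vi for vi in v]

rng = random.Random(0)
x = [1.0, 2.0]
eval_out = stochastic_depth_residual(x, double, 0.1, training=False, rng=rng)
train_out = stochastic_depth_residual(x, double, 0.1, training=True, rng=rng)
print(eval_out, train_out)
```

Dropping whole branches shortens the effective network depth for some training samples, which acts as a regularizer on deep image towers.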

Q: What are the recommended use cases?

The model is primarily intended for research purposes, including zero-shot image classification, image-text retrieval, and as a foundation for transfer learning. It's not recommended for commercial deployment without thorough testing.