CLIP-convnext_large_d.laion2B-s26B-b102K-augreg

Maintained By
laion

  • License: MIT
  • Training Dataset: LAION-2B
  • Resolution: 256x256
  • Zero-shot ImageNet Accuracy: 75.9%

What is CLIP-convnext_large_d.laion2B-s26B-b102K-augreg?

This CLIP model uses the ConvNeXt-Large architecture for its image tower and was trained on the LAION-2B dataset. It is the first known ConvNeXt CLIP model trained at scale, using roughly half the training FLOPs of previous ViT-L/14 models while delivering superior zero-shot performance.

Implementation Details

The model combines a ConvNeXt-Large image tower, topped with an MLP head, with an enhanced text tower of additional depth (16 layers, 768 embedding dimension). It is trained at 256x256 resolution with augmentation and regularization techniques including Random Resize Crop, Random Erasing, and Stochastic Depth; a loading sketch follows the list below.

  • Trained on the LAION-2B dataset with 26B samples seen
  • Uses a timm ConvNeXt-Large as the image tower
  • Adds an MLP head to the vision tower (fc-gelu-drop-fc)
  • Deepens the text tower by 4 layers relative to the standard configuration (16 total)
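
As a point of reference, here is a minimal sketch of loading a checkpoint like this through the OpenCLIP library. The hf-hub repo id is an assumption inferred from the model name and the "laion" maintainer listed above, not a detail stated in this card.

```python
import open_clip

# Load the model and its matching preprocessing transform from the Hugging Face
# Hub (repo id assumed from the model name and the "laion" maintainer).
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg"
)
tokenizer = open_clip.get_tokenizer(
    "hf-hub:laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg"
)
model.eval()  # inference mode; the aug/reg settings above apply only to training
```

The returned `preprocess` transform resizes and normalizes input images to the 256x256 resolution listed in the table above.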

Core Capabilities

  • Zero-shot image classification (75.9% top-1 ImageNet accuracy; see the sketch after this list)
  • Image-to-text and text-to-image retrieval
  • Efficient architecture with roughly half the training compute of comparable ViT-L/14 models
  • Suitable for research and experimental applications
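
To make the zero-shot classification capability concrete, the sketch below scores one image against a few free-text labels. It reuses the `model`, `preprocess`, and `tokenizer` objects from the loading sketch; the image path and label set are placeholders.

```python
import torch
from PIL import Image

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a truck"]
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so dot products become cosine similarities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```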

Frequently Asked Questions

Q: What makes this model unique?

This is the first known ConvNeXt CLIP model trained at scale. It achieves better zero-shot performance than comparable earlier models while using roughly half the training FLOPs, and it introduces augmentation and regularization techniques for the image tower (Random Resize Crop, Random Erasing, Stochastic Depth).

Q: What are the recommended use cases?

The model is primarily intended for research purposes, including zero-shot image classification, image-text retrieval, and as a foundation for transfer learning. It's not recommended for commercial deployment without thorough testing.
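
As a sketch of the retrieval use case, the snippet below ranks a handful of candidate images against a single text query with the same encoders; the file names and query string are hypothetical.

```python
import torch
from PIL import Image

paths = ["img0.jpg", "img1.jpg", "img2.jpg"]  # hypothetical candidate images
images = torch.stack([preprocess(Image.open(p)) for p in paths])
query = tokenizer(["a diagram of a transformer architecture"])  # hypothetical query

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(-1)  # one score per image

# Print candidates from best to worst match.
for i in scores.argsort(descending=True).tolist():
    print(paths[i], round(float(scores[i]), 4))
```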
