CLIP-ViT-B-32-laion2B-s34B-b79K

Maintained By
laion

  • Parameter Count: 151M
  • License: MIT
  • Framework: PyTorch
  • ImageNet Accuracy: 66.6% (zero-shot)
  • Training Data: LAION-2B

What is CLIP-ViT-B-32-laion2B-s34B-b79K?

This is a Vision Transformer (ViT) based CLIP model trained on the LAION-2B English subset of the LAION-5B dataset. Developed by LAION and trained on stability.ai's infrastructure, it is a vision-language model capable of zero-shot image classification and multimodal understanding.
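
Loading the checkpoint through OpenCLIP is the most direct route. Below is a minimal sketch assuming the open-clip-torch package is installed; the pretrained tag shown should correspond to this checkpoint, but verify it against your open_clip release.

```python
# Minimal loading sketch, assuming the open-clip-torch package is installed.
import open_clip
import torch

# "laion2b_s34b_b79k" is the OpenCLIP pretrained tag for this checkpoint;
# "hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K" is an alternative model name.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()  # inference only; no gradient updates needed for zero-shot use
```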

Implementation Details

The model uses a ViT-B/32 architecture implemented with the OpenCLIP framework. It has 151M parameters, and its published weights contain a mix of integer (I64) and floating-point (F32) tensors. A zero-shot inference sketch follows the list below.

  • Trained on 2 billion English image-text pairs
  • Implements Vision Transformer architecture with 32x32 patch size
  • Supports both direct use and downstream applications
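
As a concrete illustration of the zero-shot workflow, the sketch below reuses the model, preprocess, and tokenizer objects from the loading example; the image path and prompt list are illustrative placeholders.

```python
# Zero-shot classification sketch; "cat.jpg" and the prompts are placeholders.
from PIL import Image
import torch

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # hypothetical local file
prompts = ["a photo of a cat", "a photo of a dog", "a diagram"]
text = tokenizer(prompts)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scaled cosine similarity -> probability distribution over the prompt set
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({p: float(s) for p, s in zip(prompts, probs[0])})
```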

Core Capabilities

  • Zero-shot image classification
  • Image and text retrieval
  • Transfer learning for downstream tasks
  • Image classification fine-tuning
  • Linear probe classification (a sketch follows this list)
  • Image generation guidance
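
For the linear-probe use listed above, a common pattern is to extract frozen image embeddings and fit a lightweight classifier on top. The sketch below assumes a hypothetical train_loader yielding batches of preprocessed image tensors and integer labels, and uses scikit-learn's LogisticRegression as the probe.

```python
# Linear-probe sketch: freeze the CLIP image encoder, extract features,
# and fit a simple classifier on top. `train_loader` is a hypothetical
# DataLoader yielding (images, labels) batches of preprocessed tensors.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

features, labels = [], []
with torch.no_grad():
    for images, targets in train_loader:  # hypothetical labeled dataset
        feats = model.encode_image(images)
        feats /= feats.norm(dim=-1, keepdim=True)
        features.append(feats.cpu().numpy())
        labels.append(targets.numpy())

probe = LogisticRegression(max_iter=1000)
probe.fit(np.concatenate(features), np.concatenate(labels))
```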

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its training on the large-scale LAION-2B dataset and for its zero-shot classification performance (66.6% on ImageNet-1k) at a relatively compact size of 151M parameters.

Q: What are the recommended use cases?

The model is primarily intended for research purposes, specifically in zero-shot image classification and multimodal understanding. It's suitable for controlled environment testing but not recommended for direct deployment without thorough evaluation.
