DFN5B-CLIP-ViT-H-14

Maintained By
apple


Property            Value
Author              Apple
License             Apple Sample Code License
Paper               Data Filtering Networks
Training Data       5B filtered images from 43B pairs
ImageNet Accuracy   83.44%

What is DFN5B-CLIP-ViT-H-14?

DFN5B-CLIP-ViT-H-14 is a powerful CLIP model developed by Apple that leverages Data Filtering Networks (DFNs) to automatically curate training data. The model was trained on 5 billion images carefully filtered from a massive pool of 43 billion uncurated image-text pairs, including CommonPool-12.8B and additional public datasets.

Implementation Details

The model implements a ViT-H-14 architecture and has been converted from JAX to PyTorch for wider accessibility. It's designed for contrastive image-text learning and zero-shot image classification, achieving impressive results across multiple benchmarks.

  • Trained on filtered data using DFN technology
  • Supports both image and text encodings
  • Compatible with OpenCLIP framework
  • Achieves 83.44% accuracy on ImageNet-1K
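Because the weights are OpenCLIP-compatible, loading and zero-shot classification follow the usual `open_clip` pattern. The sketch below assumes `open_clip`, `torch`, and `Pillow` are installed and that the weights are published under the `hf-hub:apple/DFN5B-CLIP-ViT-H-14` identifier; the prompt template is a common CLIP convention, not something specified by this card.

```python
def build_prompts(labels):
    """Wrap class names in a simple prompt template (a common CLIP convention)."""
    return [f"a photo of a {label}" for label in labels]


def zero_shot_classify(image_path, labels):
    """Sketch of zero-shot classification with OpenCLIP.

    Assumes open_clip/torch/Pillow are installed and the checkpoint is
    available under the hub id below; adjust to your environment.
    """
    import torch
    import open_clip
    from PIL import Image

    model, _, preprocess = open_clip.create_model_and_transforms(
        "hf-hub:apple/DFN5B-CLIP-ViT-H-14"
    )
    tokenizer = open_clip.get_tokenizer("hf-hub:apple/DFN5B-CLIP-ViT-H-14")

    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer(build_prompts(labels))

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Normalize so the dot product is a cosine similarity.
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    return dict(zip(labels, probs.squeeze(0).tolist()))


# Example call (downloads the full ViT-H-14 checkpoint, so not run here):
# zero_shot_classify("cat.jpg", ["cat", "dog", "car"])
```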

Core Capabilities

  • Zero-shot image classification
  • Contrastive image-text learning
  • High performance on diverse datasets (98.9% on STL-10, 95.7% on Stanford Cars)
  • Robust cross-domain generalization
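The zero-shot mechanism behind these results is simple: the image embedding is compared against one text embedding per class via cosine similarity, and a softmax turns the similarities into class probabilities. A minimal NumPy sketch of that scoring step, using toy vectors rather than real model outputs:

```python
import numpy as np


def zero_shot_scores(image_emb, text_embs, temperature=100.0):
    """Score one image embedding against per-class text embeddings.

    L2-normalizes both sides so the dot product is cosine similarity,
    then applies a softmax (CLIP scales logits by ~100 at inference).
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * text_embs @ image_emb
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()


# Toy example: the image embedding points in the direction of class 0.
image = np.array([1.0, 0.0, 0.0])
classes = np.array([[0.9, 0.1, 0.0],   # close to the image -> high score
                    [0.0, 1.0, 0.0],   # orthogonal -> low score
                    [0.0, 0.0, 1.0]])
probs = zero_shot_scores(image, classes)
```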

Frequently Asked Questions

Q: What makes this model unique?

This model stands out due to its use of Data Filtering Networks (DFNs) to automatically curate training data, resulting in highly refined training on 5B images from a much larger pool. This approach leads to superior performance across various benchmarks while maintaining efficiency.
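At a high level, a data filtering network scores candidate image-text pairs and keeps only the best-scoring fraction of the pool. The toy sketch below illustrates that selection step; the scores here are arbitrary stand-ins for what a CLIP-style filter model would produce, and the ~12% keep rate simply mirrors 5B kept out of 43B.

```python
import numpy as np


def filter_pairs(scores, keep_fraction):
    """Keep the indices of the top-scoring fraction of image-text pairs.

    In DFN training, `scores` would come from a small filter model judging
    image-text alignment; here they are arbitrary toy numbers.
    """
    n_keep = max(1, int(len(scores) * keep_fraction))
    # argsort is ascending, so the last n_keep indices have the highest scores
    return np.argsort(scores)[-n_keep:]


# Toy pool of 10 candidate pairs; keep ~12% (5B out of 43B is ~11.6%).
scores = np.array([0.2, 0.9, 0.1, 0.8, 0.5, 0.3, 0.95, 0.4, 0.6, 0.05])
kept = filter_pairs(scores, keep_fraction=0.12)
```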

Q: What are the recommended use cases?

The model excels in zero-shot image classification, visual-semantic understanding, and cross-modal tasks. It's particularly suitable for applications requiring robust image understanding without task-specific training, such as content organization, visual search, and automated tagging systems.
