DFN5B-CLIP-ViT-H-14-378

Maintained By
apple


License: Apple Sample Code License
Research Paper: Data Filtering Networks
Training Data: 5B filtered images from a 43B pool
Architecture: CLIP ViT-H-14

What is DFN5B-CLIP-ViT-H-14-378?

DFN5B-CLIP-ViT-H-14-378 is an advanced CLIP model developed by Apple that leverages Data Filtering Networks (DFNs) to train on carefully curated image-text pairs. The model was trained on 5 billion images selected from a massive pool of 43 billion uncurated pairs, using a smaller filtering network to score candidate pairs and keep only the highest-quality ones.

Implementation Details

The model implements a Vision Transformer architecture (ViT-Huge with 14x14 patches), converted from the original JAX weights to PyTorch for wider accessibility. It takes 378x378 images at inference (the final number in the model name) and provides both image and text encoders compatible with the OpenCLIP framework.

  • Trained on 39B samples at 224x224 resolution + 5B at 384x384 resolution
  • Achieves 84.2% accuracy on ImageNet-1K
  • Implements contrastive learning for zero-shot classification
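Because the model is trained contrastively, zero-shot classification reduces to comparing a normalized image embedding against normalized text embeddings of candidate labels. A minimal numpy sketch of that scoring step, with random stand-in embeddings in place of the real encoders (the 1024-dim size and the temperature value are illustrative assumptions, not the model's actual parameters):

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Score one image embedding against N candidate label embeddings.

    Both sides are L2-normalized, so the dot product is cosine
    similarity; dividing by a temperature before the softmax mirrors
    CLIP's learned logit scale.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
image_emb = rng.normal(size=1024)        # stand-in for the image encoder output
text_embs = rng.normal(size=(3, 1024))   # stand-in for 3 label prompts
probs = zero_shot_probs(image_emb, text_embs)
print(probs)
```

In practice the predicted label is simply `probs.argmax()`; the temperature controls how sharply the softmax concentrates on the best-matching prompt.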

Core Capabilities

  • Zero-shot image classification with strong performance across diverse datasets
  • Robust performance on challenging variants like ImageNet-A (79.9%) and ImageNet-R (93.8%)
  • Exceptional results on specialized datasets like Stanford Cars (96%) and Oxford-IIIT Pet (96.7%)
  • Effective text-image matching and retrieval
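Text-image retrieval follows the same pattern over a gallery: embed every image once, then rank the gallery by cosine similarity to a query text embedding. A hypothetical sketch with random placeholder embeddings (in real use, `gallery` would hold precomputed image-encoder outputs and `query` a text-encoder output):

```python
import numpy as np

def retrieve(query_emb, gallery_embs, k=5):
    """Return indices of the k gallery items most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                    # cosine similarity per gallery item
    return np.argsort(-sims)[:k]    # highest similarity first

rng = np.random.default_rng(1)
gallery = rng.normal(size=(100, 1024))              # placeholder image embeddings
query = gallery[42] + 0.1 * rng.normal(size=1024)   # query close to item 42
top = retrieve(query, gallery, k=5)
print(top)
```

Since the embeddings are normalized, the dot product ranking is equivalent to ranking by Euclidean distance, so the same precomputed gallery works with approximate-nearest-neighbor indexes at scale.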

Frequently Asked Questions

Q: What makes this model unique?

The model's uniqueness lies in its use of Data Filtering Networks to automatically curate training data from a massive pool of 43B image-text pairs, resulting in higher quality training on 5B selected pairs. This approach combines scale with quality to achieve superior performance.
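Conceptually, a data filtering network is a scoring model: it assigns each candidate image-text pair an alignment score, and only the top-scoring fraction of the pool is kept for training. A simplified sketch of that selection step, with random numbers standing in for the filter network's scores (the keep fraction reflects the stated 5B-of-43B ratio):

```python
import numpy as np

def select_top_pairs(scores, keep_fraction):
    """Keep the indices of the highest-scoring candidate pairs.

    DFN-style curation: rank the pool by the filter network's
    alignment score and retain roughly `keep_fraction` of it
    (5B kept from 43B is about 12%).
    """
    n_keep = int(len(scores) * keep_fraction)
    return np.argsort(-scores)[:n_keep]

rng = np.random.default_rng(2)
pool_scores = rng.random(1000)           # placeholder filter-network scores
kept = select_top_pairs(pool_scores, 5 / 43)
print(len(kept))
```

The filtered subset then trains the large CLIP model from scratch, which is why the quality of the small filter network has an outsized effect on the final model.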

Q: What are the recommended use cases?

The model excels in zero-shot image classification, text-image matching, and visual understanding tasks. It's particularly suitable for applications requiring robust performance across diverse domains, from fine-grained classification to general object recognition.
