DFN5B-CLIP-ViT-H-14-378
| Property | Value |
| --- | --- |
| License | Apple Sample Code License |
| Research Paper | Data Filtering Networks |
| Training Data | 5B filtered images from 43B pool |
| Architecture | CLIP ViT-H-14 |
What is DFN5B-CLIP-ViT-H-14-378?
DFN5B-CLIP-ViT-H-14-378 is a CLIP model developed by Apple that uses Data Filtering Networks (DFNs) to curate its training data. The model was trained on 5 billion image-text pairs selected from a pool of 43 billion uncurated pairs, with a learned filtering network scoring and selecting the highest-quality examples.
Implementation Details
The model implements a Vision Transformer (ViT) architecture, specifically ViT-Huge with 14x14 pixel patches, converted from the original JAX weights to PyTorch for wider accessibility. It processes images at 378x378 resolution and provides both image and text encoders compatible with the OpenCLIP framework (a loading sketch follows the list below).
- Trained on 39B samples at 224x224 resolution, followed by 5B samples at 378x378 resolution
- Achieves 84.2% zero-shot accuracy on ImageNet-1K
- Trained with a contrastive image-text objective, enabling zero-shot classification
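As a minimal loading sketch, assuming the converted weights are published on the Hugging Face Hub as apple/DFN5B-CLIP-ViT-H-14-378 and that the open_clip_torch package is installed, the model can be pulled through OpenCLIP's hub integration:

```python
from open_clip import create_model_from_pretrained, get_tokenizer

# Hub id assumed to be apple/DFN5B-CLIP-ViT-H-14-378; adjust if the checkpoint
# is hosted elsewhere. The returned preprocess transform handles resizing to 378x378.
model, preprocess = create_model_from_pretrained("hf-hub:apple/DFN5B-CLIP-ViT-H-14-378")
tokenizer = get_tokenizer("ViT-H-14")
model.eval()
```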
Core Capabilities
- Zero-shot image classification with strong performance across diverse datasets (a worked sketch follows this list)
- Robust performance on challenging variants like ImageNet-A (79.9%) and ImageNet-R (93.8%)
- Exceptional results on specialized datasets like Stanford Cars (96%) and Oxford-IIIT Pet (96.7%)
- Effective text-image matching and retrieval
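As one illustrative sketch of the zero-shot classification flow, reusing the `model`, `preprocess`, and `tokenizer` objects from the loading sketch above (the image path and label prompts are placeholders):

```python
import torch
import torch.nn.functional as F
from PIL import Image

# Placeholder inputs: any RGB image and a free-form set of label prompts.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = tokenizer(labels)

with torch.no_grad():
    image_features = F.normalize(model.encode_image(image), dim=-1)
    text_features = F.normalize(model.encode_text(text), dim=-1)
    # Cosine similarity, scaled by the learned temperature, softmaxed over labels.
    probs = (model.logit_scale.exp() * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```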
Frequently Asked Questions
Q: What makes this model unique?
The model's uniqueness lies in its use of Data Filtering Networks to automatically curate training data from a massive pool of 43B image-text pairs, resulting in higher quality training on 5B selected pairs. This approach combines scale with quality to achieve superior performance.
Q: What are the recommended use cases?
The model excels in zero-shot image classification, text-image matching, and visual understanding tasks. It's particularly suitable for applications requiring robust performance across diverse domains, from fine-grained classification to general object recognition.
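For text-to-image retrieval, one simple approach, again building on the objects loaded earlier and using hypothetical local image paths, is to embed a gallery of images once and rank it against a text query by cosine similarity:

```python
import torch
import torch.nn.functional as F
from PIL import Image

# Hypothetical gallery of local images to search over.
paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in paths])

with torch.no_grad():
    gallery = F.normalize(model.encode_image(images), dim=-1)
    query = F.normalize(model.encode_text(tokenizer(["a red sports car"])), dim=-1)
    # Cosine similarity between the query and every gallery image.
    scores = (query @ gallery.T).squeeze(0)

# Print paths ranked from most to least similar to the query.
for idx in scores.argsort(descending=True):
    print(paths[int(idx)], float(scores[idx]))
```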