DeiT Base Patch16 224
| Property | Value |
|---|---|
| Parameter Count | 86.6M |
| License | Apache 2.0 |
| Paper | Training data-efficient image transformers & distillation through attention |
| Image Size | 224x224 |
| GMACs | 17.6 |
What is deit_base_patch16_224.fb_in1k?
DeiT (Data-efficient image Transformers) is a vision transformer model trained on ImageNet-1k that achieves strong performance while being more data-efficient than traditional vision transformers. This particular variant uses 16x16 pixel patches and processes 224x224 input images; the `.fb_in1k` suffix identifies the original Facebook Research ImageNet-1k weights as packaged in the `timm` library.
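A minimal inference sketch using the `timm` library (assumes a recent `timm`, `torch`, and Pillow install, plus a local image such as `cat.jpg` that you supply):

```python
import torch
import timm
from PIL import Image

# Load the pretrained model in evaluation mode.
model = timm.create_model('deit_base_patch16_224.fb_in1k', pretrained=True)
model.eval()

# Build the preprocessing pipeline from the model's pretrained config
# (resize/crop to 224x224 plus the expected normalization).
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

img = Image.open('cat.jpg').convert('RGB')   # hypothetical input image
x = transform(img).unsqueeze(0)              # shape: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(x)                        # shape: (1, 1000), ImageNet-1k classes

top5 = logits.softmax(dim=-1).topk(5)
print(top5.indices, top5.values)
```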
Implementation Details
The model implements a vision transformer architecture with attention-based distillation, allowing it to learn efficiently from limited data. It processes images by splitting them into 16x16 patches, embedding them, and passing them through transformer layers.
- 86.6M total parameters
- 17.6 GMACs computational complexity
- 23.9M activations
- Supports both classification and feature extraction
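As a shape-level illustration of the patch-splitting step listed above (not the model's actual embedding code, which uses a learned projection plus a class token and positional embeddings), the following sketch shows how a 224x224 image decomposes into 196 flattened 16x16 patches:

```python
import torch

img = torch.randn(1, 3, 224, 224)   # one RGB image at the model's input size
patch = 16

# 224 / 16 = 14 patches per side -> 14 * 14 = 196 patches per image
grid = img.shape[-1] // patch        # 14
num_patches = grid * grid            # 196

# Split the image into non-overlapping 16x16 patches and flatten each one.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)

print(patches.shape)  # torch.Size([1, 196, 768]); 3 * 16 * 16 = 768 values per patch
```

Incidentally, each flattened patch carries 768 raw values, the same dimensionality the base model uses for its token embeddings.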
Core Capabilities
- Image classification on ImageNet-1k classes
- Feature extraction for downstream tasks
- Efficient training through attention-based knowledge distillation
- Support for F32 tensor operations
Frequently Asked Questions
Q: What makes this model unique?
DeiT's uniqueness lies in its data-efficient training approach using attention-based distillation, allowing it to achieve strong performance with less training data than traditional vision transformers.
Q: What are the recommended use cases?
This model is ideal for image classification tasks, particularly when working with ImageNet-like datasets. It can also be used as a feature extractor for transfer learning applications, with the ability to output embeddings by removing the classification head.
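A sketch of that feature-extraction path, assuming the `timm` API: passing `num_classes=0` at creation time drops the classification head, so the forward pass returns pooled embeddings rather than class logits.

```python
import torch
import timm

# Create the model without its classification head.
model = timm.create_model('deit_base_patch16_224.fb_in1k', pretrained=True, num_classes=0)
model.eval()

x = torch.randn(1, 3, 224, 224)       # stand-in for a preprocessed image batch

with torch.no_grad():
    embeddings = model(x)              # pooled features, shape (1, 768)
    tokens = model.forward_features(x) # unpooled token sequence (class token + patch tokens)

print(embeddings.shape, tokens.shape)
```

The pooled embeddings can be fed directly to a downstream classifier or similarity search index, while the unpooled token sequence is useful when per-patch features are needed.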