deit_small_patch16_224.fb_in1k

Maintained By
timm

DeiT Small Patch16 224

Parameter Count: 22.1M
Image Size: 224 x 224
License: Apache-2.0
Paper: Training data-efficient image transformers & distillation through attention
Dataset: ImageNet-1k

What is deit_small_patch16_224.fb_in1k?

DeiT (Data-efficient image Transformers) is a vision transformer model designed for efficient image classification. This small variant divides each 224x224 input image into 16x16 patches and processes them with transformer self-attention, offering a practical trade-off between accuracy and computational cost.
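
A minimal inference sketch using the standard timm API is shown below; the image path "example.jpg" is a placeholder.

```python
# Minimal classification sketch with timm; "example.jpg" is a placeholder path.
import timm
import torch
from PIL import Image

model = timm.create_model("deit_small_patch16_224.fb_in1k", pretrained=True)
model.eval()

# Build the preprocessing pipeline from the model's pretrained data config.
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

img = Image.open("example.jpg").convert("RGB")  # placeholder image
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: (1, 1000)

top5 = logits.softmax(dim=-1).topk(5)
print(top5.indices, top5.values)
```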

Implementation Details

The model architecture follows the Vision Transformer (ViT) framework, trained with the DeiT recipe that introduces distillation through attention. It processes 224x224 pixel images by dividing them into 16x16 patches, yielding 196 patch tokens plus one classification token (197 tokens in total; the sketch after the list below verifies this arithmetic).

  • 22.1M trainable parameters
  • 4.6 GMACs per forward pass
  • 11.9M activations
  • Pretrained on the ImageNet-1k dataset
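
As a sanity check on the figures above, the token arithmetic and parameter count can be verified directly; this snippet is illustrative and not part of the original model card.

```python
# Verify the token arithmetic: 224 / 16 = 14, so 14 * 14 = 196 patches,
# plus one classification token = 197 tokens per image.
image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2   # 196
num_tokens = num_patches + 1                    # 197
print(num_patches, num_tokens)

# Cross-check the parameter count against the table above.
import timm
model = timm.create_model("deit_small_patch16_224.fb_in1k", pretrained=False)
print(sum(p.numel() for p in model.parameters()) / 1e6)  # ~22.1M parameters
```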

Core Capabilities

  • Image Classification with high efficiency
  • Feature extraction backbone
  • Supports both classification and embedding generation (see the sketch after this list)
  • Efficient inference with 224x224 resolution images
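
A brief feature-extraction sketch: in timm, creating the model with num_classes=0 returns pooled embeddings instead of logits, and forward_features exposes per-token features. The input below is a dummy tensor standing in for a preprocessed batch.

```python
# Feature-extraction sketch; x is a dummy stand-in for preprocessed images.
import timm
import torch

model = timm.create_model(
    "deit_small_patch16_224.fb_in1k", pretrained=True, num_classes=0
)
model.eval()

x = torch.randn(1, 3, 224, 224)  # dummy preprocessed batch
with torch.no_grad():
    embedding = model(x)                # pooled embedding, shape (1, 384)
    tokens = model.forward_features(x)  # per-token features, shape (1, 197, 384)
print(embedding.shape, tokens.shape)
```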

Frequently Asked Questions

Q: What makes this model unique?

The model combines the modeling capacity of transformers with a data-efficient training strategy based on attention distillation, making it practical for real-world applications while maintaining strong performance on ImageNet-1k.

Q: What are the recommended use cases?

This model is ideal for image classification tasks, feature extraction, and as a backbone for downstream computer vision tasks. It's particularly suitable for applications requiring a good balance between model size and performance.
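
For downstream tasks, timm can re-create the model with a fresh classification head sized for the target labels. The sketch below wires up fine-tuning for a hypothetical 10-class problem; the class count and hyperparameters are illustrative placeholders, not recommendations from the model card.

```python
# Transfer-learning sketch; NUM_CLASSES and hyperparameters are illustrative.
import timm
import torch

NUM_CLASSES = 10  # hypothetical downstream label count
model = timm.create_model(
    "deit_small_patch16_224.fb_in1k", pretrained=True, num_classes=NUM_CLASSES
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

# One dummy training step on random data to show the wiring.
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, NUM_CLASSES, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(float(loss))
```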
