EVA02 Small Patch14 336
| Property | Value |
|---|---|
| Parameter Count | 22.1M |
| Image Size | 336x336 |
| License | MIT |
| Paper | EVA-02: A Visual Representation for Neon Genesis |
| Top-1 Accuracy | 85.74% |
What is eva02_small_patch14_336.mim_in22k_ft_in1k?
This is a small-scale variant of the EVA02 vision transformer architecture, designed for efficient image classification and feature extraction. It was first pre-trained on ImageNet-22k with masked image modeling, using EVA-CLIP as the teacher, and then fine-tuned on ImageNet-1k for classification.
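A minimal inference sketch with the timm library is shown below; the model name matches this card, while the image path and the top-5 printout are illustrative placeholders.

```python
import timm
import torch
from PIL import Image

# Load the pretrained checkpoint (weights are downloaded on first use).
model = timm.create_model(
    "eva02_small_patch14_336.mim_in22k_ft_in1k", pretrained=True
)
model.eval()

# Build the preprocessing pipeline that matches the model's 336x336 input.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

img = Image.open("example.jpg").convert("RGB")  # placeholder image path
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: (1, 1000)
    top5 = logits.softmax(dim=-1).topk(5)

print(top5.indices, top5.values)
```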
Implementation Details
The model implements several key architectural features, including mean pooling, SwiGLU activation functions, and Rotary Position Embeddings (RoPE). It processes 336x336 images as 14x14 patches, requiring roughly 15.5 GMACs per forward pass with 54.3M activations. A feature-extraction sketch follows the list below.
- Pre-trained on ImageNet-22k dataset
- Fine-tuned on ImageNet-1k
- Uses mean pooling for feature aggregation
- Implements SwiGLU activation
- Incorporates Rotary Position Embeddings
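The following sketch shows how to obtain the mean-pooled embedding, assuming the timm API: creating the model with `num_classes=0` drops the classifier head so the forward pass returns pooled features, while `forward_features` returns the unpooled token sequence. The image path is a placeholder.

```python
import timm
import torch
from PIL import Image

# num_classes=0 removes the classifier head; the model then returns the
# mean-pooled embedding directly.
model = timm.create_model(
    "eva02_small_patch14_336.mim_in22k_ft_in1k", pretrained=True, num_classes=0
)
model.eval()

data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

x = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    pooled = model(x)                   # (1, embed_dim) pooled features
    tokens = model.forward_features(x)  # (1, num_tokens, embed_dim) unpooled

print(pooled.shape, tokens.shape)
```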
Core Capabilities
- Image Classification
- Feature Extraction
- Transfer Learning
- Visual Representation Learning
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its efficient architecture that balances performance and size, achieving 85.74% top-1 accuracy on ImageNet-1k while maintaining a relatively small parameter count of 22.1M. It incorporates modern architectural improvements like RoPE and SwiGLU, making it effective for both classification and feature extraction tasks.
Q: What are the recommended use cases?
The model is well-suited for image classification tasks, particularly when working with high-resolution images (336x336). It's also effective for feature extraction in transfer learning scenarios, making it valuable for downstream computer vision tasks where pre-trained visual representations are needed.
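For transfer learning, one common pattern is to re-create the model with a fresh classifier head sized for the new task and train only that head. The sketch below is illustrative only: the class count, the frozen-backbone choice, and the dummy batch are assumptions, not recommendations from this card.

```python
import timm
import torch

NUM_CLASSES = 10  # placeholder for your downstream task

# Fresh classifier head for the new task; backbone weights still come from
# the pretrained checkpoint.
model = timm.create_model(
    "eva02_small_patch14_336.mim_in22k_ft_in1k",
    pretrained=True,
    num_classes=NUM_CLASSES,
)

# Optionally freeze the backbone and train only the new head.
for name, param in model.named_parameters():
    if "head" not in name:
        param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative training step on dummy data.
model.train()
images = torch.randn(2, 3, 336, 336)
labels = torch.randint(0, NUM_CLASSES, (2,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```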