MaxViT Base TF 512 Model
| Property | Value |
|---|---|
| Parameter Count | 120M |
| Model Type | Image Classification |
| License | Apache 2.0 |
| Input Resolution | 512x512 |
| Top-1 Accuracy | 88.20% |
| Paper | MaxViT: Multi-Axis Vision Transformer |
What is maxvit_base_tf_512.in21k_ft_in1k?
This is the base-size MaxViT (Multi-Axis Vision Transformer) model, a hybrid architecture that pairs convolutional (MBConv) layers with transformer self-attention. It was pretrained on ImageNet-21k and then fine-tuned on ImageNet-1k. Each block mixes local attention within windows with sparse global attention across a grid, so the network handles both fine detail and image-wide context.
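The identifier follows the naming scheme of the timm library, so a minimal classification sketch can be written on the assumption that the checkpoint is loaded through timm (0.9 or later) with torch and Pillow installed. The helper name `classify_image` and the path `example.jpg` are placeholders introduced here for illustration, not part of the model card.

```python
MODEL_NAME = "maxvit_base_tf_512.in21k_ft_in1k"


def classify_image(image_path, top_k=5):
    """Return (probabilities, class_indices) for the top_k ImageNet-1k classes."""
    # Imports are deferred so this module can be imported without the heavy
    # dependencies; timm, torch and Pillow are assumed to be installed.
    import timm
    import torch
    from PIL import Image

    # Downloads the pretrained weights on first use (assumes network access).
    model = timm.create_model(MODEL_NAME, pretrained=True).eval()

    # Build preprocessing from the checkpoint's own data config (input size,
    # interpolation, mean/std) instead of hard-coding those values.
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)

    image = Image.open(image_path).convert("RGB")
    batch = transform(image).unsqueeze(0)  # shape: (1, 3, H, W)

    with torch.no_grad():
        logits = model(batch)  # one score per ImageNet-1k class

    probs = logits.softmax(dim=1)
    top_probs, top_idx = torch.topk(probs, k=top_k, dim=1)
    return top_probs[0].tolist(), top_idx[0].tolist()


if __name__ == "__main__":
    # "example.jpg" is a placeholder path, not a file shipped with the model.
    for prob, idx in zip(*classify_image("example.jpg")):
        print(f"class {idx}: {prob:.3f}")
```

The returned indices are positions in the ImageNet-1k label list; mapping them to human-readable names requires a separate label file.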
Implementation Details
Each block pairs an MBConv layer with self-attention applied under two partitioning schemes: window partitioning, which groups neighboring tokens, and grid partitioning, which groups tokens spread across the whole feature map. With 120M parameters, the model processes images at 512x512 resolution and achieves 88.20% top-1 accuracy on ImageNet-1k.
- Uniform blocks across stages combining MBConv and self-attention
- Dual partitioning scheme with window and grid attention
- Computational cost: 138.02 GMACs
- Activations: 703.99M
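The difference between window and grid partitioning is easiest to see on token coordinates. The following is a minimal sketch in plain Python that only groups (row, column) positions; it does not implement attention or MBConv. The 4x4 map and group size of 2 are illustrative assumptions, not the sizes this checkpoint uses.

```python
def window_partition(height, width, window):
    """Group token coordinates into non-overlapping window x window blocks."""
    assert height % window == 0 and width % window == 0
    groups = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            # Every token in a group is a direct neighbor of the others: local mixing.
            groups.append([(top + r, left + c)
                           for r in range(window) for c in range(window)])
    return groups


def grid_partition(height, width, grid):
    """Group token coordinates into grid x grid sets spread over the whole map."""
    assert height % grid == 0 and width % grid == 0
    stride_r, stride_c = height // grid, width // grid
    groups = []
    for off_r in range(stride_r):
        for off_c in range(stride_c):
            # Tokens in a group are one stride apart, so each group spans the map.
            groups.append([(off_r + r * stride_r, off_c + c * stride_c)
                           for r in range(grid) for c in range(grid)])
    return groups


if __name__ == "__main__":
    print(window_partition(4, 4, 2)[0])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(grid_partition(4, 4, 2)[0])    # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

Both schemes attend within groups of the same size, so the cost per group is identical; only the choice of which tokens share a group changes.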
Core Capabilities
- High-resolution image classification (512x512)
- Feature extraction for use as a backbone in downstream models
- Local (window) and global (grid) attention in every block
- 88.20% top-1 accuracy on ImageNet-1k
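For feature extraction, one common approach is to load the model without its classification layer and use the pooled output as an image embedding. The sketch below assumes the timm library and torch are installed; `load_backbone` and `embed` are hypothetical helper names, and the random tensor stands in for real preprocessed images.

```python
MODEL_NAME = "maxvit_base_tf_512.in21k_ft_in1k"


def load_backbone():
    """Create the model without its classification layer (downloads weights)."""
    import timm  # assumed installed; imported lazily to keep module import light

    # num_classes=0 removes the classifier, so the forward pass returns pooled features.
    return timm.create_model(MODEL_NAME, pretrained=True, num_classes=0).eval()


def embed(model, batch):
    """Map a preprocessed (N, 3, 512, 512) float tensor to one embedding per image."""
    import torch

    with torch.no_grad():
        return model(batch)  # shape: (N, embedding_dim)


if __name__ == "__main__":
    import torch

    backbone = load_backbone()
    dummy = torch.randn(2, 3, 512, 512)  # random stand-in for normalized images
    print(embed(backbone, dummy).shape)
```

Loading the backbone once and reusing it across batches avoids repeating the weight download and model construction.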
Frequently Asked Questions
Q: What makes this model unique?
Its distinguishing feature is the pairing of MBConv blocks with two forms of attention: window attention for local features and grid attention for global ones. Because it was pretrained on ImageNet-21k before being fine-tuned on ImageNet-1k, its features were learned from a much broader set of classes than ImageNet-1k alone provides.
Q: What are the recommended use cases?
The model suits high-resolution image classification, feature extraction, and use as a backbone for downstream computer vision tasks. The 512x512 input helps in applications where fine image detail matters, at a cost of 138.02 GMACs of compute.
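When weighing the 512x512 input against a smaller one, a back-of-the-envelope estimate can help. The sketch below uses only the figures quoted above (120M parameters, 512x512 input); the 224x224 comparison size and the bytes-per-parameter values are assumptions chosen for illustration, and the pixel ratio is only a first-order proxy for compute.

```python
def pixel_ratio(side_a, side_b):
    """Ratio of pixel counts between two square input sizes."""
    return (side_a * side_a) / (side_b * side_b)


def weight_memory_mib(num_params, bytes_per_param):
    """Memory needed to store the weights alone, in MiB (activations excluded)."""
    return num_params * bytes_per_param / (1024 ** 2)


if __name__ == "__main__":
    # 224 is an assumed comparison size, not a figure from this model card.
    print(f"512 vs 224 pixels: {pixel_ratio(512, 224):.2f}x")
    # 4 bytes per parameter assumes float32 storage; 2 bytes assumes float16.
    print(f"float32 weights: {weight_memory_mib(120_000_000, 4):.1f} MiB")
    print(f"float16 weights: {weight_memory_mib(120_000_000, 2):.1f} MiB")
```

These numbers cover the stored weights only; with 703.99M activations, memory during inference will be noticeably higher than the weight footprint.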