maxvit_base_tf_512.in21k_ft_in1k

Maintained by: timm

MaxViT Base TF 512 Model

  • Parameter Count: 120M
  • Model Type: Image Classification
  • License: Apache 2.0
  • Input Resolution: 512x512
  • Top-1 Accuracy: 88.20%
  • Paper: MaxViT: Multi-Axis Vision Transformer

What is maxvit_base_tf_512.in21k_ft_in1k?

This is a MaxViT (Multi-Axis Vision Transformer) base model that combines convolutional and transformer components in a single architecture. It was pretrained on ImageNet-21k and fine-tuned on ImageNet-1k, and its hybrid block design gives it both local and global processing capabilities.

Implementation Details

The model pairs MBConv blocks with self-attention, using both window and grid partitioning schemes. With roughly 120M parameters, it processes images at 512x512 resolution and achieves 88.20% top-1 accuracy on ImageNet-1k; a short loading sketch follows the list below.

  • Uniform block design across all stages, each block combining MBConv with self-attention
  • Dual partitioning scheme with window and grid attention
  • 138.02 GMACs computational complexity
  • 703.99M activations
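
As a quick check of these numbers, the snippet below is a minimal sketch (assuming timm >= 0.9 and downloadable pretrained weights) that loads the model and prints its parameter count and expected input size; it is standard timm usage rather than part of this model card.

```python
import timm

# Create the pretrained model (downloads the ImageNet-1k fine-tuned weights on first use).
model = timm.create_model('maxvit_base_tf_512.in21k_ft_in1k', pretrained=True)
model.eval()

# Roughly 120M parameters, as listed above.
n_params = sum(p.numel() for p in model.parameters())
print(f'parameters: {n_params / 1e6:.1f}M')

# The resolved data config reports the expected 512x512 input resolution
# and the normalization statistics used for this checkpoint.
data_config = timm.data.resolve_model_data_config(model)
print(data_config['input_size'])  # (3, 512, 512)
```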

Core Capabilities

  • High-resolution image classification (512x512)
  • Feature extraction capabilities (see the sketch after this list)
  • Efficient processing with balanced local-global attention
  • Strong performance on complex visual tasks
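
To illustrate the feature-extraction capability, the following sketch uses timm's features_only mode to pull multi-scale feature maps from the backbone; the call pattern is assumed standard timm usage, and the dummy input is purely illustrative.

```python
import timm
import torch

# Build the backbone as a multi-scale feature extractor.
backbone = timm.create_model(
    'maxvit_base_tf_512.in21k_ft_in1k',
    pretrained=True,
    features_only=True,
)
backbone.eval()

# Channel counts and reduction factors for each returned feature map.
print(backbone.feature_info.channels())
print(backbone.feature_info.reduction())

# One forward pass on a dummy 512x512 image yields a pyramid of feature maps.
with torch.no_grad():
    features = backbone(torch.randn(1, 3, 512, 512))
for f in features:
    print(f.shape)
```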

Frequently Asked Questions

Q: What makes this model unique?

This model combines MBConv blocks with a dual attention mechanism: window (block) attention captures local detail, while grid attention mixes information across the whole feature map (a minimal sketch of the two partitioning schemes follows below). Its pre-training on ImageNet-21k followed by ImageNet-1k fine-tuning gives it robust feature recognition capabilities.
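
For intuition, here is a minimal, self-contained sketch of the two partitioning schemes (not timm's actual implementation): window partitioning groups each local PxP neighborhood into one attention sequence, while grid partitioning groups tokens sampled at a fixed stride across the whole map, so attention within a group mixes information globally.

```python
import torch

def window_partition(x: torch.Tensor, p: int) -> torch.Tensor:
    """Group local p x p windows: (B, H, W, C) -> (B * H/p * W/p, p*p, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)

def grid_partition(x: torch.Tensor, g: int) -> torch.Tensor:
    """Group a sparse g x g grid of tokens spaced H/g (and W/g) apart:
    (B, H, W, C) -> (B * H/g * W/g, g*g, C)."""
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, C)

# Both produce sequences of the same length (here 7*7 = 49 tokens),
# but window attention is local while grid attention is global/dilated.
x = torch.randn(1, 14, 14, 64)
print(window_partition(x, 7).shape)  # (4, 49, 64)
print(grid_partition(x, 7).shape)    # (4, 49, 64)
```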

Q: What are the recommended use cases?

The model is particularly well-suited for high-resolution image classification tasks, feature extraction, and as a backbone for downstream computer vision tasks. Its 512x512 resolution makes it ideal for applications requiring detailed image analysis.
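
As an illustration of the classification use case, the sketch below follows timm's standard inference pattern: build the model-specific preprocessing transform, run a forward pass, and read the top-5 ImageNet-1k probabilities. The image path is a placeholder and pretrained weights are assumed to be available.

```python
import timm
import torch
from PIL import Image

model = timm.create_model('maxvit_base_tf_512.in21k_ft_in1k', pretrained=True)
model.eval()

# Build the evaluation transform (resize/crop to 512x512 plus normalization).
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

img = Image.open('example.jpg').convert('RGB')  # placeholder input image
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: (1, 1000)

top5_prob, top5_idx = logits.softmax(dim=-1).topk(5)
print(top5_idx, top5_prob)
```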
