vit-mae-large
| Property | Value |
|---|---|
| Author | Facebook AI |
| License | Apache 2.0 |
| Paper | Masked Autoencoders Are Scalable Vision Learners |
| Framework | PyTorch, TensorFlow |
What is vit-mae-large?
vit-mae-large is a Vision Transformer (ViT-Large) model pre-trained with the Masked Autoencoder (MAE) approach. It is Facebook's implementation of a self-supervised method that masks a large fraction of each input image and learns to reconstruct the missing content. The model processes images as sequences of fixed-size patches and was pre-trained on the ImageNet-1K dataset.
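A minimal usage sketch for the pretraining objective, assuming the Hugging Face Transformers library and the checkpoint ID facebook/vit-mae-large (the image URL is only an example):

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEForPreTraining

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-large")
model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-large")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.loss)          # pixel reconstruction loss on the masked patches
print(outputs.logits.shape)  # (batch, num_patches, patch_size**2 * 3) reconstructed pixel values
print(outputs.mask.shape)    # (batch, num_patches); 1 marks a masked patch
```

The logits are per-patch pixel reconstructions, and the mask indicates which patches were hidden and therefore contribute to the loss.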
Implementation Details
The model pairs a BERT-like transformer encoder with a lightweight decoder in an asymmetric pretraining setup. During pretraining it randomly masks 75% of the image patches, passes only the visible patches through the encoder, and then reconstructs the masked patches with a decoder that fills the hidden positions with a learned mask token. This high masking ratio is a key design choice: it forces the model to build robust visual representations rather than simply interpolate from nearby pixels (a minimal sketch of the masking step follows the list below).
- Transformer-based encoder-decoder architecture
- 75% masking ratio during pretraining
- A shared, learnable mask token inserted at masked positions
- Reconstruction of raw pixel values
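The shuffle-and-keep masking described above can be sketched in plain PyTorch; the function below is illustrative, and its names are not taken from the released code:

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch embeddings, MAE-style.

    patches: (batch, num_patches, dim) sequence of patch embeddings.
    Returns the visible patches, a binary mask (1 = masked), and the
    indices needed to restore the original patch order.
    """
    batch, num_patches, dim = patches.shape
    len_keep = int(num_patches * (1 - mask_ratio))

    noise = torch.rand(batch, num_patches, device=patches.device)  # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))

    mask = torch.ones(batch, num_patches, device=patches.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)  # back to the original patch order
    return visible, mask, ids_restore
```

With an input of shape (2, 196, 1024), matching the patch count and hidden size of the large model at 224x224 resolution with 16x16 patches, `visible` has shape (2, 49, 1024): only 25% of the patches reach the encoder.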
Core Capabilities
- Image classification tasks
- Feature extraction for downstream vision tasks (see the sketch after this list)
- Self-supervised visual representation learning
- Efficient pretraining, since the encoder only processes the roughly 25% of patches left visible
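For feature extraction, the encoder can be used on its own. The sketch below assumes the facebook/vit-mae-large checkpoint, that the mask_ratio config value can be overridden at load time so every patch is encoded, and a placeholder image path:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEModel

processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-large")
# mask_ratio=0.0 keeps every patch visible, which is what we want for feature extraction
model = ViTMAEModel.from_pretrained("facebook/vit-mae-large", mask_ratio=0.0)

image = Image.open("example.jpg")  # any local RGB image (placeholder path)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (batch, 1 + num_patches, hidden_size) -- CLS token plus patch tokens
features = outputs.last_hidden_state
image_embedding = features[:, 0]  # CLS token as a global image descriptor
print(image_embedding.shape)      # torch.Size([1, 1024]) for the large model
```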
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its unusually high masking ratio (75%) during pretraining, significantly higher than in previous masked-prediction approaches (BERT, for example, masks only about 15% of its input tokens). This aggressive masking strategy, combined with the large model size, enables more efficient and effective self-supervised learning.
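In the Hugging Face implementation, the masking ratio is a configuration attribute and can be inspected directly (checkpoint ID assumed as above):

```python
from transformers import ViTMAEConfig

config = ViTMAEConfig.from_pretrained("facebook/vit-mae-large")
print(config.mask_ratio)         # 0.75 -- fraction of patches hidden during pretraining
print(config.hidden_size)        # encoder width (1024 for the large variant)
print(config.num_hidden_layers)  # encoder depth (24 for the large variant)
```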
Q: What are the recommended use cases?
The model is particularly well suited to image classification and can be fine-tuned for other downstream vision tasks. It is especially valuable when large amounts of unlabeled image data are available but labels are scarce, since the representation is learned without labels and only the task-specific head requires annotated examples.
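A common recipe is to attach a lightweight classification head to the pretrained encoder and fine-tune it on labeled data. The wrapper below is a hypothetical sketch (the class name, pooling choice, and mask_ratio=0.0 override are assumptions), not an official fine-tuning script:

```python
import torch
from torch import nn
from transformers import ViTMAEModel

class MAEClassifier(nn.Module):
    """Hypothetical fine-tuning head: pretrained MAE encoder + linear classifier."""

    def __init__(self, num_classes: int, checkpoint: str = "facebook/vit-mae-large"):
        super().__init__()
        # mask_ratio=0.0 so the encoder sees every patch during fine-tuning
        self.encoder = ViTMAEModel.from_pretrained(checkpoint, mask_ratio=0.0)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(pixel_values=pixel_values).last_hidden_state
        pooled = hidden[:, 1:].mean(dim=1)  # average the patch tokens (skip CLS)
        return self.head(pooled)

model = MAEClassifier(num_classes=10)
logits = model(torch.randn(2, 3, 224, 224))  # dummy batch of two 224x224 RGB images
print(logits.shape)  # torch.Size([2, 10])
```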