vit-mae-base

Maintained by: facebook

Vision Transformer MAE Base

Property         Value
Parameter Count  112M parameters
License          Apache 2.0
Framework        PyTorch, TensorFlow
Paper            Masked Autoencoders Are Scalable Vision Learners (MAE)

What is vit-mae-base?

vit-mae-base is a Vision Transformer (ViT) model pre-trained using the Masked Autoencoder (MAE) approach. Developed by Facebook, this model implements a self-supervised learning strategy in which 75% of image patches are randomly masked during pre-training and the model learns to reconstruct the missing pixels, allowing it to learn robust visual representations without labeled data.

Implementation Details

This implementation follows a BERT-like transformer encoder architecture in which images are processed as sequences of fixed-size patches. During pre-training the model uses an asymmetric design: the encoder processes only the visible (unmasked) patches, and a lightweight decoder then combines the encoded patches with learnable mask tokens to reconstruct the pixel values of the masked portions. A minimal sketch of running the pre-training objective follows the list below.

  • Employs transformer encoder architecture for image processing
  • Uses patch-based image representation
  • Implements 75% masking ratio during pre-training
  • Supports both PyTorch and TensorFlow frameworks
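
As a rough, minimal sketch of the pre-training objective (not the official training script; the image URL is just a convenient public example), the Hugging Face ViTMAEForPreTraining head can be run on a single image to inspect the reconstruction loss and the random 75% mask:

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEForPreTraining

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The loss is the mean squared error computed on the masked patches only.
print(outputs.loss)          # reconstruction loss
print(outputs.logits.shape)  # (1, num_patches, patch_size**2 * 3) pixel predictions
print(outputs.mask.shape)    # (1, num_patches); 1 marks a masked patch
```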

Core Capabilities

  • Image classification and feature extraction (see the sketch after this list)
  • Self-supervised pre-training on ImageNet-1K dataset
  • Efficient reconstruction of masked image patches
  • Support for downstream task fine-tuning
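
For feature extraction, one pattern (a sketch, assuming the transformers ViTMAE classes; overriding mask_ratio to 0.0 keeps every patch, since the encoder otherwise drops the masked 75%) is to use ViTMAEModel as an image embedder and take the [CLS] token:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEModel

processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
# mask_ratio=0.0 disables random masking so the encoder sees all patches.
model = ViTMAEModel.from_pretrained("facebook/vit-mae-base", mask_ratio=0.0)

image = Image.open("example.jpg")  # any local RGB image (placeholder path)
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, 1 + num_patches, 768)

cls_embedding = hidden[:, 0]  # 768-d image representation from the [CLS] token
print(cls_embedding.shape)
```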

Frequently Asked Questions

Q: What makes this model unique?

This model's uniqueness lies in its high masking ratio (75%) during pre-training and its efficient self-supervised learning approach, which learns strong visual representations without requiring any labeled data.
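
For reference, the masking ratio is exposed as a configuration field in the transformers implementation and can be inspected or overridden for experiments (a small sketch, assuming the ViTMAEConfig API):

```python
from transformers import ViTMAEConfig

config = ViTMAEConfig.from_pretrained("facebook/vit-mae-base")
print(config.mask_ratio)  # 0.75 by default
```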

Q: What are the recommended use cases?

The model is particularly well-suited for image classification tasks and can be fine-tuned for specific vision tasks. It's especially valuable when working with limited labeled data due to its self-supervised pre-training approach.
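
For downstream fine-tuning, one common pattern (a sketch, not the authors' recipe; the class count, module name, and linear head below are hypothetical) is to place a classification head on top of the pre-trained encoder's [CLS] token:

```python
import torch
import torch.nn as nn
from transformers import ViTMAEModel

class MAEClassifier(nn.Module):
    """Hypothetical wrapper: pre-trained MAE encoder + linear classification head."""

    def __init__(self, num_classes: int):
        super().__init__()
        # Disable masking so the encoder sees every patch during fine-tuning.
        self.encoder = ViTMAEModel.from_pretrained("facebook/vit-mae-base", mask_ratio=0.0)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, pixel_values):
        hidden = self.encoder(pixel_values=pixel_values).last_hidden_state
        return self.head(hidden[:, 0])  # classify from the [CLS] token

model = MAEClassifier(num_classes=10)   # placeholder class count
dummy = torch.randn(2, 3, 224, 224)     # stand-in for a processed image batch
print(model(dummy).shape)               # torch.Size([2, 10])
```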
