VideoMAE Base Model (Kinetics-400)
| Property | Value |
|---|---|
| Parameter Count | 86.5M |
| License | CC-BY-NC-4.0 |
| Paper | VideoMAE Paper |
| Accuracy | 80.9% Top-1, 94.7% Top-5 |
| Framework | PyTorch |
What is videomae-base-finetuned-kinetics?
VideoMAE is a video classification model that extends the Masked Autoencoder (MAE) approach from images to video. This checkpoint was pre-trained for 1,600 epochs with self-supervised masked-patch reconstruction and then fine-tuned on the labeled Kinetics-400 dataset, which makes it well suited to video classification tasks.
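To make the pre-training objective concrete, here is a minimal sketch using the Hugging Face transformers API. It assumes the self-supervised backbone checkpoint MCG-NJU/videomae-base (the un-fine-tuned counterpart of this model) and uses a random 16-frame clip as a placeholder for real decoded video frames:

```python
# Minimal sketch of VideoMAE's masked-autoencoding pre-training step,
# assuming the Hugging Face checkpoint "MCG-NJU/videomae-base".
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForPreTraining

# Placeholder clip: 16 frames of shape (3, 224, 224); use real decoded frames in practice.
video = list(np.random.randn(16, 3, 224, 224))

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")

pixel_values = processor(video, return_tensors="pt").pixel_values

# One token per 2x16x16 tubelet: (16 frames / 2) * (224 / 16)^2 = 1568 tokens.
num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (16 // model.config.tubelet_size) * num_patches_per_frame

# Randomly mark patch positions as masked; the decoder must reconstruct their pixels.
bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
print(outputs.loss)  # reconstruction loss on the masked patches
```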
Implementation Details
The model processes a video as a sequence of fixed-size 16x16 patches, which are linearly embedded. It uses a Vision Transformer (ViT) encoder together with a specialized decoder that predicts pixel values for the masked patches during pre-training. A [CLS] token is added at the start of the sequence for classification tasks, along with positional embeddings.
- Transformer-based architecture with specialized video processing capabilities
- Pre-trained using masked autoencoding technique
- Fine-tuned on Kinetics-400 dataset
- Classifies videos into the 400 Kinetics-400 action labels (see the inference sketch below)
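Putting these pieces together, a minimal classification sketch looks like the following. It assumes the Hugging Face checkpoint MCG-NJU/videomae-base-finetuned-kinetics and again uses a random clip standing in for 16 real frames sampled from a video:

```python
# Minimal inference sketch, assuming the Hugging Face checkpoint
# "MCG-NJU/videomae-base-finetuned-kinetics".
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# Placeholder clip: 16 frames of shape (3, 224, 224); sample real frames in practice.
video = list(np.random.randn(16, 3, 224, 224))

ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 400): one logit per Kinetics-400 label

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```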
Core Capabilities
- High-accuracy video classification (80.9% top-1 accuracy)
- Efficient pre-training: the encoder processes only the small fraction of visible (unmasked) patches
- Feature extraction for downstream tasks (see the sketch after this list)
- Robust representation learning through masked prediction
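For the feature-extraction capability, the headless backbone can be loaded as VideoMAEModel, which returns one embedding per video patch instead of class logits. A short sketch, under the same checkpoint assumption as above:

```python
# Feature-extraction sketch: the headless backbone yields per-patch embeddings
# for downstream models. Checkpoint ID assumed as in the classification example.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

video = list(np.random.randn(16, 3, 224, 224))  # placeholder 16-frame clip

ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEModel.from_pretrained(ckpt)

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, 1568, 768) patch embeddings

clip_embedding = hidden.mean(dim=1)  # simple mean-pooled clip-level feature
```

Mean pooling over the patch tokens is one simple way to get a clip-level vector; a downstream head can of course consume the full token sequence instead.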
Frequently Asked Questions
Q: What makes this model unique?
This model adapts masked autoencoding to video, achieving state-of-the-art results while remaining data-efficient: the temporal redundancy of video allows an extremely high masking ratio (90-95%) during pre-training, so strong representations can be learned from comparatively little data.
Q: What are the recommended use cases?
The model is ideal for video classification tasks, particularly those involving action recognition within the Kinetics-400 categories. It can also be used for feature extraction in custom video analysis pipelines.