VideoMAE Base Model (Kinetics-400)
| Property | Value |
|---|---|
| Parameter Count | 86.5M |
| License | CC-BY-NC-4.0 |
| Paper | VideoMAE Paper |
| Accuracy | 80.9% Top-1, 94.7% Top-5 |
| Framework | PyTorch |
What is videomae-base-finetuned-kinetics?
VideoMAE is a video classification model that extends the Masked Autoencoder (MAE) approach from images to video. This checkpoint was pre-trained for 1,600 epochs with self-supervised masked-patch reconstruction and then fine-tuned on the labeled Kinetics-400 dataset, which makes it well suited to video classification tasks.
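To make the pre-training objective concrete, here is a minimal sketch using the Hugging Face transformers API. It assumes the self-supervised backbone checkpoint MCG-NJU/videomae-base (the un-fine-tuned counterpart of this model) and uses a random 16-frame clip as a placeholder for real decoded video frames:

```python
# Minimal sketch of VideoMAE's masked-autoencoding pre-training step,
# assuming the Hugging Face checkpoint "MCG-NJU/videomae-base".
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForPreTraining

# Placeholder clip: 16 frames of shape (3, 224, 224); use real decoded frames in practice.
video = list(np.random.randn(16, 3, 224, 224))

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")

pixel_values = processor(video, return_tensors="pt").pixel_values

# One token per 2x16x16 tubelet: (16 frames / 2) * (224 / 16)^2 = 1568 tokens.
num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (16 // model.config.tubelet_size) * num_patches_per_frame

# Randomly mark patch positions as masked; the decoder must reconstruct their pixels.
bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
print(outputs.loss)  # reconstruction loss on the masked patches
```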
Implementation Details
The model processes a video as a sequence of fixed-size 16x16 patches, which are linearly embedded. It uses a Vision Transformer (ViT) encoder together with a specialized decoder that predicts pixel values for the masked patches during pre-training. A [CLS] token is added at the start of the sequence for classification tasks, along with positional embeddings.
- Transformer-based architecture with specialized video processing capabilities
- Pre-trained using masked autoencoding technique
- Fine-tuned on Kinetics-400 dataset
- Classifies videos into the 400 Kinetics-400 action labels (see the inference sketch below)
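Putting these pieces together, a minimal classification sketch looks like the following. It assumes the Hugging Face checkpoint MCG-NJU/videomae-base-finetuned-kinetics and again uses a random clip standing in for 16 real frames sampled from a video:

```python
# Minimal inference sketch, assuming the Hugging Face checkpoint
# "MCG-NJU/videomae-base-finetuned-kinetics".
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# Placeholder clip: 16 frames of shape (3, 224, 224); sample real frames in practice.
video = list(np.random.randn(16, 3, 224, 224))

ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 400): one logit per Kinetics-400 label

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```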
Core Capabilities
- High-accuracy video classification (80.9% top-1 accuracy)
- Efficient pre-training: the encoder processes only the small fraction of visible (unmasked) patches
- Feature extraction for downstream tasks (see the sketch after this list)
- Robust representation learning through masked prediction
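For the feature-extraction capability, the headless backbone can be loaded as VideoMAEModel, which returns one embedding per video patch instead of class logits. A short sketch, under the same checkpoint assumption as above:

```python
# Feature-extraction sketch: the headless backbone yields per-patch embeddings
# for downstream models. Checkpoint ID assumed as in the classification example.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

video = list(np.random.randn(16, 3, 224, 224))  # placeholder 16-frame clip

ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEModel.from_pretrained(ckpt)

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, 1568, 768) patch embeddings

clip_embedding = hidden.mean(dim=1)  # simple mean-pooled clip-level feature
```

Mean pooling over the patch tokens is one simple way to get a clip-level vector; a downstream head can of course consume the full token sequence instead.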
Frequently Asked Questions
Q: What makes this model unique?
This model adapts masked autoencoding to video, achieving state-of-the-art results while remaining data-efficient: the temporal redundancy of video allows an extremely high masking ratio (90-95%) during pre-training, so strong representations can be learned from comparatively little data.
Q: What are the recommended use cases?
The model is ideal for video classification tasks, particularly those involving action recognition within the Kinetics-400 categories. It can also be used for feature extraction in custom video analysis pipelines.