videomae-base-finetuned-kinetics

Maintained By
MCG-NJU

VideoMAE Base Model (Kinetics-400)

| Property | Value |
|---|---|
| Parameter Count | 86.5M |
| License | CC-BY-NC-4.0 |
| Paper | VideoMAE Paper |
| Accuracy | 80.9% Top-1, 94.7% Top-5 |
| Framework | PyTorch |

What is videomae-base-finetuned-kinetics?

VideoMAE is a video classification model that extends the Masked Autoencoder (MAE) approach from images to video. This checkpoint was pre-trained for 1,600 epochs with self-supervised masked-patch reconstruction and then fine-tuned in a supervised fashion on the Kinetics-400 dataset, making it particularly effective for video classification tasks.

Implementation Details

The model processes videos as sequences of fixed-size 16x16 patches that are linearly embedded. It uses a Vision Transformer (ViT) encoder together with a lightweight decoder that predicts pixel values for masked patches during pre-training. A [CLS] token is added at the start of the sequence for classification tasks, along with positional embeddings; a runnable inference sketch follows the list below.

  • Transformer-based architecture with specialized video processing capabilities
  • Pre-trained using masked autoencoding technique
  • Fine-tuned on Kinetics-400 dataset
  • Supports 400 different video classification labels
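
Concretely, the fine-tuned checkpoint can be run for classification via the Hugging Face transformers API. A minimal sketch, assuming the checkpoint identifier MCG-NJU/videomae-base-finetuned-kinetics (inferred from the maintainer listed above) and using random frames as a stand-in for a real decoded clip:

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# 16 frames of 3x224x224 random pixels stand in for a real decoded video clip
video = list(np.random.randn(16, 3, 224, 224))

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

# Resizes, rescales, and normalizes the frames into pixel_values of shape (1, 16, 3, 224, 224)
inputs = processor(video, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, 400): one score per Kinetics-400 label

predicted_class = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class])
```

In practice you would replace the random array with 16 frames sampled from a real video (e.g. decoded with decord or PyAV) before passing them to the processor.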

Core Capabilities

  • High-accuracy video classification (80.9% top-1 accuracy)
  • Efficient processing of video sequences
  • Feature extraction for downstream tasks (see the encoder sketch after this list)
  • Robust representation learning through masked prediction
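
For feature extraction, the bare encoder can be loaded with VideoMAEModel instead of the classification head. A minimal sketch under the same assumptions as above (checkpoint name inferred from the maintainer; random frames as placeholder input):

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

video = list(np.random.randn(16, 3, 224, 224))  # placeholder for a real 16-frame clip

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
# Loads only the encoder; the Kinetics-400 classification head is dropped
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token-level features of shape (1, num_tokens, 768); with 16 frames grouped into
# tubelets of 2 and 16x16 patches on a 224x224 input, this is 8 * 14 * 14 = 1568 tokens
features = outputs.last_hidden_state
clip_embedding = features.mean(dim=1)  # simple mean-pooled clip-level descriptor
```

The pooled embedding can then feed a lightweight downstream head (e.g. a linear probe or nearest-neighbor retrieval) without touching the encoder weights.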

Frequently Asked Questions

Q: What makes this model unique?

This model uniquely combines masked autoencoding with video processing, achieving state-of-the-art results while being data-efficient. Its architecture is specifically designed to handle the temporal aspects of video data.

Q: What are the recommended use cases?

The model is ideal for video classification tasks, particularly those involving action recognition within the Kinetics-400 categories. It can also be used for feature extraction in custom video analysis pipelines.
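
For quick experimentation, the transformers pipeline API wraps frame decoding, preprocessing, and inference in one call. A sketch with a hypothetical clip path; note that the video-classification pipeline needs a video decoding backend (such as decord or PyAV) installed:

```python
from transformers import pipeline

classifier = pipeline("video-classification", model="MCG-NJU/videomae-base-finetuned-kinetics")

# "path/to/clip.mp4" is a placeholder; pass any local video file or URL
results = classifier("path/to/clip.mp4")  # top-k Kinetics-400 labels with scores
for r in results:
    print(f"{r['label']}: {r['score']:.3f}")
```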
