videomae-large-finetuned-kinetics

Maintained By
MCG-NJU

VideoMAE Large Finetuned Kinetics

Property          Value
Parameter Count   304M
License           CC-BY-NC-4.0
Paper             VideoMAE Paper
Accuracy          84.7% (Top-1), 96.5% (Top-5)
Framework         PyTorch

What is videomae-large-finetuned-kinetics?

VideoMAE is a video classification model that extends the Masked Autoencoder (MAE) approach from images to video. This large variant was pre-trained for 1600 epochs with self-supervised masked autoencoding and then fine-tuned on the Kinetics-400 dataset. The model splits each video into a sequence of fixed-size 16x16 patches and processes them with a Vision Transformer encoder, paired with a lightweight decoder during pre-training.
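As a rough illustration of the tokenization, the number of input tokens for a standard 16-frame, 224x224 clip can be computed from the patch size and the tubelet size (a tubelet depth of 2 frames is the value reported in the VideoMAE paper; treat these defaults as an assumption about the model's configuration):

```python
def num_video_tokens(frames=16, height=224, width=224,
                     patch=16, tubelet=2):
    """Count the patch tokens VideoMAE produces for one clip.

    Each token covers a tubelet x patch x patch cube of pixels
    (2 frames x 16 x 16 pixels with the assumed defaults).
    """
    tokens_per_slice = (height // patch) * (width // patch)  # 14 * 14 = 196
    temporal_slices = frames // tubelet                      # 16 / 2  = 8
    return temporal_slices * tokens_per_slice

print(num_video_tokens())  # 8 * 196 = 1568 tokens per clip
```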

Implementation Details

The model architecture builds upon the Vision Transformer (ViT) framework, incorporating several key technical innovations:

  • Uses a [CLS] token for classification tasks
  • Employs fixed sine/cosine position embeddings (not learned)
  • Processes video inputs as sequences of 16x16 patches
  • Pairs the Transformer encoder with a lightweight decoder during pre-training
  • Utilizes masked autoencoding for self-supervised learning
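The fixed sine/cosine position embeddings mentioned above can be sketched in a few lines of NumPy. This is the standard sinusoidal scheme (even channels get sine, odd channels get cosine, with geometrically decreasing frequencies); the function name and the 10000 base are illustrative, not taken from the VideoMAE codebase:

```python
import numpy as np

def sincos_position_embeddings(num_positions, dim):
    """Fixed (non-learned) sine/cosine position embeddings.

    Returns an array of shape (num_positions, dim) where even
    channels hold sin(pos * freq) and odd channels hold
    cos(pos * freq) for a geometric range of frequencies.
    """
    positions = np.arange(num_positions)[:, None]           # (N, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    angles = positions * freqs                              # (N, dim/2)
    emb = np.zeros((num_positions, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

pe = sincos_position_embeddings(1568, 1024)  # one row per video token
print(pe.shape)  # (1568, 1024)
```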

Core Capabilities

  • Video classification across 400 Kinetics categories
  • Feature extraction for downstream tasks
  • High-accuracy prediction (84.7% top-1)
  • Efficient processing of video sequences
  • Robust representation learning
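During masked-autoencoding pre-training, only a small fraction of tokens is visible to the encoder; the VideoMAE paper reports a very high masking ratio (around 90%). The sketch below uses plain random masking for brevity, whereas the paper actually uses tube masking along the time axis, so take it as a simplified illustration:

```python
import numpy as np

def split_visible_masked(num_tokens=1568, mask_ratio=0.9, seed=0):
    """Randomly choose which token indices are masked in pre-training.

    The encoder sees only the visible tokens; the decoder must
    reconstruct the pixel content of the masked ones.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(num_tokens * mask_ratio)
    perm = rng.permutation(num_tokens)
    return perm[num_masked:], perm[:num_masked]  # (visible, masked)

visible, masked = split_visible_masked()
print(len(visible), len(masked))  # 157 visible vs 1411 masked tokens
```

The extreme masking ratio is what makes pre-training efficient: the heavy encoder runs on roughly a tenth of the tokens.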

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its application of masked autoencoding to video data, achieving state-of-the-art performance while using an efficient self-supervised pre-training approach. The large parameter count (304M) and impressive accuracy metrics make it particularly suitable for complex video understanding tasks.

Q: What are the recommended use cases?

The model is primarily designed for video classification tasks, particularly within the Kinetics-400 dataset categories. It's well-suited for applications requiring high-accuracy video understanding, such as content categorization, action recognition, and video indexing systems.
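For classification within the Kinetics-400 categories, a minimal inference sketch with the Hugging Face transformers library looks like the following. It follows the standard VideoMAE usage pattern; running it downloads the checkpoint, and the random array here merely stands in for 16 real video frames:

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# 16 frames of 3x224x224 random data in place of a real clip
video = list(np.random.randn(16, 3, 224, 224))

ckpt = "MCG-NJU/videomae-large-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 400), one logit per class

predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```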
