xclip-base-patch32-16-frames

Maintained by: microsoft

X-CLIP Base Model (patch32-16-frames)

Parameter Count: 197M
License: MIT
Paper: View Paper
Top-1 Accuracy: 81.1%
Framework: PyTorch

What is xclip-base-patch32-16-frames?

X-CLIP extends the CLIP architecture to general video-language understanding. This base variant processes videos as 16 frames at 224x224 resolution with 32x32-pixel patches. It was developed by Microsoft and trained on the Kinetics-400 dataset, reaching 81.1% top-1 accuracy on video classification.

Implementation Details

The model employs a transformer-based architecture that processes video and text inputs in a contrastive learning framework. By sampling 16 temporal frames per clip, it keeps video understanding efficient while maintaining high accuracy; a usage sketch follows the list below.

  • Uses 32x32 pixel patches for video frame processing
  • Processes 16 frames per video sequence
  • Implements contrastive learning between video and text pairs
  • Trained on Kinetics-400 dataset
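As a rough illustration of this pipeline, the sketch below runs zero-shot classification with the Hugging Face transformers XCLIPProcessor and XCLIPModel classes. The Hub id microsoft/xclip-base-patch32-16-frames is assumed from the model name and maintainer above, and the text prompts and randomly generated frames are placeholders, not part of the model card.

```python
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

# Checkpoint id assumed from the model name and maintainer
ckpt = "microsoft/xclip-base-patch32-16-frames"
processor = XCLIPProcessor.from_pretrained(ckpt)
model = XCLIPModel.from_pretrained(ckpt)

# A clip is passed as a list of 16 RGB frames at 224x224;
# random pixels stand in for real decoded video frames here.
video = list(np.random.randint(0, 256, size=(16, 224, 224, 3), dtype=np.uint8))

# Candidate labels for zero-shot classification (placeholder prompts)
texts = ["playing guitar", "riding a bike", "cooking"]

inputs = processor(text=texts, videos=video, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Contrastive video-text similarity scores, softmaxed over the prompts
probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```

In practice the random frames would be replaced by 16 frames sampled from a decoded video, for example via decord or PyAV.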

Core Capabilities

  • Video classification on Kinetics-400 with 81.1% top-1 and 95.5% top-5 accuracy
  • Zero-shot and few-shot video classification
  • Video-text retrieval
  • Feature extraction for downstream tasks (see the sketch below)
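To illustrate the retrieval and feature-extraction capabilities, the sketch below embeds a clip and a set of captions separately with the model's get_video_features and get_text_features helpers and ranks them by cosine similarity. The clip data and captions are placeholder assumptions.

```python
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

ckpt = "microsoft/xclip-base-patch32-16-frames"  # Hub id assumed from the model name
processor = XCLIPProcessor.from_pretrained(ckpt)
model = XCLIPModel.from_pretrained(ckpt)

# Placeholder clip: 16 random 224x224 RGB frames in place of real video
video = list(np.random.randint(0, 256, size=(16, 224, 224, 3), dtype=np.uint8))
captions = ["a person dancing", "a dog catching a frisbee"]

video_inputs = processor(videos=video, return_tensors="pt")
text_inputs = processor(text=captions, return_tensors="pt", padding=True)

with torch.no_grad():
    # Fixed-size embeddings reusable for retrieval or other downstream tasks
    video_emb = model.get_video_features(**video_inputs)  # shape (1, dim)
    text_emb = model.get_text_features(**text_inputs)     # shape (2, dim)

# Cosine similarity ranks captions against the clip (video-text retrieval)
video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = video_emb @ text_emb.T
print(dict(zip(captions, scores[0].tolist())))
```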

Frequently Asked Questions

Q: What makes this model unique?

This model extends CLIP's image-text capabilities to video understanding while keeping a relatively compact architecture (197M parameters). It reaches 81.1% top-1 accuracy on Kinetics-400 while processing only 16 frames per clip, making it practical for real-world applications.

Q: What are the recommended use cases?

The model excels at video classification and video-text retrieval, and it also supports zero-shot and few-shot classification. It is particularly suitable for applications that need to relate video content to textual descriptions.
