xclip-base-patch32-16-frames

microsoft

X-CLIP base model for video-text understanding with 197M params. Achieves 81.1% top-1 accuracy on Kinetics-400. Uses 16-frame input at 224x224 resolution.

Property         Value
Parameter Count  197M
License          MIT
Paper            View Paper
Top-1 Accuracy   81.1%
Framework        PyTorch

What is xclip-base-patch32-16-frames?

X-CLIP extends the CLIP architecture to video-language understanding. This base variant processes 16 frames per video at 224x224 resolution, splitting each frame into 32x32-pixel patches. It was developed by Microsoft and trained on the Kinetics-400 dataset, where it reaches 81.1% top-1 accuracy in video classification.
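To make the input geometry concrete, the short calculation below derives how many patches a 224x224 frame yields with 32-pixel patches, and how many patches a 16-frame clip contains. It is only arithmetic on the numbers quoted above; the constant names are illustrative, and any extra tokens the model may add (such as class tokens) are not counted.

```python
# Input geometry taken from the model description.
IMAGE_SIZE = 224   # frame height and width in pixels
PATCH_SIZE = 32    # patch height and width in pixels
NUM_FRAMES = 16    # frames per video clip

# Each frame is cut into a square grid of non-overlapping patches.
patches_per_side = IMAGE_SIZE // PATCH_SIZE
patches_per_frame = patches_per_side ** 2
patches_per_clip = patches_per_frame * NUM_FRAMES

print(patches_per_side, patches_per_frame, patches_per_clip)  # 7 49 784
```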

Implementation Details

The model uses a transformer-based architecture that processes video and text inputs within a contrastive learning framework. It represents each video with 16 frames, so the temporal input size is fixed.

  • Uses 32x32 pixel patches for video frame processing
  • Processes 16 frames per video sequence
  • Implements contrastive learning between video and text pairs
  • Trained on Kinetics-400 dataset
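Because the model expects exactly 16 frames per clip, a longer or shorter video has to be reduced to 16 frame indices before preprocessing. The sketch below shows one common approach, picking the middle frame of 16 equal segments. The source does not state which sampling strategy was used in training, so `sample_frame_indices` is a hypothetical helper and not the authors' method.

```python
def sample_frame_indices(total_frames, num_frames=16):
    """Pick num_frames indices spread evenly across a video.

    Assumption: uniform segment-center sampling; the original training
    recipe may have sampled frames differently.
    """
    if total_frames < 1:
        raise ValueError("video must contain at least one frame")
    segment = total_frames / num_frames
    # Take the middle of each segment, clamped to the last valid index.
    # Videos shorter than num_frames simply repeat some frames.
    return [
        min(total_frames - 1, int(segment * i + segment / 2))
        for i in range(num_frames)
    ]


indices = sample_frame_indices(160)
print(indices[:3], indices[-1])  # [5, 15, 25] 155
```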

Core Capabilities

  • Video classification with 81.1% top-1 and 95.5% top-5 accuracy
  • Zero-shot and few-shot video classification
  • Video-text retrieval
  • Feature extraction for downstream tasks
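Zero-shot classification and retrieval with CLIP-style models both reduce to comparing a video embedding with text embeddings. The sketch below illustrates that scoring step with toy 3-dimensional vectors in plain Python: cosine similarity followed by a softmax. The embeddings, the label names, and the `temperature` value are made up for illustration; real embeddings would come from the model's video and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_scores(video_emb, text_embs, temperature=0.07):
    """Turn video-text similarities into a probability per label.

    temperature is an illustrative value, not the model's learned scale.
    """
    logits = [cosine(video_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings standing in for encoder outputs.
video = [0.9, 0.1, 0.0]
labels = {
    "playing guitar": [1.0, 0.0, 0.0],
    "cooking": [0.0, 1.0, 0.0],
    "swimming": [0.0, 0.0, 1.0],
}
probs = zero_shot_scores(video, list(labels.values()))
best = max(zip(labels, probs), key=lambda pair: pair[1])[0]
print(best)  # playing guitar
```

The same similarity matrix serves video-text retrieval: rank texts for a given video, or videos for a given text, by their similarity scores.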

Frequently Asked Questions

Q: What makes this model unique?

This model extends CLIP to video understanding while staying relatively compact at 197M parameters. It reaches 81.1% top-1 accuracy on Kinetics-400 from 16 frames per clip, which makes it practical for real-world applications.

Q: What are the recommended use cases?

The model is suited to video classification, video-text retrieval, and zero-shot scenarios. It is particularly useful for applications that need to relate video content to textual descriptions.
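For the classification use case, a minimal sketch using the Hugging Face `transformers` library might look like the following. It assumes the checkpoint is available on the Hub as `microsoft/xclip-base-patch32-16-frames` and that `torch` and `transformers` are installed; `classify_video` is a hypothetical wrapper written for this example, and the frames are expected to be 16 RGB images already sampled from the video.

```python
# Assumed Hub identifier, built from the publisher and model name above.
MODEL_ID = "microsoft/xclip-base-patch32-16-frames"


def classify_video(frames, labels, model_id=MODEL_ID):
    """Score one clip against candidate text labels.

    frames: list of 16 RGB frames (for example NumPy arrays of shape (H, W, 3)).
    labels: list of candidate label strings.
    Returns a dict mapping each label to its probability.
    """
    # Imported lazily so this file loads without the heavy dependencies.
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    processor = XCLIPProcessor.from_pretrained(model_id)
    model = XCLIPModel.from_pretrained(model_id)

    # The processor resizes and normalizes frames and tokenizes the labels.
    inputs = processor(
        text=labels, videos=list(frames), return_tensors="pt", padding=True
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # One row per video, one column per label.
    probs = outputs.logits_per_video.softmax(dim=1)[0]
    return dict(zip(labels, probs.tolist()))
```

Calling `classify_video(frames, ["playing guitar", "cooking", "swimming"])` downloads the weights on first use and returns one probability per label. Loading the model once and reusing it across clips would be preferable in a real application.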
