X-CLIP Base Patch32
| Property | Value |
|---|---|
| Parameter Count | 197M |
| License | MIT |
| Paper | View Paper |
| Top-1 Accuracy | 80.4% |
| Training Dataset | Kinetics-400 |
What is xclip-base-patch32?
X-CLIP base-patch32 is a video-language understanding model from Microsoft that extends CLIP to video analysis. It processes video content using 8 frames per video at 224x224 resolution, with a patch size of 32 pixels, and is trained on the Kinetics-400 dataset for video-text matching and video classification tasks.
Implementation Details
The model employs a transformer-based architecture that processes video frames through a patch-based approach, using contrastive learning to align video content with textual descriptions (a minimal usage sketch follows the list below).
- Processes 8 frames per video input
- Uses 224x224 resolution for frame analysis
- Splits each frame into 32x32-pixel patches
- Achieves 95.0% top-5 accuracy on Kinetics-400
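As a rough illustration, the sketch below drives this 8-frame pipeline through the Hugging Face transformers library: candidate text labels and the frames are passed to the processor, and the video-text similarity scores are softmaxed into zero-shot class probabilities. The random frames and label strings are placeholders; a real application would sample 8 frames from an actual video (e.g. with decord or PyAV).

```python
# Minimal sketch: zero-shot video classification with X-CLIP via transformers.
# The dummy frames and label prompts below are illustrative placeholders.
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

model_id = "microsoft/xclip-base-patch32"
processor = XCLIPProcessor.from_pretrained(model_id)
model = XCLIPModel.from_pretrained(model_id)

# X-CLIP base-patch32 expects 8 RGB frames; the processor resizes them to 224x224.
frames = list(np.random.randint(0, 256, size=(8, 224, 224, 3), dtype=np.uint8))

labels = ["playing basketball", "cooking", "walking the dog"]
inputs = processor(text=labels, videos=frames, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Similarity of the video to each text prompt, softmaxed into class probabilities.
probs = outputs.logits_per_video.softmax(dim=1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```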
Core Capabilities
- Zero-shot video classification
- Few-shot learning capabilities
- Video-text retrieval (see the embedding sketch after this list)
- Fully supervised video classification
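For retrieval-style use, video and text embeddings can also be extracted separately and compared with cosine similarity. The sketch below assumes the get_video_features / get_text_features helpers exposed by the transformers X-CLIP implementation and again uses dummy frames and example queries in place of real data.

```python
# Minimal sketch: separate video/text embeddings for video-text retrieval.
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

model_id = "microsoft/xclip-base-patch32"
processor = XCLIPProcessor.from_pretrained(model_id)
model = XCLIPModel.from_pretrained(model_id)

frames = list(np.random.randint(0, 256, size=(8, 224, 224, 3), dtype=np.uint8))
queries = ["a person playing guitar", "a dog catching a frisbee"]

video_inputs = processor(videos=frames, return_tensors="pt")
text_inputs = processor(text=queries, return_tensors="pt", padding=True)

with torch.no_grad():
    video_emb = model.get_video_features(**video_inputs)  # (1, embed_dim)
    text_emb = model.get_text_features(**text_inputs)     # (2, embed_dim)

# Cosine similarity between the video and each query; higher = better match.
video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (video_emb @ text_emb.T)[0]
for q, s in zip(queries, scores.tolist()):
    print(f"{q}: {s:.3f}")
```

In a retrieval setting, the text (or video) embeddings would typically be precomputed and indexed, with the other modality embedded at query time.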
Frequently Asked Questions
Q: What makes this model unique?
This model uniquely extends CLIP's vision-language capabilities to video understanding, offering strong performance on video classification tasks while maintaining a relatively compact parameter count of 197M.
Q: What are the recommended use cases?
The model excels at video classification tasks, video-text matching, and can be used for both zero-shot and few-shot learning scenarios. It's particularly effective for applications requiring understanding of video content in relation to textual descriptions.