LLaVA-Video-7B-Qwen2-Video-Only
Property | Value |
---|---|
Parameter Count | 8.03B |
License | Apache 2.0 |
Base Model | llava-onevision-qwen2-7b-si |
Training Data | LLaVA-Video-178K |
What is LLaVA-Video-7B-Qwen2-Video-Only?
LLaVA-Video-7B-Qwen2-Video-Only is a video understanding model built on the Qwen2 language model. It has a 32K-token context window, accepts up to 110 frames per video, and reports results on several video benchmarks, including NextQA, ActNet-QA, MLVU, and PercepTest. As the name indicates, it was trained on video data only, which distinguishes it from the multi-modal variants in the same family.
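To make the relationship between the context window and the frame limit concrete, the following back-of-the-envelope sketch computes an upper bound on how many tokens each frame could occupy. It assumes that "32K" means 32,768 tokens, and the `reserved_text_tokens` parameter is a hypothetical allowance for the prompt and the answer. The number of tokens the model actually spends per frame is not stated here, so treat the result only as a ceiling.

```python
def max_tokens_per_frame(context_tokens=32_768, num_frames=110, reserved_text_tokens=0):
    """Upper bound on tokens available to each frame within the context window.

    All defaults are assumptions for illustration: 32,768 is one reading of
    "32K", and reserved_text_tokens is a hypothetical budget for text.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    available = context_tokens - reserved_text_tokens
    if available < 0:
        raise ValueError("reserved_text_tokens exceeds the context window")
    # Integer division: a partial token cannot be allocated to a frame.
    return available // num_frames


print(max_tokens_per_frame())                           # 297
print(max_tokens_per_frame(reserved_text_tokens=1024))  # 288
```

Under these assumptions, fewer than 300 tokens are available per frame at the full 110-frame limit, which is one reason to sample fewer frames for short clips.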
Implementation Details
The model pairs the SO400M vision encoder with the Qwen2 language model and was trained on video data only for one epoch. Training used bfloat16 precision and ran on 256 Nvidia Tesla A100 GPUs.
- Supports both English and Chinese language processing
- Achieves 82.2% accuracy on the NextQA benchmark
- Demonstrates strong performance on various video understanding tasks
- Uses Hugging Face Trainer for orchestration
Core Capabilities
- Processes up to 110 frames per video
- Generates detailed video descriptions
- Reported benchmark accuracy: ActNet-QA (58.2%), MLVU (69.8%), and PercepTest (71.7%)
- Efficient video processing with customizable frame sampling
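The frame sampling mentioned above can be sketched as uniform index selection. This is a minimal illustration, not the model's own preprocessing code: the function name `sample_frame_indices` and the default of 110 frames (taken from the limit stated in this card) are assumptions, and decoding the selected frames from a video file is left to whichever video reader you use.

```python
def sample_frame_indices(total_frames, max_frames=110):
    """Pick up to max_frames evenly spaced frame indices from a video.

    Hypothetical helper for illustration. It returns every index when the
    video is short enough, and otherwise spreads the picks uniformly from
    the first frame to the last.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    if total_frames <= max_frames:
        return list(range(total_frames))
    if max_frames == 1:
        return [0]
    # Integer arithmetic keeps the first index at 0 and the last at
    # total_frames - 1, with strictly increasing indices in between.
    last = total_frames - 1
    return [(i * last) // (max_frames - 1) for i in range(max_frames)]


indices = sample_frame_indices(total_frames=1000, max_frames=110)
print(len(indices), indices[0], indices[-1])  # 110 0 999
```

Lowering `max_frames` trades temporal coverage for a shorter input sequence, which may be worthwhile for short clips or tight latency budgets.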
Frequently Asked Questions
Q: What makes this model unique?
The model is distinguished by its video-only training and by its support for inputs of up to 110 frames. It is also notable for achieving results comparable to those of the more complex multi-modal variants.
Q: What are the recommended use cases?
The model is suited to video understanding tasks such as detailed video description and video question answering. It fits applications that need comprehensive video analysis and natural-language interaction with video content.