LLaVA-OneVision Qwen2 7B
| Property | Value |
|---|---|
| Parameter Count | 8.03B |
| License | Apache 2.0 |
| Languages | English, Chinese |
| Paper | LLaVA-OneVision Paper |
| Training Data | LLaVA-OneVision Dataset |
What is llava-onevision-qwen2-7b-ov?
LLaVA-OneVision is a multimodal model built on the Qwen2 architecture, designed to process and understand single images, multi-image inputs, and videos. With 8.03B parameters trained in bfloat16 precision, it delivers strong visual-language understanding across multiple benchmarks.
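For orientation, the sketch below shows single-image inference through the `transformers` library. It assumes the community-converted checkpoint `llava-hf/llava-onevision-qwen2-7b-ov-hf` and the `LlavaOnevisionForConditionalGeneration` / `AutoProcessor` classes from recent `transformers` releases; if you load the original checkpoint through the LLaVA repository instead, the class names and prompt handling will differ.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Assumed id of the community transformers conversion of this checkpoint.
model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the bfloat16 training precision
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Chat-style prompt with one image placeholder followed by a question.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Summarize the contents of this document."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("document.png")  # placeholder path to a local image
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

Loading the 8B weights in bfloat16 needs roughly 16 GB of GPU memory for the parameters alone; a quantized variant can be substituted where that is not available.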
Implementation Details
The model pairs a SigLIP SO400M vision encoder with the Qwen2 language model and is trained in four stages: pretraining on LCS-558K, a mid stage on 4.7M high-quality synthetic samples, a final-image stage on 3.6M single-image samples, and a OneVision stage on 1.6M mixed single-image, multi-image, and video samples.
- Context window of 32K tokens
- Trained on 256 Nvidia Tesla A100 GPUs
- Trained with the Hugging Face Trainer on top of PyTorch
- Supports both image and video inputs (see the video sketch after this list)
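Because the model also accepts video, the sketch below extends the same loading pattern to a clip passed as a stack of frames. The `videos=` processor argument and the `"video"` placeholder in the chat template follow the `transformers` LLaVA-OneVision processor as I understand it; the frame count, resolution, and random frames are placeholders for real decoded video.

```python
import numpy as np
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed conversion id
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Stand-in for decoded video: 16 RGB frames of 384x384, shape (num_frames, H, W, 3).
video = np.random.randint(0, 256, (16, 384, 384, 3), dtype=np.uint8)

conversation = [
    {"role": "user", "content": [
        {"type": "video"},
        {"type": "text", "text": "Describe what happens in this clip."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# The processor treats a list of frames as a single video.
inputs = processor(videos=list(video), text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```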
Core Capabilities
- 90.2% accuracy on DocVQA benchmark
- 80.8% accuracy on MMBench
- 96.0% accuracy on ScienceQA
- Effective processing of multi-image and video inputs (see the multi-image sketch after this list)
- Bilingual support for English and Chinese
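As noted in the list above, multi-image prompts are built by repeating the image placeholder and passing the images in the same order. This is a sketch under the same assumptions as the earlier examples (the llava-hf conversion and `AutoProcessor` chat template), with hypothetical file names.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed conversion id
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Two image placeholders in one turn -> pass two images in matching order.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "What changed between these two charts?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

images = [Image.open("chart_before.png"), Image.open("chart_after.png")]  # placeholder paths
inputs = processor(images=images, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```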
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its multi-stage training recipe and its ability to handle visual inputs ranging from single images to multi-image sets and videos, while maintaining strong performance across diverse benchmarks.
Q: What are the recommended use cases?
The model excels in document analysis, scientific question answering, chart interpretation, and general visual-language tasks, making it suitable for educational, research, and commercial applications requiring sophisticated visual understanding.