LLaVA-OneVision Qwen2 0.5B
| Property | Value |
|---|---|
| Parameter Count | 894M |
| License | Apache 2.0 |
| Languages | English, Chinese |
| Architecture | SO400M + Qwen2 |
| Training Data | LLaVA-OneVision Dataset |
What is llava-onevision-qwen2-0.5b-ov?
LLaVA-OneVision is a multimodal AI model that combines vision and language capabilities, pairing a SigLIP SO400M vision encoder with the Qwen2-0.5B language model for a total of 894M parameters. It is designed to process both images and videos and supports interaction in English and Chinese. The model features a 32K-token context window and uses BF16 precision for efficient processing.
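Since the card does not include usage code, here is a minimal single-image inference sketch. It assumes the Transformers-converted checkpoint `llava-hf/llava-onevision-qwen2-0.5b-ov-hf` (the research weights are published as `lmms-lab/llava-onevision-qwen2-0.5b-ov`) and a recent `transformers` release that ships `LlavaOnevisionForConditionalGeneration`; the image URL is a placeholder.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Assumed Transformers-converted checkpoint; swap in the lmms-lab weights
# if you are using the original LLaVA codebase instead.
model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"

# Load in BF16, the precision the card says the model uses.
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Build a chat-style prompt with one image placeholder plus the question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Placeholder URL; any RGB image works.
image = Image.open(requests.get("https://example.com/sample.jpg", stream=True).raw)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```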
Implementation Details
The model underwent a multi-stage training process: pretraining on LCS-558K, followed by training on 4.7M high-quality synthetic samples, 3.6M single-image samples, and finally 1.6M mixed-media samples. It achieves impressive performance across various benchmarks, with notable scores on DocVQA (73.7%), LLaVA-W (74.2%), and nuScenesVQA (70.5%).
- Advanced multimodal processing capabilities for images and videos
- Multi-stage training approach for comprehensive understanding
- BF16 precision for optimal performance
- 32K token context window
Core Capabilities
- Image and video analysis (a video-inference sketch follows this list)
- Multilingual support (English/Chinese)
- High performance on document understanding tasks
- Visual question answering
- Chart and diagram interpretation
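For the video side, a similarly hedged sketch is below. It again assumes the `llava-hf/llava-onevision-qwen2-0.5b-ov-hf` checkpoint and that the processor accepts pre-decoded frames through its `videos` argument; the random frames stand in for a clip you would normally decode with a library such as decord or PyAV.

```python
import numpy as np
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"  # assumed converted checkpoint
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# 16 dummy RGB frames as a stand-in for a sampled video clip (T, H, W, C).
video = np.random.randint(0, 255, size=(16, 384, 384, 3), dtype=np.uint8)

# Chat-style prompt with a video placeholder plus the question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(videos=[video], text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```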
Frequently Asked Questions
Q: What makes this model unique?
The model's ability to handle multiple types of visual input (images and videos) and its training on diverse datasets make it versatile across applications. Its relatively small size (894M parameters) combined with strong benchmark performance is particularly noteworthy.
Q: What are the recommended use cases?
The model excels in document analysis, visual question answering, and general image understanding tasks. It's particularly well-suited for applications requiring both image and text processing, such as document analysis (DocVQA: 73.7%) and scientific diagram interpretation (AI2D: 57.1%).