LLaVA-OneVision Qwen2 72B
Property | Value |
---|---|
Parameter Count | 73.2B |
License | Apache 2.0 |
Languages | English, Chinese |
Paper | LLaVA-OneVision Paper |
Training Data | LLaVA-OneVision Dataset |
What is llava-onevision-qwen2-72b-si?
LLaVA-OneVision is a state-of-the-art multimodal model that builds on the Qwen2 language model and totals 73.2B parameters. It handles both single-image and multi-image inputs and offers a 32K-token context window. The model demonstrates strong performance across visual understanding benchmarks, including 93.5% accuracy on DocVQA and 86.6% on MMBench.
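For orientation, below is a minimal single-image inference sketch using the Hugging Face Transformers integration. The repo id `llava-hf/llava-onevision-qwen2-72b-ov-hf`, the image URL, and the prompt are assumptions for illustration only, not details from this page.

```python
# Minimal single-image VQA sketch (repo id and image URL are assumed placeholders).
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-72b-ov-hf"  # assumed Transformers-converted checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # matches the released precision
    device_map="auto",            # shard across available GPUs
)

# Chat-style prompt with one image placeholder
conversation = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe the chart in this document."}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open(requests.get("https://example.com/sample.png", stream=True).raw)  # placeholder URL
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```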
Implementation Details
The architecture pairs the SigLIP SO400M vision encoder with the Qwen2 language model, trained through multiple stages: LCS-558K pretraining, a mid-stage on 4.7M high-quality synthetic samples, and a final image-instruction stage on 3.6M samples. Training utilized 256 NVIDIA A100 GPUs with bfloat16 precision (see the loading sketch after the list below).
- Multi-stage training pipeline with progressive data complexity
- Extensive training on high-quality synthetic and real image data
- Optimized for both English and Chinese language processing
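At 73.2B parameters, the full bfloat16 weights require several high-memory GPUs. As a sketch of one way to fit the model on smaller hardware, the example below loads it with 4-bit quantization via bitsandbytes; the repo id and quantization settings are assumptions for illustration and not part of the published recipe.

```python
# Memory-constrained loading sketch using 4-bit quantization (assumed setup,
# not the released bfloat16 recipe; repo id is a placeholder).
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-72b-ov-hf"  # assumed repo id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep compute in bf16, matching training precision
)

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across the available GPUs
)
processor = AutoProcessor.from_pretrained(model_id)
```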
Core Capabilities
- Advanced image understanding and analysis
- Multi-image and video processing (see the multi-image sketch after this list)
- Strong performance on scientific and mathematical visual tasks
- Document and chart analysis (84.9% accuracy on ChartQA)
- Real-world visual question answering (73.8% on RealWorldQA)
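As a hedged illustration of the multi-image capability, the sketch below interleaves two image placeholders in a single chat turn and passes the images as a list; the repo id and file names are placeholders, not details from this page.

```python
# Multi-image comparison sketch (repo id and file names are assumed placeholders).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-72b-ov-hf"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# One user turn containing two image placeholders plus the question
conversation = [
    {"role": "user",
     "content": [
         {"type": "image"},
         {"type": "image"},
         {"type": "text", "text": "Compare these two charts and summarize the differences."},
     ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

images = [Image.open("chart_a.png"), Image.open("chart_b.png")]  # placeholder local files
inputs = processor(images=images, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```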
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its comprehensive training on the LLaVA-OneVision Dataset, enabling it to handle diverse visual tasks from document analysis to scientific reasoning. Its large parameter count and multi-stage training approach contribute to superior performance across benchmarks.
Q: What are the recommended use cases?
The model excels in document analysis, scientific visualization interpretation, mathematical problem-solving, and general visual question-answering tasks. It's particularly well-suited for applications requiring detailed understanding of complex visual information in both academic and real-world contexts.