LLaVA-OneVision Qwen2 72B
Property | Value |
---|---|
Parameter Count | 73.2B |
License | Apache 2.0 |
Languages | English, Chinese |
Paper | LLaVA-OneVision Paper |
Training Data | LLaVA-OneVision Dataset |
What is llava-onevision-qwen2-72b-si?
LLaVA-OneVision is a state-of-the-art multimodal model that builds on the Qwen2 language model and totals 73.2B parameters. It handles both single-image and multi-image inputs and offers a 32K-token context window. The model demonstrates strong performance across visual understanding benchmarks, including 93.5% accuracy on DocVQA and 86.6% on MMBench.
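For orientation, below is a minimal single-image inference sketch using the Hugging Face Transformers integration. The repo id `llava-hf/llava-onevision-qwen2-72b-ov-hf`, the image URL, and the prompt are assumptions for illustration only, not details from this page.

```python
# Minimal single-image VQA sketch (repo id and image URL are assumed placeholders).
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-72b-ov-hf"  # assumed Transformers-converted checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # matches the released precision
    device_map="auto",            # shard across available GPUs
)

# Chat-style prompt with one image placeholder
conversation = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe the chart in this document."}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open(requests.get("https://example.com/sample.png", stream=True).raw)  # placeholder URL
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```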
Implementation Details
The architecture pairs the SigLIP SO400M vision encoder with the Qwen2 language model, trained through multiple stages: LCS-558K pretraining, a mid-stage on 4.7M high-quality synthetic samples, and a final image-instruction stage on 3.6M samples. Training utilized 256 NVIDIA A100 GPUs with bfloat16 precision (see the loading sketch after the list below).
- Multi-stage training pipeline with progressive data complexity
- Extensive training on high-quality synthetic and real image data
- Optimized for both English and Chinese language processing
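At 73.2B parameters, the full bfloat16 weights require several high-memory GPUs. As a sketch of one way to fit the model on smaller hardware, the example below loads it with 4-bit quantization via bitsandbytes; the repo id and quantization settings are assumptions for illustration and not part of the published recipe.

```python
# Memory-constrained loading sketch using 4-bit quantization (assumed setup,
# not the released bfloat16 recipe; repo id is a placeholder).
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-72b-ov-hf"  # assumed repo id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep compute in bf16, matching training precision
)

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across the available GPUs
)
processor = AutoProcessor.from_pretrained(model_id)
```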
Core Capabilities
- Advanced image understanding and analysis
- Multi-image and video processing (see the multi-image sketch after this list)
- Strong performance on scientific and mathematical visual tasks
- Document and chart analysis (84.9% accuracy on ChartQA)
- Real-world visual question answering (73.8% on RealWorldQA)
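As a hedged illustration of the multi-image capability, the sketch below interleaves two image placeholders in a single chat turn and passes the images as a list; the repo id and file names are placeholders, not details from this page.

```python
# Multi-image comparison sketch (repo id and file names are assumed placeholders).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-72b-ov-hf"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# One user turn containing two image placeholders plus the question
conversation = [
    {"role": "user",
     "content": [
         {"type": "image"},
         {"type": "image"},
         {"type": "text", "text": "Compare these two charts and summarize the differences."},
     ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

images = [Image.open("chart_a.png"), Image.open("chart_b.png")]  # placeholder local files
inputs = processor(images=images, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```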
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its comprehensive training on the LLaVA-OneVision Dataset, enabling it to handle diverse visual tasks from document analysis to scientific reasoning. Its large parameter count and multi-stage training approach contribute to superior performance across benchmarks.
Q: What are the recommended use cases?
The model excels in document analysis, scientific visualization interpretation, mathematical problem-solving, and general visual question-answering tasks. It's particularly well-suited for applications requiring detailed understanding of complex visual information in both academic and real-world contexts.