llava-onevision-qwen2-72b-si

Maintained By
lmms-lab

LLaVA-OneVision Qwen2 72B

  • Parameter Count: 73.2B
  • License: Apache 2.0
  • Languages: English, Chinese
  • Paper: LLaVA-OneVision Paper
  • Training Data: LLaVA-OneVision Dataset

What is llava-onevision-qwen2-72b-si?

LLaVA-OneVision is a 73.2B-parameter multimodal model that pairs the Qwen2 language model with a SigLIP vision encoder; the "-si" suffix marks the single-image stage checkpoint of the LLaVA-OneVision release. It handles both single-image and multi-image inputs and supports a 32K-token context window. The model scores highly across visual understanding benchmarks, including 93.5% accuracy on DocVQA and 86.6% on MMBench.
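
Below is a minimal single-image inference sketch following the usage pattern of the LLaVA-NeXT (llava) codebase that this checkpoint targets. The image path, question, and generation settings are placeholders, and at this scale the weights must be sharded across several GPUs (hence device_map="auto"); treat it as a starting point rather than a verified recipe.

```python
import copy
import torch
from PIL import Image

from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

# Load the checkpoint; device_map="auto" shards the 72B weights across available GPUs.
pretrained = "lmms-lab/llava-onevision-qwen2-72b-si"
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, "llava_qwen", device_map="auto"
)
model.eval()

# Preprocess a single image (placeholder path).
image = Image.open("example_document.png")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(dtype=torch.float16, device="cuda") for t in image_tensor]

# Build a Qwen-style chat prompt containing one image token.
conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?")
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(
    conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to("cuda")

# Greedy decoding of the answer.
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=[image.size],
    do_sample=False,
    temperature=0,
    max_new_tokens=256,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```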

Implementation Details

The architecture combines the SigLIP SO400M vision encoder with the Qwen2 language model. Training proceeded in stages: LCS-558K pretraining, training on 4.7M high-quality synthetic samples, and a final single-image stage on 3.6M samples. Training ran on 256 NVIDIA A100 GPUs with bfloat16 precision.

  • Multi-stage training pipeline with progressive data complexity
  • Extensive training on high-quality synthetic and real image data
  • Optimized for both English and Chinese language processing
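
As a rough, back-of-the-envelope sketch (not a figure from the model card), the parameter count and bfloat16 precision already bound the serving footprint: the weights alone occupy roughly 73.2B × 2 bytes ≈ 146 GB (about 136 GiB) before activations or KV cache, which is why the example above assumes multi-GPU deployment.

```python
# Rough weight-memory estimate for bfloat16 inference (an assumption, not an
# official figure): ignores activations, KV cache, and framework overhead.
params = 73.2e9          # parameter count from the table above
bytes_per_param = 2      # bfloat16 stores each parameter in 2 bytes
weight_gib = params * bytes_per_param / 1024**3
print(f"~{weight_gib:.0f} GiB of weights")   # ≈ 136 GiB -> multi-GPU serving
```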

Core Capabilities

  • Advanced image understanding and analysis
  • Multi-image and video processing (see the sketch after this list)
  • Strong performance on scientific and mathematical visual tasks
  • Document and chart analysis (84.9% accuracy on ChartQA)
  • Real-world visual question answering (73.8% on RealWorldQA)
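
The multi-image sketch referenced above continues from the single-image example (model, tokenizer, and image_processor already loaded); the file names and question are hypothetical. The only changes are passing a list of images and including one image token per image in the prompt.

```python
import copy
import torch
from PIL import Image

from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

# Reuses model, tokenizer, and image_processor from the single-image sketch above.
images = [Image.open(p) for p in ["chart_2022.png", "chart_2023.png"]]  # placeholder files
image_tensor = process_images(images, image_processor, model.config)
image_tensor = [t.to(dtype=torch.float16, device="cuda") for t in image_tensor]

# One image token per image, followed by the question.
question = (DEFAULT_IMAGE_TOKEN + "\n") * len(images) + "How did the trend change between these charts?"
conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(
    conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to("cuda")

output_ids = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=[img.size for img in images],
    do_sample=False,
    temperature=0,
    max_new_tokens=256,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```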

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its comprehensive training on the LLaVA-OneVision Dataset, enabling it to handle diverse visual tasks from document analysis to scientific reasoning. Its large parameter count and multi-stage training approach contribute to superior performance across benchmarks.

Q: What are the recommended use cases?

The model excels in document analysis, scientific visualization interpretation, mathematical problem-solving, and general visual question-answering tasks. It's particularly well-suited for applications requiring detailed understanding of complex visual information in both academic and real-world contexts.
