llava-onevision-qwen2-72b-ov-chat

Maintained by: lmms-lab

LLaVA-OneVision Qwen2 72B Chat Model

Property         Value
Parameter Count  73.2B
License          Apache 2.0
Languages        English, Chinese
Paper            LLaVA-OneVision Paper (arXiv:2408.03326)
Tensor Type      BF16

What is llava-onevision-qwen2-72b-ov-chat?

LLaVA-OneVision is a state-of-the-art family of multimodal models, and this checkpoint is the chat-tuned 72B variant built on the Qwen2 language model. It can process and discuss both images and videos, and it has undergone iterative DPO (Direct Preference Optimization) training on human preference data, which makes it particularly well-suited to natural conversational interaction.

Implementation Details

The architecture pairs the SigLIP SO400M vision encoder with a Qwen2 language model, trained through a multi-stage pipeline: LCS-558K pretraining, a mid-stage on 4.7M high-quality synthetic samples, single-image training on 3.6M samples, and a final OneVision stage on 1.6M mixed-format samples, with iterative DPO applied afterwards for this chat variant. The model uses bfloat16 precision and was trained on 256 NVIDIA A100 GPUs. A minimal loading sketch follows the feature list below.

  • Comprehensive multimodal support for images and videos
  • Iterative DPO training with human preference optimization
  • Advanced vision-language capabilities with SO400M + Qwen2 architecture
  • Trained on the LLaVA-OneVision Dataset
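
As a rough illustration of loading the bfloat16 checkpoint for single-image chat, the sketch below uses Hugging Face transformers. The model id llava-hf/llava-onevision-qwen2-72b-ov-chat-hf (a community conversion) and the local image path are assumptions, not confirmed by this card; the original lmms-lab weights are served through the LLaVA-NeXT codebase instead.

```python
# Minimal sketch: loading the chat checkpoint in bfloat16 via transformers.
# Assumption: the llava-hf converted checkpoint id below exists; the original
# lmms-lab weights require the LLaVA-NeXT repository instead.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-72b-ov-chat-hf"  # assumed id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 tensor type listed above
    device_map="auto",           # shard the 72B weights across available GPUs
)

# Build a chat-style prompt with one image placeholder.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example.jpg")  # placeholder path: any local image
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```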

Core Capabilities

  • Multi-format visual processing (single-image, multi-image, video; see the multi-image sketch after this list)
  • Bilingual support (English and Chinese)
  • Natural conversational interactions
  • High-quality visual understanding and description
  • Instruction-following while maintaining chat capabilities
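
For the multi-image case, the same processor accepts several images in one chat turn. A hedged sketch, reusing the assumed `model` and `processor` objects from the loading example above (the file paths are placeholders):

```python
# Sketch: asking the model to compare two images in one chat turn.
# Reuses `model` and `processor` from the loading example (assumptions noted there).
import torch
from PIL import Image

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What differs between these two images?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

images = [Image.open("before.jpg"), Image.open("after.jpg")]  # placeholder paths
inputs = processor(images=images, text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```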

Frequently Asked Questions

Q: What makes this model unique?

Its distinguishing features are iterative DPO training on human preferences and the ability to handle multiple visual formats (single-image, multi-image, and video) in one model. The combination of the SigLIP SO400M vision encoder and the Qwen2 language model, together with extensive multi-stage training on diverse datasets, makes it particularly effective for chat-based interaction.

Q: What are the recommended use cases?

The model is ideal for applications requiring natural conversation about visual content, including image analysis, visual question-answering, and video content discussion. It's particularly well-suited for bilingual applications requiring sophisticated visual understanding and natural language interaction.
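
For video content discussion, the transformers processor also accepts a stack of sampled frames. A rough sketch under the same assumptions as the earlier examples; the zero-filled frame array is a stand-in for frames decoded from a real video file:

```python
# Sketch: chatting about a video by passing uniformly sampled frames.
# `model` and `processor` come from the earlier loading example; the frame
# array below is a placeholder for frames decoded with e.g. PyAV.
import numpy as np
import torch

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "What happens in this video?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Placeholder: 8 sampled RGB frames of shape (num_frames, height, width, 3).
frames = np.zeros((8, 384, 384, 3), dtype=np.uint8)
inputs = processor(videos=[frames], text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```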
