llava-onevision-qwen2-7b-ov

Maintained By
lmms-lab

LLaVA-OneVision Qwen2 7B

  • Parameter Count: 8.03B
  • License: Apache 2.0
  • Languages: English, Chinese
  • Paper: LLaVA-OneVision Paper
  • Training Data: LLaVA-OneVision Dataset

What is llava-onevision-qwen2-7b-ov?

LLaVA-OneVision is a state-of-the-art multimodal model built on the Qwen2 language model, designed to process and understand single images, multi-image sets, and videos. With 8.03B parameters trained in bfloat16 precision, it delivers strong performance across a wide range of vision-language benchmarks.
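
For quick experimentation, the model can be run through the Hugging Face transformers integration. The sketch below is a minimal, hedged single-image example: it assumes the community-converted checkpoint llava-hf/llava-onevision-qwen2-7b-ov-hf (the original lmms-lab weights are instead loaded through the LLaVA-NeXT codebase), and the exact prompt format and processor arguments may vary across transformers versions.

```python
# Minimal single-image inference sketch (assumes transformers >= 4.45 and the
# HF-converted checkpoint "llava-hf/llava-onevision-qwen2-7b-ov-hf").
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16, matching the reported training precision
    device_map="auto",
)

# Build a chat-formatted prompt with one image placeholder.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Example image (any RGB PIL image works here).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```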

Implementation Details

The model pairs the SigLIP SO400M vision encoder with a Qwen2 language model and follows a four-stage training process: LCS-558K pretraining, a mid stage on 4.7M high-quality synthetic data, a final-image stage on 3.6M single-image data, and a OneVision stage on 1.6M mixed single-image, multi-image, and video data.

  • Context window of 32K tokens
  • Trained on 256 Nvidia Tesla A100 GPUs
  • Built with PyTorch and the Hugging Face Trainer
  • Supports both image and video inputs (see the video sketch below)
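
The same transformers integration exposes a video path. The sketch below is a hedged example built on the same assumed HF-converted checkpoint; it feeds a stack of frames through the processor's videos argument, with random placeholder frames standing in for real decoded video (in practice, sample frames uniformly with a decoder such as PyAV or Decord).

```python
# Hedged video-inference sketch: placeholder frames stand in for decoded video.
import numpy as np
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed HF-converted checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

conversation = [
    {"role": "user", "content": [
        {"type": "video"},
        {"type": "text", "text": "Describe what happens in this video."},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Placeholder clip: 8 RGB frames of 384x384. Replace with frames sampled from a
# real video (shape: num_frames x height x width x 3, dtype uint8).
frames = np.random.randint(0, 256, size=(8, 384, 384, 3), dtype=np.uint8)

inputs = processor(videos=[frames], text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```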

Core Capabilities

  • 90.2% on the DocVQA benchmark
  • 80.8% accuracy on MMBench
  • 96.0% accuracy on ScienceQA
  • Effective processing of multi-image and video inputs (see the multi-image sketch after this list)
  • Bilingual support for English and Chinese
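
To illustrate the multi-image capability noted above, the sketch below interleaves two image placeholders in a single user turn. It carries the same assumptions as the earlier examples (HF-converted checkpoint, transformers chat template); the image URLs are just illustrative inputs.

```python
# Hedged multi-image sketch: two images compared in one user turn.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed HF-converted checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# One image placeholder per input image, in the order the images are passed.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "What differs between these two images?"},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

urls = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "http://images.cocodataset.org/val2017/000000000139.jpg",
]
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

inputs = processor(images=images, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```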

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its staged training curriculum and its ability to handle a range of visual inputs, from single images and multi-image sets to videos, while maintaining high performance across diverse benchmarks.

Q: What are the recommended use cases?

The model excels at document analysis, scientific question answering, chart interpretation, and general vision-language tasks, making it suitable for educational, research, and commercial applications that require sophisticated visual understanding.
