llava-onevision-qwen2-0.5b-ov

Maintained By
lmms-lab

LLaVA-OneVision Qwen2 0.5B

Parameter Count: 894M
License: Apache 2.0
Languages: English, Chinese
Architecture: SO400M + Qwen2
Training Data: LLaVA-OneVision Dataset

What is llava-onevision-qwen2-0.5b-ov?

LLaVA-OneVision is a multimodal AI model that combines vision and language capabilities, pairing a SigLIP SO400M vision encoder with the Qwen2-0.5B language model for roughly 894M parameters in total. It is designed to process both images and videos and supports interaction in English and Chinese. The model features a 32K token context window and uses BF16 precision for efficient inference.
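The weights can be loaded through the Hugging Face transformers library. The sketch below is a minimal example and makes some assumptions: it uses the community-converted checkpoint llava-hf/llava-onevision-qwen2-0.5b-ov-hf rather than the original lmms-lab weights, requires transformers 4.45 or newer, and the image URL is a placeholder.

```python
# Minimal image-inference sketch. Assumes the HF-converted checkpoint
# "llava-hf/llava-onevision-qwen2-0.5b-ov-hf" and transformers >= 4.45;
# the image URL below is a placeholder.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Build a chat-formatted prompt with one image slot and a question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open(requests.get("https://example.com/sample.jpg", stream=True).raw)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```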

Implementation Details

The model underwent a multi-stage training process: pretraining on LCS-558K, followed by training on 4.7M high-quality synthetic samples, 3.6M single-image samples, and finally 1.6M mixed media samples. It achieves strong results across benchmarks, including DocVQA (73.7%), LLaVA-W (74.2%), and nuScenesVQA (70.5%).

  • Advanced multimodal processing for both images and videos (see the video inference sketch after this list)
  • Multi-stage training approach for comprehensive understanding
  • BF16 precision for memory-efficient inference
  • 32K token context window
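Video inputs follow the same pattern as images: the clip is handed to the processor as an array of RGB frames. The sketch below rests on the same assumptions as the image example (llava-hf/llava-onevision-qwen2-0.5b-ov-hf checkpoint, transformers >= 4.45); the random frames stand in for frames decoded from a real video file.

```python
# Video-inference sketch under the same assumed checkpoint and transformers version.
import numpy as np
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A clip is a (num_frames, height, width, 3) array of RGB frames; these random
# frames are placeholders for frames decoded from an actual video.
clip = np.random.randint(0, 255, size=(8, 384, 384, 3), dtype=np.uint8)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "What is happening in this clip?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(videos=clip, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)
output = model.generate(**inputs, max_new_tokens=96)
print(processor.decode(output[0], skip_special_tokens=True))
```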

Core Capabilities

  • Image and video analysis
  • Multilingual support (English/Chinese)
  • High performance on document understanding tasks
  • Visual question answering
  • Chart and diagram interpretation

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to handle multiple types of visual input (images and videos) and its training on diverse datasets make it versatile across applications. Its relatively small size (894M parameters) combined with strong benchmark performance is particularly noteworthy.

Q: What are the recommended use cases?

The model excels at document analysis, visual question answering, and general image understanding. It is particularly well suited to applications that combine image and text processing, as reflected in its document understanding (DocVQA: 73.7%) and scientific diagram interpretation (AI2D: 57.1%) scores.
