llava-onevision-qwen2-0.5b-si

Maintained by: lmms-lab

LLaVA-OneVision Qwen2 0.5B

Parameter Count: 894M
Model Type: Multimodal Vision-Language
Architecture: SO400M + Qwen2
License: Apache 2.0
Paper: LLaVA-OneVision Paper

What is llava-onevision-qwen2-0.5b-si?

LLaVA-OneVision is a multimodal vision-language model that pairs a SigLIP SO400M vision encoder with the Qwen2-0.5B language model, for 894M parameters in total. It processes both images and videos, supports interaction in English and Chinese, and provides a 32K-token context window, delivering strong results across a range of visual-language tasks.
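
The lmms-lab checkpoint targets the LLaVA-NeXT codebase, but the architecture is also available through community transformers ports. Below is a minimal single-image inference sketch, assuming transformers >= 4.45 and the llava-hf port name shown (the exact checkpoint id and image URL are illustrative assumptions, not taken from this model card):

```python
# Minimal single-image inference sketch.
# Assumes transformers >= 4.45 and a llava-hf community port of this
# architecture (the lmms-lab checkpoint itself targets the LLaVA-NeXT repo).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"  # assumed port name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat-formatted prompt with one image placeholder.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

url = "https://example.com/sample.jpg"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```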

Implementation Details

The model underwent a multi-stage training process: LCS-558K pretraining, training on 4.7M high-quality synthetic samples, training on 3.6M single-image samples, and a final OneVision stage on 1.6M mixed-format samples. Training used bfloat16 precision on 256 NVIDIA A100 GPUs.

  • Trained on LLaVA-OneVision Dataset
  • Supports both single-image and multi-image processing (see the multi-image sketch after this list)
  • Implements advanced video understanding capabilities
  • Uses the Hugging Face Trainer and PyTorch
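
Multi-image prompts follow the same pattern as the single-image sketch above: one {"type": "image"} placeholder per image, with the images passed as a list. This sketch reuses the processor and model objects from that example; the file paths are placeholders:

```python
# Multi-image sketch (reuses `processor` and `model` from the earlier
# single-image example; "first.jpg" / "second.jpg" are placeholder paths).
import torch
from PIL import Image

conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "What differs between these two images?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

images = [Image.open("first.jpg"), Image.open("second.jpg")]
inputs = processor(images=images, text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```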

Core Capabilities

  • Strong DocVQA performance (75.0)
  • Excellent ImageDC (image detailed captioning) score (83.0)
  • Robust LLaVA-W score (71.2)
  • Solid ScienceQA accuracy (67.8%)
  • Bilingual support (English and Chinese)

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to handle single images, multiple images, and video within one model, while staying competitive across benchmarks at a compact 894M parameters, is what makes it stand out.
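
For video, the transformers port accepts a stack of sampled frames alongside a {"type": "video"} placeholder. A sketch under the same assumptions as the earlier examples, using PyAV for frame decoding (the video path and frame count are placeholders):

```python
# Video inference sketch (reuses `processor` and `model` from above;
# assumes PyAV is installed and the container reports its frame count).
import av
import numpy as np
import torch

def sample_frames(path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    container = av.open(path)
    stream = container.streams.video[0]
    indices = set(np.linspace(0, stream.frames - 1, num_frames, dtype=int).tolist())
    frames = []
    for i, frame in enumerate(container.decode(stream)):
        if i in indices:
            frames.append(frame.to_ndarray(format="rgb24"))
    return np.stack(frames)  # shape: (num_frames, H, W, 3)

clip = sample_frames("clip.mp4")  # placeholder path

conversation = [
    {"role": "user", "content": [
        {"type": "video"},
        {"type": "text", "text": "Summarize what happens in this video."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(videos=clip, text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```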

Q: What are the recommended use cases?

The model excels in document visual question answering, image analysis, scientific question answering, and general visual-language tasks. It's particularly well-suited for applications requiring bilingual support and diverse visual input processing.
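
Because the model is bilingual, the same chat interface works with Chinese prompts unchanged. A brief sketch reusing the objects from the single-image example:

```python
# Bilingual usage sketch: the same chat template handles Chinese prompts
# (reuses `processor`, `model`, and `image` from the first example).
import torch

conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "请描述这张图片的内容。"},  # "Describe this image."
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```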
