llava-onevision-qwen2-7b-si

Maintained By
lmms-lab

LLaVA-OneVision-Qwen2-7B

  • Parameter Count: 8.03B
  • Model Type: Multimodal
  • Architecture: SO400M + Qwen2
  • License: Apache 2.0
  • Paper: arxiv.org/pdf/2408.03326

What is llava-onevision-qwen2-7b-si?

LLaVA-OneVision is a powerful multimodal model that combines vision and language capabilities, built on the Qwen2 language model architecture. It's designed to handle single-image, multi-image, and video inputs within a 32K-token context window. The model posts strong results across a range of benchmarks, including 81.7% accuracy on MMBench and 96.6% on Science-QA.
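
As a quick orientation, here is a minimal single-image inference sketch using the Hugging Face transformers integration of LLaVA-OneVision. The llava-hf/llava-onevision-qwen2-7b-si-hf repo id refers to a converted checkpoint and is an assumption here (the original lmms-lab weights are normally loaded through the LLaVA codebase), and the image path is a placeholder:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Assumed HF-converted repo id; the original lmms-lab checkpoint is served via the LLaVA codebase.
model_id = "llava-hf/llava-onevision-qwen2-7b-si-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in bfloat16
    device_map="auto",
)

# One user turn with one image placeholder and a text question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this chart?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example_chart.png")  # placeholder input image
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```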

Implementation Details

The model underwent a four-stage training process: pretraining on LCS-558K data, followed by training on 4.7M high-quality synthetic data, 3.6M single-image data, and finally 1.6M mixed-media data. It's implemented in PyTorch and was trained in bfloat16 precision on 256 Nvidia Tesla A100 GPUs.

  • Supports both English and Chinese languages (see the bilingual sketch after this list)
  • Built on LLaVA-OneVision Dataset
  • Uses the Hugging Face Trainer for orchestration
  • Implements a context window of 32K tokens
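
The bilingual support noted above can be exercised with the same chat template. The sketch below reuses the processor and model loaded in the previous snippet; the Chinese question and the image path are illustrative placeholders:

```python
# Reuses `processor` and `model` from the single-image sketch above.
import torch
from PIL import Image

image = Image.open("receipt.png")  # placeholder input image

# The same chat template accepts Chinese prompts.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "请描述这张图片的主要内容。"},  # "Describe the main content of this image."
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```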

Core Capabilities

  • High-performance visual question answering (89.3% on DocVQA)
  • Advanced chart and diagram comprehension (78.8% on ChartQA)
  • Strong scientific reasoning capabilities (96.6% on Science-QA)
  • Efficient processing of multiple image types and videos (see the video sketch after this list)
  • Bilingual support for English and Chinese
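
For video inputs, a clip can be passed as a stack of sampled frames. This sketch again reuses the loaded processor and model and samples frames with OpenCV; the file name and frame count are placeholders:

```python
# Reuses `processor` and `model` from the single-image sketch above.
import cv2
import numpy as np
import torch

def sample_frames(path: str, num_frames: int = 8) -> np.ndarray:
    """Uniformly sample RGB frames from a video file with OpenCV."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)

clip = sample_frames("demo.mp4")  # placeholder video file

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "Describe what happens in this clip."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(videos=clip, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```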

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive features are its comprehensive multi-stage training approach and its ability to handle visual inputs ranging from single images to videos, while maintaining high performance across diverse benchmarks.

Q: What are the recommended use cases?

The model excels in document analysis, scientific reasoning, chart interpretation, and general visual question-answering tasks. It's particularly suitable for applications requiring both image and text understanding in educational, scientific, or business contexts.
