Ovis1.6-Gemma2-9B

Maintained by: AIDC-AI

Property         Value
Parameter Count  10.2B parameters
Model Type       Multimodal LLM
License          Apache 2.0
Paper            arXiv:2405.20797
Architecture     Gemma2-9B + SigLIP-400M

What is Ovis1.6-Gemma2-9B?

Ovis1.6-Gemma2-9B is a multimodal large language model (MLLM) that pairs the Gemma2-9B language model with the SigLIP-400M vision encoder. It is designed to structurally align visual and textual embeddings, which supports image-text understanding and generation tasks. The model leads the OpenCompass benchmark among open-source MLLMs of up to 30B parameters, despite having only 10.2B parameters itself.
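To make "structurally aligning visual and textual embeddings" more concrete, here is a minimal sketch of one reading of the idea in the cited paper (arXiv:2405.20797): each image patch is mapped to a probability distribution over a learnable visual embedding table, and its embedding is the probability-weighted sum of the table rows, mirroring how text tokens index a text embedding table. The function names, table sizes, and numeric values below are hypothetical toy choices, not the model's actual code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def visual_embedding(patch_logits, embedding_table):
    """Return the probability-weighted sum of embedding table rows."""
    probs = softmax(patch_logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]

# Toy sizes (hypothetical): 3 visual "words", embedding dimension 2.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Stand-in for the scores a visual head would produce for one image patch.
patch_logits = [2.0, 0.0, 0.0]

embedding = visual_embedding(patch_logits, embedding_table)
print(embedding)
```

With uniform scores, the result is the plain average of the table rows; as one score dominates, the result approaches that single row, which is the discrete lookup used for text tokens.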

Implementation Details

The architecture uses structural embedding alignment to improve high-resolution image processing. Weights are stored as BF16 tensors, and the model supports batch processing with a maximum multimodal sequence length of 8192 tokens.
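As a rough sizing aid, the BF16 weight footprint can be estimated from the parameter count alone, since each BF16 value occupies 2 bytes. The sketch below is a back-of-the-envelope calculation only: it covers the weights and ignores activations, the KV cache, and framework overhead, so actual memory use will be higher.

```python
PARAM_COUNT = 10.2e9   # parameter count stated for this model
BYTES_PER_BF16 = 2     # bfloat16 stores each value in 16 bits

def weight_memory_bytes(param_count, bytes_per_param):
    """Memory needed to hold the weights alone, in bytes."""
    return param_count * bytes_per_param

weights_bytes = weight_memory_bytes(PARAM_COUNT, BYTES_PER_BF16)
print(f"{weights_bytes / 1e9:.1f} GB")     # 20.4 GB
print(f"{weights_bytes / 2**30:.1f} GiB")  # 19.0 GiB
```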

  • Built on the foundation of Ovis1.5 with significant improvements
  • Trained on a larger, more diverse dataset
  • Applies Direct Preference Optimization (DPO) training after instruction tuning
  • Uses structural alignment between visual and textual embeddings
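For readers unfamiliar with DPO, the standard per-example DPO objective can be sketched as follows. This is the generic loss from the DPO literature, not the maintainers' training code; the `beta` value and the log-probabilities in the example are hypothetical, since the source does not state the training configuration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * reward margin)).

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    beta=0.1 is an illustrative default, not a value from the source.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities: the policy prefers the chosen response
# more strongly than the reference does, so the loss falls below log(2).
loss = dpo_loss(-10.0, -14.0, -12.0, -12.0)
print(loss)
```

When the policy and reference agree exactly, the margin is zero and the loss equals log(2); it shrinks as the policy favors the preferred response more than the reference does.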

Core Capabilities

  • High-resolution image processing and understanding
  • Advanced text-image alignment and comprehension
  • Efficient multimodal processing with relatively small parameter count
  • Batch inference support for multiple images
  • Flexible prompt formatting with image-text combinations
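As an illustration of combining images and text in one prompt, the sketch below builds query strings that mark where each image belongs. It assumes an `<image>` placeholder followed by a newline and then the text; both the placeholder string and the multi-image layout are assumptions to verify against the model card, and `build_query` is a hypothetical helper rather than part of the model's API.

```python
IMAGE_PLACEHOLDER = "<image>"  # assumed placeholder token; check the model card

def build_query(text, num_images=1):
    """Prefix a text prompt with one placeholder line per image."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    placeholders = [IMAGE_PLACEHOLDER] * num_images
    return "\n".join(placeholders + [text])

# One query per sample, as would be prepared before batch inference.
queries = [
    build_query("Describe the image."),
    build_query("Compare the two images.", num_images=2),
]
for query in queries:
    print(repr(query))
```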

Frequently Asked Questions

Q: What makes this model unique?

The model achieves state-of-the-art performance with just 10.2B parameters, significantly fewer than competing models, while maintaining high-quality multimodal understanding through structural embedding alignment.

Q: What are the recommended use cases?

The model is well suited to image-text tasks such as image description, visual question answering, and general multimodal understanding. It is a particularly good fit for applications that need efficient processing of combined image and text inputs.