Ovis1.6-Gemma2-9B

Maintained By
AIDC-AI

Property          Value
Parameter Count   10.2B parameters
Model Type        Multimodal LLM
License           Apache 2.0
Paper             arXiv:2405.20797
Architecture      Gemma2-9B + SigLIP-400M

What is Ovis1.6-Gemma2-9B?

Ovis1.6-Gemma2-9B is a multimodal large language model (MLLM) that pairs the Gemma2-9B language model with the SigLIP-400M vision encoder. It is designed to structurally align visual and textual embeddings, which supports image-text understanding and text generation grounded in images. The model leads the OpenCompass benchmark among open-source MLLMs with up to 30B parameters, despite having only 10.2B parameters itself.
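To make "structurally align" more concrete: as described in the Ovis paper (arXiv:2405.20797), each image patch is mapped to a probability distribution over a learnable visual vocabulary, and its embedding is the probability-weighted average of rows from a visual embedding table, mirroring how a text token selects a row of the text embedding table. The following is a minimal pure-Python sketch of that idea. The table size, the toy values, and the function names `softmax` and `visual_embedding` are illustrative assumptions, not the model's actual code.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def visual_embedding(patch_logits, embedding_table):
    """Return the probability-weighted average of the embedding-table rows."""
    probs = softmax(patch_logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy visual vocabulary: 3 "visual words", each with a 2-dimensional embedding.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical scores a vision head might assign to one image patch.
logits = [2.0, 0.5, 0.1]

embedding = visual_embedding(logits, table)
print(embedding)
```

A text token corresponds to the special case where the distribution is one-hot, so the weighted average collapses to a single table row. That shared lookup structure is what the word "structural" refers to in this sketch.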

Implementation Details

The architecture uses structural embedding alignment to improve high-resolution image processing. Weights are stored in BF16, and the model supports batch processing with a maximum multimodal sequence length of 8192 tokens.

  • Built on the foundation of Ovis1.5 with significant improvements
  • Trained on a larger, more diverse dataset
  • Applies DPO training after instruction tuning
  • Utilizes advanced visual-textual alignment techniques
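The BF16 and 8192-token settings above can be sketched as a loading helper plus a rough weight-memory estimate. This is a hedged sketch, not an official recipe: the repository id `AIDC-AI/Ovis1.6-Gemma2-9B` and the `multimodal_max_length` keyword are assumptions based on the maintainer name and the figures on this page, and `load_model` needs `torch`, `transformers`, and enough accelerator memory, so it is defined but not called here.

```python
MODEL_ID = "AIDC-AI/Ovis1.6-Gemma2-9B"  # assumed Hugging Face repository id
MULTIMODAL_MAX_LENGTH = 8192            # maximum multimodal length from this page
PARAM_COUNT = 10.2e9                    # parameter count from this page
BYTES_PER_BF16 = 2                      # a BF16 value occupies 16 bits


def weight_memory_gib(param_count=PARAM_COUNT, bytes_per_param=BYTES_PER_BF16):
    """Estimate memory for the weights alone (no activations or KV cache)."""
    return param_count * bytes_per_param / 2**30


def load_model():
    """Load the model in BF16. Downloads weights; imports are deferred on purpose."""
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        # Assumed to be consumed by the model's custom code via trust_remote_code.
        multimodal_max_length=MULTIMODAL_MAX_LENGTH,
        trust_remote_code=True,
    )


print(f"Approximate BF16 weight memory: {weight_memory_gib():.1f} GiB")
```

The estimate works out to roughly 19 GiB for the weights, so real usage needs headroom beyond that for activations and long multimodal sequences.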

Core Capabilities

  • High-resolution image processing and understanding
  • Advanced text-image alignment and comprehension
  • Efficient multimodal processing with relatively small parameter count
  • Batch inference support for multiple images
  • Flexible prompt formatting with image-text combinations
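As a rough illustration of the prompt formatting and batching mentioned above, the sketch below builds one query per image and groups the queries into fixed-size batches. The `<image>` placeholder convention, the file names, and the helper names `build_query` and `batched` are assumptions for illustration; check the model's own usage example for the exact prompt format it expects.

```python
IMAGE_PLACEHOLDER = "<image>"  # assumed marker for where the image is inserted


def build_query(text):
    """Put the image placeholder on its own line, followed by the text prompt."""
    return f"{IMAGE_PLACEHOLDER}\n{text}"


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical (image path, question) pairs.
requests = [
    ("cat.jpg", "Describe the image."),
    ("chart.png", "What trend does the chart show?"),
    ("receipt.jpg", "What is the total amount?"),
]

queries = [build_query(question) for _, question in requests]
batches = list(batched(queries, batch_size=2))
print(batches)
```

Each batch would then be tokenized and padded to a common length before generation; that step depends on the model's own preprocessing code and is left out here.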

Frequently Asked Questions

Q: What makes this model unique?

The model achieves state-of-the-art performance with just 10.2B parameters, significantly fewer than competing models, while maintaining high-quality multimodal understanding through structural embedding alignment.

Q: What are the recommended use cases?

The model excels in image-text tasks including image description, visual question answering, and multimodal understanding. It's particularly suitable for applications requiring efficient processing of combined image and text inputs.
