Ovis1.6-Gemma2-9B
| Property | Value |
|---|---|
| Parameter Count | 10.2B |
| Model Type | Multimodal LLM |
| License | Apache 2.0 |
| Paper | arXiv:2405.20797 |
| Architecture | Gemma2-9B + SigLIP-400M |
What is Ovis1.6-Gemma2-9B?
Ovis1.6-Gemma2-9B is a multimodal large language model that combines the Gemma2-9B language model with the SigLIP-400M vision encoder. It is designed to structurally align visual and textual embeddings, enabling sophisticated image-text understanding tasks. Despite having only 10.2B parameters, the model leads the OpenCompass benchmark among open-source MLLMs under 30B parameters.
Implementation Details
The model implements a novel architecture that enhances high-resolution image processing through structural embedding alignment. It uses BF16 weights and supports batch inference with a multimodal maximum length of 8192 tokens.
- Built on the foundation of Ovis1.5 with significant improvements
- Trained on a larger, more diverse dataset
- Implements DPO training following instruction-tuning
- Utilizes advanced visual-textual alignment techniques
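As a sketch of how the settings above appear in practice, the model can be loaded through Hugging Face Transformers with BF16 weights and the 8192-token multimodal context. The checkpoint name `AIDC-AI/Ovis1.6-Gemma2-9B` and the `multimodal_max_length` / tokenizer-accessor calls follow the published Ovis usage examples, but treat the exact signatures as assumptions and check the model card's own code; imports are deferred inside the function so the sketch stands alone without a GPU.

```python
def load_ovis(model_id: str = "AIDC-AI/Ovis1.6-Gemma2-9B"):
    """Load Ovis1.6 in BF16 (downloads ~10B parameters; a GPU is expected).

    Assumptions from published Ovis usage examples, not this card:
    - `multimodal_max_length=8192` sets the multimodal context window
    - `trust_remote_code=True` is required because Ovis ships custom modeling code
    - the returned model exposes text/visual tokenizer accessors
    """
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,      # BF16 tensor type, as noted above
        multimodal_max_length=8192,      # multimodal context length
        trust_remote_code=True,
    )
    return model, model.get_text_tokenizer(), model.get_visual_tokenizer()
```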
Core Capabilities
- High-resolution image processing and understanding
- Advanced text-image alignment and comprehension
- Efficient multimodal processing with relatively small parameter count
- Batch inference support for multiple images
- Flexible prompt formatting with image-text combinations
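The flexible image-text prompt formatting and multi-image batching can be illustrated with a minimal, self-contained sketch. The `build_prompt` helper and the `<image>` placeholder convention are assumptions modeled on common Ovis usage examples, not definitions from this card:

```python
def build_prompt(question: str, num_images: int = 1) -> str:
    """Prepend one '<image>' placeholder per input image to the text query.

    The placeholder string is an assumption based on typical Ovis examples;
    the model's preprocessing replaces each one with visual embeddings.
    """
    placeholders = "\n".join("<image>" for _ in range(num_images))
    return f"{placeholders}\n{question}"

# A batch of prompts for batch inference: one single-image query,
# one two-image comparison query.
batch = [
    build_prompt("Describe this photo."),
    build_prompt("What differs between these two images?", num_images=2),
]
```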
Frequently Asked Questions
Q: What makes this model unique?
Its ability to achieve state-of-the-art performance with just 10.2B parameters, significantly fewer than competing models, while maintaining high-quality multimodal understanding through structural embedding alignment.
Q: What are the recommended use cases?
The model excels in image-text tasks including image description, visual question answering, and multimodal understanding. It's particularly suitable for applications requiring efficient processing of combined image and text inputs.