Ovis1.6-Llama3.2-3B
| Property | Value |
|---|---|
| Parameter Count | 4.14B |
| Model Type | Multimodal LLM |
| Architecture | Siglip-400M + Llama-3.2-3B-Instruct |
| License | Apache 2.0 |
| Paper | arXiv:2405.20797 |
What is Ovis1.6-Llama3.2-3B?
Ovis1.6-Llama3.2-3B is a Multimodal Large Language Model (MLLM) designed for edge-side multimodal tasks, where it achieves state-of-the-art results among open-source models of comparable size. Built on the Ovis architecture, it structurally aligns visual and textual embeddings, which makes it particularly effective for local intelligence and on-device computing scenarios.
Implementation Details
The model combines a Siglip-400M visual encoder with a Llama-3.2-3B-Instruct language model, enhanced through DPO training after instruction tuning. It supports high-resolution image processing and runs in BF16 precision; a minimal loading sketch follows the list below. Key improvements in this version include:
- Improved high-resolution image processing capabilities
- Trained on a larger, more diverse, and higher-quality dataset
- Enhanced training process with DPO methodology
- Supports batch inference for multiple images
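As a rough illustration of the BF16, 8,192-token setup described above, the sketch below loads the model through Hugging Face transformers with `trust_remote_code`. The `multimodal_max_length` keyword and the `get_text_tokenizer` / `get_visual_tokenizer` helpers follow the pattern used in the Ovis release examples; treat the exact names as assumptions and defer to the usage snippet on the official model page.

```python
import torch
from transformers import AutoModelForCausalLM

# Load Ovis1.6-Llama3.2-3B in BF16 with its custom modeling code.
# multimodal_max_length mirrors the 8192-token limit noted above;
# the kwarg name is taken from the Ovis sample code and may differ.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis1.6-Llama3.2-3B",
    torch_dtype=torch.bfloat16,
    multimodal_max_length=8192,
    trust_remote_code=True,
).cuda()

# The remote code exposes separate text and visual tokenizers
# (helper names assumed from the Ovis releases).
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()
```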
Core Capabilities
- State-of-the-art performance in OpenCompass benchmark for models under 4B parameters
- Surpasses Llama-3.2-11B-Vision-Instruct in benchmarks
- Efficient edge computing and on-device processing
- Multimodal processing with a context length of up to 8,192 tokens (see the inference sketch below)
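The snippet below sketches a single-image query, continuing from the loading example above (it reuses `model`, `text_tokenizer`, and `visual_tokenizer`). The `<image>` placeholder, the `preprocess_inputs` helper, and the generate arguments follow the pattern of the Ovis release examples and are assumptions here rather than a verified API; the image path is a hypothetical local file. Batch inference over multiple images follows the same pattern with stacked inputs.

```python
import torch
from PIL import Image

# Build a multimodal prompt: the <image> placeholder marks where the
# visual tokens are inserted (convention used in the Ovis examples).
image = Image.open("example.jpg")  # hypothetical local image
query = "<image>\nDescribe this image in one paragraph."

# preprocess_inputs is provided by the model's remote code (assumed name);
# it returns the formatted prompt, token ids, and pixel values.
prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image])
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)

input_ids = input_ids.unsqueeze(0).to(model.device)
attention_mask = attention_mask.unsqueeze(0).to(model.device)
pixel_values = [pixel_values.to(dtype=visual_tokenizer.dtype,
                                device=visual_tokenizer.device)]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True,
    )[0]

print(text_tokenizer.decode(output_ids, skip_special_tokens=True))
```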
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its performance-to-size ratio: it achieves state-of-the-art results with just 4.14B parameters, which makes it well suited to edge computing applications while maintaining high accuracy on multimodal tasks.
Q: What are the recommended use cases?
The model is particularly well suited to edge computing scenarios, local intelligence applications, and on-device multimodal processing, where efficient resource usage matters as much as strong performance on image-text tasks.