Ovis1.6-Llama3.2-3B

Maintained By: AIDC-AI

  • Parameter Count: 4.14B
  • Model Type: Multimodal LLM
  • Architecture: Siglip-400M + Llama-3.2-3B-Instruct
  • License: Apache 2.0
  • Paper: arXiv:2405.20797

What is Ovis1.6-Llama3.2-3B?

Ovis1.6-Llama3.2-3B is a Multimodal Large Language Model (MLLM) that leads models of its size on multimodal benchmarks while remaining small enough for edge-side deployment. Built on the Ovis architecture, it structurally aligns visual and textual embeddings, making it particularly effective for local intelligence and on-device computing scenarios.

Implementation Details

The model combines a Siglip-400M visual encoder with a Llama-3.2-3B-Instruct language model, further enhanced through DPO training after instruction tuning. It supports high-resolution image processing and runs in BF16 precision (see the loading sketch after the list below).

  • Improved high-resolution image processing capabilities
  • Trained on a larger, more diverse, and higher-quality dataset
  • Enhanced training process with DPO methodology
  • Supports batch inference for multiple images (see the batch sketch below)
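
Below is a minimal, hedged sketch of single-image inference with the Hugging Face transformers library. The model ships custom code, so loading requires trust_remote_code=True; the Ovis-specific pieces shown here (the multimodal_max_length argument and the get_text_tokenizer, get_visual_tokenizer, and preprocess_inputs helpers) are assumptions based on the usual Ovis loading pattern and should be verified against the official model card. The image path and prompt are placeholders.

```python
# Hedged single-image inference sketch; Ovis-specific helpers are assumptions
# pulled in via the repository's custom code (trust_remote_code=True).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis1.6-Llama3.2-3B",
    torch_dtype=torch.bfloat16,      # BF16 precision, as noted above
    multimodal_max_length=8192,      # assumed kwarg for the 8192-token context
    trust_remote_code=True,
).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# Build a query with an <image> placeholder followed by the text prompt.
image = Image.open("example.jpg")    # hypothetical local image path
query = "<image>\nDescribe this image."
prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image])
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)

input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
pixel_values = [pixel_values.to(dtype=visual_tokenizer.dtype,
                                device=visual_tokenizer.device)]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        max_new_tokens=512,
        do_sample=False,
        pad_token_id=text_tokenizer.pad_token_id,
        eos_token_id=model.generation_config.eos_token_id,
        use_cache=True,
    )[0]
print(text_tokenizer.decode(output_ids, skip_special_tokens=True))
```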

Core Capabilities

  • State-of-the-art performance on the OpenCompass benchmark among models under 4B parameters
  • Surpasses Llama-3.2-11B-Vision-Instruct on benchmark evaluations
  • Efficient edge computing and on-device processing
  • Multimodal processing with a context length of up to 8192 tokens
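
To illustrate the batch-inference support noted under Implementation Details, here is a hedged sketch that preprocesses several image-prompt pairs, left-pads them to a common length, and generates for the whole batch at once. It reuses the model, text_tokenizer, and visual_tokenizer objects from the sketch above; the per-sample preprocess_inputs call and the list-of-tensors pixel_values format are assumptions to check against the model card, and the image paths are placeholders.

```python
# Hedged batch-inference sketch (reuses model/tokenizers from the sketch above).
batch = [
    ("photo_a.jpg", "Describe this image."),              # hypothetical paths
    ("photo_b.jpg", "What text appears in this image?"),
]

batch_input_ids, batch_attention_mask, batch_pixel_values = [], [], []
for image_path, text in batch:
    image = Image.open(image_path)
    query = f"<image>\n{text}"
    _, input_ids, pixel_values = model.preprocess_inputs(query, [image])
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
    batch_input_ids.append(input_ids.to(device=model.device))
    batch_attention_mask.append(attention_mask.to(device=model.device))
    batch_pixel_values.append(pixel_values.to(dtype=visual_tokenizer.dtype,
                                              device=visual_tokenizer.device))

# Left-pad variable-length prompts so every sequence ends at the same position.
def left_pad(seqs, value):
    flipped = [s.flip(dims=[0]) for s in seqs]
    padded = torch.nn.utils.rnn.pad_sequence(flipped, batch_first=True,
                                             padding_value=value)
    return padded.flip(dims=[1])

input_ids = left_pad(batch_input_ids, text_tokenizer.pad_token_id)
attention_mask = left_pad(batch_attention_mask, False)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=batch_pixel_values,   # assumed list-of-tensors format
        attention_mask=attention_mask,
        max_new_tokens=512,
        do_sample=False,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True,
    )
for ids in output_ids:
    print(text_tokenizer.decode(ids, skip_special_tokens=True))
```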

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its performance-to-size ratio: with just 4.14B parameters it achieves state-of-the-art results among models of its size, making it well suited for edge computing applications that demand high accuracy on multimodal tasks.

Q: What are the recommended use cases?

The model is particularly well suited for edge computing scenarios, local intelligence applications, and on-device multimodal processing, where resource efficiency matters as much as strong performance on image-text tasks.
