llava-phi-3-mini-gguf
| Property | Value |
|---|---|
| Parameter Count | 303M |
| Model Type | Image-to-Text |
| Visual Encoder | CLIP-ViT-Large-patch14-336 |
| Base Model | Phi-3-mini-4k-instruct |
| Author | xtuner |
What is llava-phi-3-mini-gguf?
llava-phi-3-mini-gguf is a compact multimodal model that combines Microsoft's Phi-3-mini-4k-instruct language model with CLIP vision capabilities, converted to GGUF format for efficient deployment. It represents a lightweight alternative to larger multimodal models while maintaining competitive performance across various vision-language tasks.
Implementation Details
The model is built on a hybrid architecture that pairs a CLIP-ViT-Large visual encoder with the Phi-3-mini language model, connected through an MLP projector. Images are processed at 336x336 resolution; the visual encoder is kept frozen during the pretraining (alignment) stage and unfrozen for fine-tuning (see the sketch after the list below).
- Pretrained on ShareGPT4V-PT and fine-tuned on InternVL-SFT
- Fine-tuning updates the full LLM and the full ViT
- Available in both FP16 and INT4 quantized versions
- Supports Phi-3-instruct chat template
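As a rough illustration of this wiring, the sketch below loads the two published backbones from Hugging Face and connects them with a two-layer MLP projector in the LLaVA-1.5 style. The projector layout (two linear layers with a GELU) is an assumption for illustration rather than the exact released checkpoint; the hidden sizes (1024 for CLIP-ViT-L/336, 3072 for Phi-3-mini) match the published configs.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Backbones named in the model card. The trained projector weights ship with
# the released checkpoint; the MLP below is an illustrative stand-in.
vision = AutoModel.from_pretrained("openai/clip-vit-large-patch14-336").vision_model
llm = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True
)
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

class MLPProjector(nn.Module):
    """Maps CLIP patch embeddings (1024-d) into Phi-3-mini's hidden size (3072-d)."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        return self.net(patch_embeds)

projector = MLPProjector()

# A 336x336 image yields 24x24 = 576 patch tokens from CLIP (plus one CLS token).
pixel_values = torch.randn(1, 3, 336, 336)
with torch.no_grad():
    patch_embeds = vision(pixel_values).last_hidden_state[:, 1:]  # drop CLS -> (1, 576, 1024)
    image_tokens = projector(patch_embeds)                        # (1, 576, 3072)

    # Prepend the projected image tokens to the text embeddings and run the LLM.
    text_ids = tok("Describe this image.", return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(text_ids)
    inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
    outputs = llm(inputs_embeds=inputs_embeds)  # logits over the combined sequence
```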
Core Capabilities
- Strong performance on MMBench Test (69.2)
- Competitive MMMU Val score (41.4)
- Low object-hallucination rate (POPE: 87.3)
- Robust scientific QA capabilities (ScienceQA Test: 73.7)
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its efficient design, combining a small parameter count (303M) with competitive performance across multiple benchmarks. It's particularly notable for achieving strong results despite its compact size compared to larger models like LLaVA-v1.5-7B.
Q: What are the recommended use cases?
The model is well-suited for tasks requiring visual understanding and text generation, including image description, visual question answering, and scientific reasoning. Its GGUF format makes it particularly suitable for deployment in resource-constrained environments.
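For local GGUF inference, one common route is llama-cpp-python with a LLaVA-style chat handler, as in the hedged sketch below. The file names (ggml-model-int4.gguf, mmproj-f16.gguf) are placeholders for the released language-model and vision-projector GGUF files, and the LLaVA-1.5 chat handler is only an approximation: the model card specifies the Phi-3-instruct chat template, so the prompt format may need adjusting or an alternative frontend such as llama.cpp's LLaVA example.

```python
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

def to_data_uri(path: str) -> str:
    """Embed a local image as a base64 data URI for the chat handler."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# Placeholder file names: the INT4 language model and the f16 vision projector.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-f16.gguf")
llm = Llama(
    model_path="ggml-model-int4.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # Phi-3-mini-4k context window; leave room for image tokens
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": to_data_uri("example.jpg")}},
                {"type": "text", "text": "What is shown in this image?"},
            ],
        }
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```

The INT4 quantization keeps memory usage low enough for CPU-only or small-GPU machines, which is the main reason to prefer the GGUF release in constrained deployments.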