llava-phi-3-mini-gguf

Maintained by: xtuner

Parameter Count: 303M
Model Type: Image-to-Text
Visual Encoder: CLIP-ViT-Large-patch14-336
Base Model: Phi-3-mini-4k-instruct
Author: xtuner

What is llava-phi-3-mini-gguf?

llava-phi-3-mini-gguf is a compact multimodal model that combines Microsoft's Phi-3-mini-4k-instruct with CLIP vision capabilities, optimized in GGUF format for efficient deployment. It represents a lightweight alternative to larger multimodal models while maintaining competitive performance across various vision-language tasks.

Implementation Details

The model is built on a hybrid architecture that pairs a CLIP-ViT-Large visual encoder with the Phi-3-mini language model, connected through an MLP projector (a sketch follows the list below). Images are processed at 336x336 resolution; the visual encoder and language model are kept frozen during the pretraining (alignment) stage and fully trained during fine-tuning.

  • Pretrained on ShareGPT4V-PT and fine-tuned on InternVL-SFT data
  • Fine-tuning updates the full LLM and the full ViT
  • Available in both FP16 and INT4 quantized GGUF versions
  • Uses the Phi-3-instruct chat template
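
As a rough illustration of the wiring described above (CLIP visual encoder, MLP projector, Phi-3-mini language model), the following PyTorch sketch shows how image patch features are projected into the language model's embedding space. The class name, dimensions, and tensor shapes are illustrative assumptions based on the listed components, not the actual xtuner implementation.

```python
import torch
import torch.nn as nn

class LlavaStyleProjector(nn.Module):
    """Two-layer MLP that maps CLIP patch features into the LLM embedding space.

    Dimensions are assumptions: CLIP-ViT-Large-patch14-336 emits 1024-d patch
    features, and Phi-3-mini uses a 3072-d hidden size.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 3072):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the ViT
        return self.mlp(patch_features)


# A 336x336 image split into 14x14 patches yields 24 * 24 = 576 patch tokens.
projector = LlavaStyleProjector()
dummy_patches = torch.randn(1, 576, 1024)
image_tokens = projector(dummy_patches)  # (1, 576, 3072)
# These projected tokens are spliced into the Phi-3-mini input sequence
# alongside ordinary text token embeddings.
print(image_tokens.shape)
```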

Core Capabilities

  • Strong performance on MMBench Test (69.2)
  • Competitive MMMU Val score (41.4)
  • Strong resistance to object hallucination (POPE: 87.3)
  • Robust scientific QA capabilities (ScienceQA Test: 73.7)

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its efficient design, combining a small parameter count (303M) with competitive performance across multiple benchmarks. It is particularly notable for achieving results close to those of much larger models such as LLaVA-v1.5-7B despite its compact size.

Q: What are the recommended use cases?

The model is well-suited for tasks requiring visual understanding and text generation, including image description, visual question answering, and scientific reasoning. Its GGUF format makes it particularly suitable for deployment in resource-constrained environments.
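
For example, a GGUF build can be run locally with llama-cpp-python, a common runtime for GGUF files. The snippet below is a minimal sketch: the file names and image path are placeholders, and it reuses the generic LLaVA-1.5 chat handler, so the prompt formatting may need to be adapted to the Phi-3-instruct template mentioned above.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Placeholder paths: point these at the downloaded language-model GGUF
# and the multimodal projector (mmproj) GGUF.
chat_handler = Llava15ChatHandler(clip_model_path="llava-phi-3-mini-mmproj-f16.gguf")
llm = Llama(
    model_path="llava-phi-3-mini-int4.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # matches the 4k context of Phi-3-mini-4k-instruct
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```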
