MC-LLaVA-3b
| Property | Value |
|---|---|
| Parameter Count | 3.22B |
| Model Type | Vision-Language Model |
| License | MIT |
| Tensor Type | BF16 |
What is MC-LLaVA-3b?
MC-LLaVA-3b is a vision-language model built around a multi-crop approach to image processing. Instead of generating N embeddings for a single image, it generates K tokens for each of M crops of the image, which lets it extract visual detail from smaller image regions while keeping the overall visual token count reasonable.
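As a rough illustration of the crop-splitting idea (not the model's actual preprocessing code), the sketch below divides an image into a grid of crops; in the model, each crop would then be encoded by the vision tower into a small number of tokens. The grid size and file path here are hypothetical.

```python
from PIL import Image

def make_crops(image: Image.Image, grid: int = 3):
    """Split an image into a grid x grid set of crops (multi-crop idea).

    Each crop would then be encoded into K tokens, so the total visual
    token count is K * grid * grid rather than one long embedding
    sequence for the full-resolution image.
    """
    width, height = image.size
    crop_w, crop_h = width // grid, height // grid
    crops = []
    for row in range(grid):
        for col in range(grid):
            box = (col * crop_w, row * crop_h,
                   (col + 1) * crop_w, (row + 1) * crop_h)
            crops.append(image.crop(box))
    return crops

# Example: 9 crops of a local image (the path is illustrative).
crops = make_crops(Image.open("example.jpg"))
print(len(crops))  # 9
```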
Implementation Details
The model is built on a Phi-2 merge as its language backbone and uses the SigLIP 400M vision tower for image encoding. It expects prompts in the ChatML format, and the crop count and token count can be tuned at inference time to trade off detail against speed (see the usage sketch after the list below).
- Fine-tuned from Phi-2 merge base model
- Incorporates SigLIP 400M vision tower
- Supports variable crop sizes and token counts
- Uses ChatML formatting for prompts
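For concreteness, here is a minimal usage sketch under the following assumptions: the model is loaded from the Hugging Face Hub with custom code enabled (`trust_remote_code=True`), the repo id is `visheratin/MC-LLaVA-3b`, and the custom processor accepts crop parameters named `max_crops` and `num_tokens`. The repo id, processor signature, and parameter names are illustrative assumptions, not confirmed API.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Repo id is an assumption for illustration.
model_id = "visheratin/MC-LLaVA-3b"
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# ChatML-formatted prompt; <image> marks where the crop tokens are
# spliced into the sequence (standard LLaVA-style convention).
prompt = (
    "<|im_start|>system\n"
    "You are a helpful visual assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "<image>\n"
    "Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

image = Image.open("example.jpg")

# max_crops / num_tokens are the kind of crop parameters described
# above; this processor call signature is hypothetical custom code.
inputs = processor(prompt, [image], model, max_crops=100, num_tokens=728)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=200)
print(processor.tokenizer.decode(output[0], skip_special_tokens=True))
```

Under these assumptions, lowering `max_crops` or `num_tokens` would reduce the visual token budget and speed up inference at the cost of fine-grained detail.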
Core Capabilities
- TextVQA: 50.9% accuracy
- GQA: 59.5% accuracy
- VQAv2: 76.72% accuracy
- VizWiz: 32.68% accuracy
- V*-bench performance: OCR (56.66%), GPT4V-hard (52.94%), Direct attributes (40.86%), Relative position (56.57%)
Frequently Asked Questions
Q: What makes this model unique?
Its multi-crop approach sets it apart from traditional vision-language models: by encoding image crops into small sets of tokens, it can analyze specific image regions in detail while keeping the overall computation manageable.
Q: What are the recommended use cases?
The model excels at visual question answering, image understanding, and detailed visual analysis. It is particularly effective for applications that require precise analysis of specific image regions, as well as for general visual comprehension tasks.