MC-LLaVA-3b
| Property | Value |
|---|---|
| Parameter Count | 3.22B |
| Model Type | Vision-Language Model |
| License | MIT |
| Tensor Type | BF16 |
What is MC-LLaVA-3b?
MC-LLaVA-3b is a vision-language model built around a multi-crop approach to image processing. Instead of generating N embeddings for a single image, it generates K tokens for each of M crops of the image, which lets it extract visual detail from smaller image regions while keeping the overall visual token count reasonable.
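As a rough illustration of the crop-splitting idea (not the model's actual preprocessing code), the sketch below divides an image into a grid of crops; in the model, each crop would then be encoded by the vision tower into a small number of tokens. The grid size and file path here are hypothetical.

```python
from PIL import Image

def make_crops(image: Image.Image, grid: int = 3):
    """Split an image into a grid x grid set of crops (multi-crop idea).

    Each crop would then be encoded into K tokens, so the total visual
    token count is K * grid * grid rather than one long embedding
    sequence for the full-resolution image.
    """
    width, height = image.size
    crop_w, crop_h = width // grid, height // grid
    crops = []
    for row in range(grid):
        for col in range(grid):
            box = (col * crop_w, row * crop_h,
                   (col + 1) * crop_w, (row + 1) * crop_h)
            crops.append(image.crop(box))
    return crops

# Example: 9 crops of a local image (the path is illustrative).
crops = make_crops(Image.open("example.jpg"))
print(len(crops))  # 9
```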
Implementation Details
The model is built on a Phi-2 merge as its language backbone and uses the SigLIP 400M vision tower for image encoding. It expects prompts in the ChatML format, and the crop count and token count can be tuned at inference time to trade off detail against speed (see the usage sketch after the list below).
- Fine-tuned from Phi-2 merge base model
- Incorporates SigLIP 400M vision tower
- Supports variable crop sizes and token counts
- Uses ChatML formatting for prompts
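For concreteness, here is a minimal usage sketch under the following assumptions: the model is loaded from the Hugging Face Hub with custom code enabled (`trust_remote_code=True`), the repo id is `visheratin/MC-LLaVA-3b`, and the custom processor accepts crop parameters named `max_crops` and `num_tokens`. The repo id, processor signature, and parameter names are illustrative assumptions, not confirmed API.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Repo id is an assumption for illustration.
model_id = "visheratin/MC-LLaVA-3b"
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# ChatML-formatted prompt; <image> marks where the crop tokens are
# spliced into the sequence (standard LLaVA-style convention).
prompt = (
    "<|im_start|>system\n"
    "You are a helpful visual assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "<image>\n"
    "Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

image = Image.open("example.jpg")

# max_crops / num_tokens are the kind of crop parameters described
# above; this processor call signature is hypothetical custom code.
inputs = processor(prompt, [image], model, max_crops=100, num_tokens=728)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=200)
print(processor.tokenizer.decode(output[0], skip_special_tokens=True))
```

Under these assumptions, lowering `max_crops` or `num_tokens` would reduce the visual token budget and speed up inference at the cost of fine-grained detail.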
Core Capabilities
- TextVQA: 50.9% accuracy
- GQA: 59.5% accuracy
- VQAv2: 76.72% accuracy
- VizWiz: 32.68% accuracy
- V*-bench performance: OCR (56.66%), GPT4V-hard (52.94%), Direct attributes (40.86%), Relative position (56.57%)
Frequently Asked Questions
Q: What makes this model unique?
Its multi-crop approach sets it apart from traditional vision-language models: by encoding image crops into small sets of tokens, it can analyze specific image regions in detail while keeping the overall computation manageable.
Q: What are the recommended use cases?
The model excels at visual question answering, image understanding, and detailed visual analysis. It is particularly effective for applications that require precise analysis of specific image regions, as well as for general visual comprehension tasks.