MC-LLaVA-3b

Maintained by: visheratin

  • Parameter Count: 3.22B
  • Model Type: Vision-Language Model
  • License: MIT
  • Tensor Type: BF16

What is MC-LLaVA-3b?

MC-LLaVA-3b is a vision-language model that introduces a multi-crop approach to image processing. Instead of generating N embeddings for a single image, it splits the image into M parts (crops) and generates K tokens for each crop, capturing fine-grained detail from smaller image segments while keeping the total token count reasonable. The sketch below illustrates the idea.
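
As an illustration only, here is a minimal sketch of the multi-crop step: splitting an image into a grid of crops, each of which would then be encoded into K tokens by the vision tower. The grid size and non-overlapping layout are assumptions made for clarity; the model's own processor implements its actual cropping scheme.

```python
from PIL import Image

def split_into_crops(image: Image.Image, grid: int = 3) -> list[Image.Image]:
    """Split an image into a grid x grid set of crops (illustrative only;
    MC-LLaVA's processor implements its own cropping logic)."""
    width, height = image.size
    crop_w, crop_h = width // grid, height // grid
    crops = []
    for row in range(grid):
        for col in range(grid):
            box = (col * crop_w, row * crop_h, (col + 1) * crop_w, (row + 1) * crop_h)
            crops.append(image.crop(box))
    return crops

# Each crop is encoded into K tokens, so a 3x3 grid yields 9 * K image tokens
# in total, rather than one long embedding sequence for the full image.
image = Image.open("example.jpg")
crops = split_into_crops(image, grid=3)
print(f"{len(crops)} crops of size {crops[0].size}")
```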

Implementation Details

The model is fine-tuned from a Phi-2 merge and uses the SigLIP 400M vision tower for image encoding. It expects ChatML-formatted prompts, and the number of crops and image tokens can be adjusted per image to balance detail against compute (see the usage sketch after the list below).

  • Fine-tuned from a Phi-2 merge base model
  • Incorporates SigLIP 400M vision tower
  • Supports variable crop sizes and token counts
  • Uses ChatML formatting for prompts
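
Below is a minimal usage sketch, assuming the model's remote-code interface on Hugging Face (`AutoModel`/`AutoProcessor` with `trust_remote_code=True`). The processor arguments `max_crops` and `num_tokens` are assumptions based on the crop parameters described above, and the exact call signature may differ between model revisions.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# trust_remote_code=True is required because the model ships custom modeling code.
model = AutoModel.from_pretrained(
    "visheratin/MC-LLaVA-3b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)

# ChatML-formatted prompt with an image placeholder.
prompt = (
    "<|im_start|>user\n<image>\nWhat is shown in this picture?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
image = Image.open("example.jpg")

# `max_crops` and `num_tokens` are the tunable crop parameters mentioned above;
# the argument names here are an assumption, not a confirmed API.
inputs = processor(prompt, [image], model, max_crops=100, num_tokens=728)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.tokenizer.decode(output[0], skip_special_tokens=True))
```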

Core Capabilities

  • TextVQA: 50.9% accuracy
  • GQA: 59.5% accuracy
  • VQAv2: 76.72% accuracy
  • VizWiz: 32.68% accuracy
  • V*-bench performance: OCR (56.66%), GPT4V-hard (52.94%), Direct attributes (40.86%), Relative position (56.57%)

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive multi-crop approach allows it to process image segments more efficiently than traditional vision-language models, providing detailed analysis of specific image regions while maintaining computational efficiency.

Q: What are the recommended use cases?

The model excels in visual question answering, image understanding, and detailed visual analysis tasks. It's particularly effective for applications requiring precise analysis of specific image regions and general visual comprehension tasks.
