nanoLLaVA

Maintained by: qnguyen3

  • Parameter Count: 1.05B
  • Model Type: Vision-Language Model
  • License: Apache-2.0
  • Tensor Type: BF16

What is nanoLLaVA?

nanoLLaVA is a compact yet powerful vision-language model designed specifically for edge device deployment. Built on the foundation of Quyen-SE-v0.1 (Qwen1.5-0.5B) as its base LLM and utilizing google/siglip-so400m-patch14-384 as its vision encoder, this model achieves impressive performance despite its relatively small size of 1.05B parameters.

Implementation Details

The model uses the ChatML standard for prompt formatting and can be loaded directly with the Hugging Face transformers library. It runs on both CPU and CUDA devices, with inference handled in PyTorch.
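
As an illustration, a single-turn ChatML prompt looks like the following. The system message wording is only an example, and the `<image>` placeholder (a LLaVA-style convention assumed here, not stated on this page) marks where the image features are spliced into the sequence:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<image>
Describe this image in detail.<|im_end|>
<|im_start|>assistant
```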

  • Base LLM: Quyen-SE-v0.1 (Qwen1.5-0.5B)
  • Vision Encoder: google/siglip-so400m-patch14-384
  • Tensor Format: BF16
  • General multimodal understanding, including visual question answering and image description
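
Putting these pieces together, the sketch below shows minimal inference with transformers. It assumes the Hugging Face repository id `qnguyen3/nanoLLaVA` and relies on helpers exposed by the model's custom remote code (notably `model.process_images`, the `images=` argument to `generate`, and the `-200` image-token placeholder), so treat it as a starting point rather than a verified recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the model and tokenizer; trust_remote_code pulls in nanoLLaVA's custom
# architecture and image-processing helpers (assumed repo id: qnguyen3/nanoLLaVA).
model = AutoModelForCausalLM.from_pretrained(
    'qnguyen3/nanoLLaVA',
    torch_dtype=torch.bfloat16 if device == 'cuda' else torch.float32,
    trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(
    'qnguyen3/nanoLLaVA',
    trust_remote_code=True)

# Build a ChatML prompt; the chat template is applied by the tokenizer.
prompt = 'Describe this image in detail.'
messages = [{'role': 'user', 'content': f'<image>\n{prompt}'}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)

# Splice the image placeholder token (-200, a LLaVA-style convention assumed here)
# between the tokenized text chunks on either side of '<image>'.
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(
    text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0).to(device)

# Preprocess the image with the helper provided by the model's remote code.
image = Image.open('/path/to/image.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(
    input_ids, images=image_tensor, max_new_tokens=512, use_cache=True)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```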

Core Capabilities

Reported benchmark scores:

  • VQA v2: 70.84
  • TextVQA: 46.71
  • ScienceQA: 58.97
  • POPE: 84.1
  • MMMU (test): 28.6
  • GQA: 54.79

Frequently Asked Questions

Q: What makes this model unique?

nanoLLaVA stands out for its efficient design that enables deployment on edge devices while maintaining strong performance across various vision-language tasks. Its compact size of 1.05B parameters makes it particularly suitable for resource-constrained environments.

Q: What are the recommended use cases?

The model is ideal for applications requiring visual question answering, image description, and general vision-language understanding tasks on edge devices. It's particularly effective for scenarios where computational resources are limited but reliable multimodal understanding is necessary.
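
Where no GPU is available, a plain CPU load is a reasonable starting point. This is a sketch under the same assumptions as the example above; precision and latency trade-offs depend on the target hardware:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical CPU-only load for resource-constrained deployment.
# float32 is the safest default on CPU; bfloat16 may also work on recent chips.
model = AutoModelForCausalLM.from_pretrained(
    'qnguyen3/nanoLLaVA',
    torch_dtype=torch.float32,
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'qnguyen3/nanoLLaVA',
    trust_remote_code=True)
# Prompt construction, image preprocessing, and generation then follow the same
# steps as in the Implementation Details example above.
```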
