nanoLLaVA
| Property | Value |
|---|---|
| Parameter Count | 1.05B |
| Model Type | Vision-Language Model |
| License | Apache-2.0 |
| Tensor Type | BF16 |
What is nanoLLaVA?
nanoLLaVA is a compact vision-language model designed for edge-device deployment. It builds on Quyen-SE-v0.1 (a Qwen1.5-0.5B fine-tune) as its base LLM and google/siglip-so400m-patch14-384 as its vision encoder, and it delivers strong benchmark results despite its 1.05B-parameter footprint.
Implementation Details
The model follows the ChatML standard for prompt formatting and can be loaded directly with the Hugging Face transformers library. It runs on both CPU and CUDA devices, with inference handled by PyTorch; see the sketch after the list below.
- Base LLM: Quyen-SE-v0.1 (Qwen1.5-0.5B)
- Vision Encoder: google/siglip-so400m-patch14-384
- Tensor Format: BF16
- Comprehensive multimodal understanding capabilities
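The snippet below is a minimal inference sketch following the LLaVA-style transformers workflow. The repo id (`qnguyen3/nanoLLaVA`), the `process_images()` helper, and the `-200` image-token id are assumptions about the repo's custom remote code and should be checked against the official model card.

```python
# Minimal inference sketch; repo id, process_images(), and the -200 image token
# id are assumptions about the model's custom (trust_remote_code) implementation.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qnguyen3/nanoLLaVA"  # assumed Hugging Face repo id
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32  # BF16 weights, FP32 fallback on CPU

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=dtype,
    trust_remote_code=True,  # pulls in the model's custom vision-language code
).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# ChatML-style prompt with an <image> placeholder for the visual input
messages = [{"role": "user", "content": "<image>\nDescribe this image in detail."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Split around the placeholder and splice in the image token id (assumed -200)
chunks = [tokenizer(chunk).input_ids for chunk in text.split("<image>")]
input_ids = torch.tensor(chunks[0] + [-200] + chunks[1], dtype=torch.long).unsqueeze(0).to(device)

image = Image.open("example.jpg")
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)

output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=256, use_cache=True)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```

Here `apply_chat_template` renders the messages into ChatML turns (`<|im_start|>user ... <|im_end|>`) before the image placeholder is spliced into the token sequence, which keeps the prompt consistent with the model's training format.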
Core Capabilities
- VQA v2 Score: 70.84
- TextVQA Performance: 46.71
- ScienceQA Accuracy: 58.97
- POPE Score: 84.1
- MMMU Test Performance: 28.6
- GQA Score: 54.79
Frequently Asked Questions
Q: What makes this model unique?
nanoLLaVA stands out for its efficient design that enables deployment on edge devices while maintaining strong performance across various vision-language tasks. Its compact size of 1.05B parameters makes it particularly suitable for resource-constrained environments.
Q: What are the recommended use cases?
The model is ideal for applications requiring visual question answering, image description, and general vision-language understanding tasks on edge devices. It's particularly effective for scenarios where computational resources are limited but reliable multimodal understanding is necessary.
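For a CPU-only deployment on constrained hardware, a rough configuration sketch is shown below. The repo id is the same assumption as in the earlier example, and the thread count and generation limits are illustrative values rather than official recommendations.

```python
# Rough CPU-only configuration sketch for resource-constrained hosts; repo id
# is assumed, and the numeric limits here are illustrative, not official.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_num_threads(4)  # cap CPU threads on small edge boards

model_id = "qnguyen3/nanoLLaVA"  # assumed repo id, as above
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,   # safest dtype on CPUs without BF16 support
    low_cpu_mem_usage=True,      # stream weights to keep peak RAM down
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Short, greedy generations keep latency predictable on weak hardware.
generation_kwargs = dict(max_new_tokens=64, do_sample=False, use_cache=True)
```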