llava-llama-3-8b-v1_1
| Property | Value |
|---|---|
| Parameter Count | 8.03B |
| Model Type | Image-Text-to-Text |
| Architecture | LLaVA with CLIP-ViT-Large |
| Tensor Type | FP16 |
What is llava-llama-3-8b-v1_1?
llava-llama-3-8b-v1_1 is a multimodal model that combines Meta's Llama-3-8B-Instruct language model with a CLIP-ViT-Large visual encoder. It is designed for complex image-text interactions and is trained on the ShareGPT4V-PT (pretraining) and InternVL-SFT (fine-tuning) datasets for improved vision-language performance.
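As a quick orientation, below is a minimal inference sketch using the Hugging Face Transformers LLaVA integration. The repository id `xtuner/llava-llama-3-8b-v1_1-transformers`, the image path, and the Llama-3-style prompt format are illustrative assumptions; check the model card of the checkpoint you actually use for the exact identifiers and chat template.

```python
# Minimal inference sketch (assumed repo id and prompt format; verify against
# the checkpoint's own documentation before relying on this).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"  # assumed Transformers-format checkpoint

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # matches the FP16 tensor type listed above
    device_map="auto",
)

image = Image.open("example.jpg")  # placeholder image path
# Llama-3 chat formatting with the <image> placeholder used by LLaVA processors (assumed).
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "<image>\nWhat is shown in this image?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Strip the prompt tokens and decode only the newly generated answer.
answer = processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```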
Implementation Details
The model pairs a CLIP-L visual encoder with an MLP projector at an input resolution of 336×336. Training follows the standard LLaVA recipe: during pretraining the LLM and ViT are frozen (only the projector is trained), and during fine-tuning the full LLM is trained while the ViT is adapted with LoRA. A minimal sketch of the projector path follows the list below.
- Visual Encoder: CLIP-ViT-Large-patch14-336
- Base Model: meta-llama/Meta-Llama-3-8B-Instruct
- Training Strategy: Full LLM with LoRA ViT fine-tuning
- Dataset Size: 1246K pretraining + 1268K fine-tuning samples
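To make the wiring above concrete, here is an illustrative PyTorch sketch of the projector path: CLIP-ViT-L/14-336 patch features are mapped into the Llama-3 embedding space by an MLP before being interleaved with text tokens. The hidden sizes (1024 for CLIP-L, 4096 for Llama-3-8B) and the two-layer GELU MLP follow the common LLaVA recipe and are assumptions, not code extracted from this checkpoint.

```python
import torch
import torch.nn as nn


class LlavaProjectorSketch(nn.Module):
    """Illustrative sketch of the LLaVA projector path.

    Hidden sizes are assumptions based on the standard recipe:
    CLIP-ViT-L/14 hidden size 1024, Llama-3-8B hidden size 4096.
    """

    def __init__(self, vision_hidden: int = 1024, llm_hidden: int = 4096):
        super().__init__()
        # Two-layer MLP projector mapping visual patch features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_hidden, llm_hidden),
            nn.GELU(),
            nn.Linear(llm_hidden, llm_hidden),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_hidden) from the CLIP encoder.
        # Returns visual tokens ready to be interleaved with text token embeddings.
        return self.projector(patch_features)


# A 336x336 image with 14x14 patches yields a 24x24 grid = 576 patch tokens.
visual_tokens = LlavaProjectorSketch()(torch.randn(1, 576, 1024))
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```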
Core Capabilities
- 72.3% accuracy on MMBench Test (EN)
- 66.4% accuracy on MMBench Test (CN)
- 70.0% accuracy on AI2D Test
- Robust performance across a range of vision-language benchmarks
- Handles both English and Chinese prompts, as reflected in the MMBench EN/CN results
Frequently Asked Questions
Q: What makes this model unique?
This model improves on the earlier v1 release, particularly on the MMBench and AI2D benchmarks. Its combination of the ShareGPT4V-PT and InternVL-SFT datasets yields stronger cross-modal understanding.
Q: What are the recommended use cases?
The model excels in vision-language tasks including visual question answering, image understanding, and multilingual image-text interactions. It's particularly suitable for applications requiring detailed image analysis and natural language responses.
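Building on the inference sketch above, a typical visual question answering call simply swaps in a more detailed instruction. The prompt text below is illustrative and reuses the assumed `processor`, `model`, and `image` objects from the earlier example.

```python
# Reuses `processor`, `model`, and `image` from the inference sketch above (assumed setup).
question = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "<image>\nDescribe this image in detail, then list any text visible in it.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
inputs = processor(images=image, text=question, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```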