llava-llama-3-8b-v1_1

Maintained By: xtuner

  • Parameter Count: 8.03B
  • Model Type: Image-Text-to-Text
  • Architecture: LLaVA with CLIP-ViT-Large
  • Tensor Type: FP16

What is llava-llama-3-8b-v1_1?

llava-llama-3-8b-v1_1 is a multimodal model that combines Meta's Meta-Llama-3-8B-Instruct language model with a CLIP-ViT-Large visual encoder. It is fine-tuned on the ShareGPT4V-PT and InternVL-SFT datasets and is designed for complex image-text interactions such as visual question answering.

Implementation Details

The model pairs a CLIP-ViT-Large visual encoder (336x336 input resolution) with an MLP projector that maps visual features into the language model's embedding space. Training follows a two-stage recipe: the LLM and ViT are frozen during pretraining, then the full LLM is trained with LoRA applied to the ViT during fine-tuning. A minimal sketch of this composition follows the list below.

  • Visual Encoder: CLIP-ViT-Large-patch14-336
  • Base Model: meta-llama/Meta-Llama-3-8B-Instruct
  • Training Strategy: Full LLM with LoRA ViT fine-tuning
  • Training Data: 1246K pretraining samples (ShareGPT4V-PT) + 1268K fine-tuning samples (InternVL-SFT)
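
To make the composition concrete, here is a minimal PyTorch sketch of the LLaVA-style data flow: patch features from the frozen ViT pass through an MLP projector into the LLM's embedding space and are concatenated with the text embeddings. The layer sizes and module structure are illustrative assumptions, not the actual xtuner implementation.

```python
import torch
import torch.nn as nn

VISION_DIM = 1024                # CLIP-ViT-Large hidden size
LLM_DIM = 4096                   # Llama-3-8B hidden size
NUM_PATCHES = (336 // 14) ** 2   # 576 patches at 336x336 with 14x14 patches

class MLPProjector(nn.Module):
    """Maps visual patch features into the LLM embedding space (illustrative)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.net(patch_features)

# Stand-ins for the ViT output and the embedded text prompt.
patch_features = torch.randn(1, NUM_PATCHES, VISION_DIM)  # from CLIP-ViT-Large
text_embeds = torch.randn(1, 32, LLM_DIM)                 # from the Llama-3 embedding layer

projector = MLPProjector(VISION_DIM, LLM_DIM)
visual_tokens = projector(patch_features)                 # (1, 576, 4096)

# Projected visual tokens and text embeddings form one sequence for the LLM.
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 608, 4096])
```

In the training recipe listed above, only the projector is updated during pretraining (LLM and ViT frozen); fine-tuning then trains the full LLM and attaches LoRA adapters to the ViT.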

Core Capabilities

  • 72.3% accuracy on MMBench Test (EN)
  • 66.4% accuracy on MMBench Test (CN)
  • 70.0% accuracy on AI2D Test
  • Robust performance across multiple vision-language tasks
  • Enhanced multilingual capabilities

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its improved performance metrics compared to previous versions, particularly in MMBench and AI2D tests. It leverages a unique combination of ShareGPT4V-PT and InternVL-SFT datasets, resulting in better cross-modal understanding.

Q: What are the recommended use cases?

The model excels in vision-language tasks including visual question answering, image understanding, and multilingual image-text interactions. It's particularly suitable for applications requiring detailed image analysis and natural language responses.
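
For inference, xtuner also publishes HuggingFace-format weights that work with the standard transformers LLaVA classes. The sketch below assumes the xtuner/llava-llama-3-8b-v1_1-transformers repo id, a placeholder image URL, and a Llama-3 chat-style prompt; verify these details against the model card before use.

```python
# Hedged sketch: visual question answering with the assumed HF-format weights.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"  # assumed HF-format variant

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; replace with your own input.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

# Llama-3 chat-style prompt with an <image> placeholder for the visual tokens.
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "<image>\nWhat is shown in this image?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```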
