# InternVL2-4B
| Property | Value |
|---|---|
| Parameter Count | 4.15B |
| Model Type | Multimodal LLM |
| License | MIT |
| Paper | InternVL Paper |
| Architecture | InternViT-300M-448px + MLP Projector + Phi-3-mini-128k-instruct |
## What is InternVL2-4B?
InternVL2-4B is a state-of-the-art multimodal large language model that combines vision and language capabilities in a single model. It is part of the InternVL 2.0 series, was trained with an 8k context window, and delivers strong performance across a wide range of vision-language tasks.
## Implementation Details
The model integrates InternViT-300M-448px for vision encoding, an MLP projector for vision-language alignment, and Phi-3-mini-128k-instruct as the language model. It supports 16-bit (bf16/fp16) inference as well as 4-bit and 8-bit quantization for efficient deployment; see the loading sketch after the list below.
- Trained with an 8k context window for enhanced long-form processing
- Support for multiple image inputs and video processing
- Advanced OCR and document understanding capabilities
- Multilingual support with strong performance in both English and Chinese
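The following is a minimal loading sketch, assuming the `OpenGVLab/InternVL2-4B` repository on Hugging Face and the standard `transformers` + `bitsandbytes` stack; InternVL2 ships custom modeling code, so the exact remote-code interface may differ from what is shown here.

```python
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "OpenGVLab/InternVL2-4B"  # assumed Hugging Face repo id

# 16-bit (bf16) inference; InternVL2 uses custom modeling code,
# so trust_remote_code=True is required.
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Alternatively, 4-bit quantization via bitsandbytes for tighter memory budgets.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_4bit = AutoModel.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
).eval()
```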
## Core Capabilities
- Document and Chart Comprehension (89.2% on DocVQA test)
- Scene Text Understanding (74.4% on TextVQA)
- Visual Reasoning (78.6% on MMBench-EN)
- Mathematical Problem Solving (58.6% on MathVista)
- Video Understanding (63.7% on MVBench)
- Visual Grounding (84.4% average on RefCOCO benchmarks)
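To make these capabilities concrete, here is a hedged single-image VQA sketch. It assumes the model exposes a `chat` helper through its remote code (as the public InternVL2 model cards do) and uses standard ImageNet normalization at 448px, matching the InternViT-300M-448px encoder; the official pipeline also tiles large images dynamically, so treat this as an illustration rather than the canonical recipe. `model` and `tokenizer` come from the loading sketch above.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# 448px input with ImageNet statistics -- an assumption matching the
# InternViT-300M-448px encoder; the official pipeline also tiles images.
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

image = Image.open("document_page.png").convert("RGB")
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

# `model.chat` is provided by the model's remote code; the `<image>`
# placeholder marks where the visual tokens are inserted into the prompt.
question = "<image>\nSummarize the key figures in this document."
response = model.chat(tokenizer, pixel_values, question,
                      dict(max_new_tokens=512))
print(response)
```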
## Frequently Asked Questions
Q: What makes this model unique?
InternVL2-4B stands out for its balanced architecture and strong performance across diverse tasks, particularly document understanding and OCR, while keeping a relatively compact 4.15B-parameter footprint.
Q: What are the recommended use cases?
The model excels in document analysis, chart interpretation, visual question answering, and video understanding tasks. It's particularly suitable for applications requiring strong multimodal reasoning and document processing capabilities.