# NVLM-D-72B

| Property | Value |
|---|---|
| Parameter Count | 79.4B |
| Model Type | Multimodal LLM |
| Architecture | Decoder-Only Transformer |
| License | CC-BY-NC-4.0 |
| Paper | arXiv:2409.11402 |
## What is NVLM-D-72B?

NVLM-D-72B is a state-of-the-art multimodal large language model developed by NVIDIA. Built on a decoder-only architecture, it processes both text and images, handling tasks that range from optical character recognition to complex multimodal reasoning and coding.
## Implementation Details

The model is implemented in PyTorch and supports a maximum context length of 128K tokens. It is built on the Qwen2-72B-Instruct LLM backbone and the InternViT-6B-448px-V1-5 vision encoder, and is designed for multi-GPU deployment and efficient inference.

- Supports multi-GPU deployment with load balancing across devices (see the loading sketch after this list)
- Uses BFloat16 precision for efficient computation
- Includes dynamic high-resolution image preprocessing (tiling)
- Implements advanced vision-language processing techniques
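A minimal loading sketch is shown below. It assumes the checkpoint is published on Hugging Face as `nvidia/NVLM-D-72B` with bundled remote modeling code, and uses `device_map="auto"` to shard the weights across available GPUs; the official model card may instead recommend a hand-built per-layer device map for more even load balancing.

```python
# Minimal loading sketch (assumptions: checkpoint name "nvidia/NVLM-D-72B",
# remote code shipped with the checkpoint). Not the official recipe.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "nvidia/NVLM-D-72B"  # assumed Hugging Face checkpoint name

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # BFloat16 weights, as noted above
    low_cpu_mem_usage=True,
    trust_remote_code=True,       # loads the custom NVLM-D modeling code
    device_map="auto",            # shard the ~79B parameters across GPUs
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID, trust_remote_code=True, use_fast=False
)
```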
## Core Capabilities

- Vision-language tasks with state-of-the-art performance
- Advanced OCR, with a score of 852 on OCRBench
- Strong mathematical reasoning (73.1% on the MATH benchmark)
- Excellent visual question answering (85.4% on VQAv2)
- Document and chart understanding, with 92.6% accuracy on DocVQA (a hedged inference sketch follows this list)
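As a rough illustration of how a document-understanding query might be issued, the sketch below tiles an image into 448×448 crops (a simplified stand-in for the dynamic preprocessing mentioned under Implementation Details) and calls an InternVL-style `chat()` helper, reusing `model` and `tokenizer` from the loading sketch above. The fixed tile grid, normalization constants, image path, and `chat()` signature are all assumptions; defer to the checkpoint's own preprocessing and remote code.

```python
# Hypothetical inference sketch: simplified image tiling plus an assumed
# InternVL-style chat() call. Consult the model card for the real API.
import torch
import torchvision.transforms as T
from PIL import Image

def load_image_tiles(path, tile=448, grid=(2, 2)):
    """Resize the image to a fixed grid of 448x448 crops and stack them as a batch.
    The real dynamic preprocessing picks an aspect-ratio-aware grid and adds a
    thumbnail tile; this fixed grid is only for illustration."""
    cols, rows = grid
    img = Image.open(path).convert("RGB").resize((cols * tile, rows * tile))
    to_tensor = T.Compose([
        T.ToTensor(),
        T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ])
    tiles = [
        to_tensor(img.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)))
        for r in range(rows) for c in range(cols)
    ]
    return torch.stack(tiles)

# "invoice.png" is a placeholder path for a document image.
pixel_values = load_image_tiles("invoice.png").to(torch.bfloat16).cuda()
question = "<image>\nWhat is the total amount due on this invoice?"

# chat() is assumed to be provided by the checkpoint's trust_remote_code
# modeling file; the exact signature may differ.
response = model.chat(
    tokenizer, pixel_values, question, dict(max_new_tokens=256, do_sample=False)
)
print(response)
```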
## Frequently Asked Questions
Q: What makes this model unique?
NVLM-D-72B stands out for its improved text-only performance over its LLM backbone after multimodal training, and its ability to rival leading proprietary models like GPT-4V in various vision-language tasks.
Q: What are the recommended use cases?
The model excels in document analysis, visual reasoning, OCR tasks, mathematical problem-solving, and general vision-language applications. It's particularly well-suited for enterprise applications requiring both visual and textual understanding.