# NVLM-D-72B

| Property | Value |
|---|---|
| Parameter Count | 79.4B |
| Model Type | Multimodal LLM |
| Architecture | Decoder-Only Transformer |
| License | CC-BY-NC-4.0 |
| Paper | arXiv:2409.11402 |
## What is NVLM-D-72B?

NVLM-D-72B is a state-of-the-art multimodal large language model developed by NVIDIA. Built on a decoder-only architecture, it processes both text and images, handling tasks that range from optical character recognition to complex multimodal reasoning and coding.
## Implementation Details

The model is implemented in PyTorch and supports a maximum context length of 128K tokens. It is built on the Qwen2-72B-Instruct LLM backbone and the InternViT-6B-448px-V1-5 vision encoder, and is designed for multi-GPU deployment and efficient inference.

- Supports multi-GPU deployment with load balancing across devices (see the loading sketch after this list)
- Uses BFloat16 precision for efficient computation
- Includes dynamic high-resolution image preprocessing (tiling)
- Implements advanced vision-language processing techniques
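A minimal loading sketch is shown below. It assumes the checkpoint is published on Hugging Face as `nvidia/NVLM-D-72B` with bundled remote modeling code, and uses `device_map="auto"` to shard the weights across available GPUs; the official model card may instead recommend a hand-built per-layer device map for more even load balancing.

```python
# Minimal loading sketch (assumptions: checkpoint name "nvidia/NVLM-D-72B",
# remote code shipped with the checkpoint). Not the official recipe.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "nvidia/NVLM-D-72B"  # assumed Hugging Face checkpoint name

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # BFloat16 weights, as noted above
    low_cpu_mem_usage=True,
    trust_remote_code=True,       # loads the custom NVLM-D modeling code
    device_map="auto",            # shard the ~79B parameters across GPUs
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID, trust_remote_code=True, use_fast=False
)
```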
## Core Capabilities

- Vision-language tasks with state-of-the-art performance
- Advanced OCR, with a score of 852 on OCRBench
- Strong mathematical reasoning (73.1% on the MATH benchmark)
- Excellent visual question answering (85.4% on VQAv2)
- Document and chart understanding, with 92.6% accuracy on DocVQA (a hedged inference sketch follows this list)
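As a rough illustration of how a document-understanding query might be issued, the sketch below tiles an image into 448×448 crops (a simplified stand-in for the dynamic preprocessing mentioned under Implementation Details) and calls an InternVL-style `chat()` helper, reusing `model` and `tokenizer` from the loading sketch above. The fixed tile grid, normalization constants, image path, and `chat()` signature are all assumptions; defer to the checkpoint's own preprocessing and remote code.

```python
# Hypothetical inference sketch: simplified image tiling plus an assumed
# InternVL-style chat() call. Consult the model card for the real API.
import torch
import torchvision.transforms as T
from PIL import Image

def load_image_tiles(path, tile=448, grid=(2, 2)):
    """Resize the image to a fixed grid of 448x448 crops and stack them as a batch.
    The real dynamic preprocessing picks an aspect-ratio-aware grid and adds a
    thumbnail tile; this fixed grid is only for illustration."""
    cols, rows = grid
    img = Image.open(path).convert("RGB").resize((cols * tile, rows * tile))
    to_tensor = T.Compose([
        T.ToTensor(),
        T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ])
    tiles = [
        to_tensor(img.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)))
        for r in range(rows) for c in range(cols)
    ]
    return torch.stack(tiles)

# "invoice.png" is a placeholder path for a document image.
pixel_values = load_image_tiles("invoice.png").to(torch.bfloat16).cuda()
question = "<image>\nWhat is the total amount due on this invoice?"

# chat() is assumed to be provided by the checkpoint's trust_remote_code
# modeling file; the exact signature may differ.
response = model.chat(
    tokenizer, pixel_values, question, dict(max_new_tokens=256, do_sample=False)
)
print(response)
```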
## Frequently Asked Questions
Q: What makes this model unique?
NVLM-D-72B stands out for its improved text-only performance over its LLM backbone after multimodal training, and its ability to rival leading proprietary models like GPT-4V in various vision-language tasks.
Q: What are the recommended use cases?
The model excels in document analysis, visual reasoning, OCR tasks, mathematical problem-solving, and general vision-language applications. It's particularly well-suited for enterprise applications requiring both visual and textual understanding.