Minitron-8B-Base
| Property | Value |
|---|---|
| Model Size | 8B parameters |
| Developer | NVIDIA |
| License | NVIDIA Open Model License |
| Research Paper | arXiv:2407.14679 |
| Training Period | February 2024 - June 2024 |
What is Minitron-8B-Base?
Minitron-8B-Base is a large language model that NVIDIA obtained by pruning the larger Nemotron-4 15B model and then retraining it with knowledge distillation. Its main appeal is training efficiency: it requires 40x fewer training tokens than training a model of the same size from scratch, while remaining competitive with models such as Mistral 7B and Gemma 7B.
Implementation Details
The model pairs a 4096-dimensional embedding with 48 attention heads and a 16384-dimensional MLP intermediate layer, and uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE); a minimal loading sketch follows the list below.
- Architecture: Transformer Decoder (auto-regressive language model)
- Network Base: Nemotron-4
- Training Data: 94 billion tokens
- Input/Output: Text strings in, text strings out
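The sketch below shows one way to load the model for text-in/text-out inference. It assumes the checkpoint is hosted on Hugging Face under the repo id `nvidia/Minitron-8B-Base` and that the installed `transformers` version supports the Nemotron model family; the dtype and device settings are illustrative choices, not requirements from the card above.

```python
# Minimal inference sketch (assumed repo id and settings, not an official recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Minitron-8B-Base"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights to fit typical GPUs
    device_map="auto",           # place layers on available devices automatically
)

prompt = "The main benefit of pruning and distillation is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```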
Core Capabilities
- MMLU Score: 64.5 (5-shot)
- HellaSwag: 81.6 (zero-shot)
- GSM8K: 54.2 (zero-shot)
- Code Generation: 31.6 (HumanEval pass@1, 0-shot)
- Supports multilingual text generation as well as code generation
Frequently Asked Questions
Q: What makes this model unique?
The model's uniqueness lies in its training efficiency: it achieves performance comparable to models trained from scratch while requiring significantly fewer computational resources. The pruning and distillation process yields a 1.8x compute-cost saving for training the entire model family.
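To make the pruning idea concrete, here is a toy sketch of activation-based width pruning for a single MLP layer: score each intermediate neuron by its mean activation magnitude on a small calibration batch, then keep only the top fraction. This is an illustration under simplified assumptions, not the actual Minitron recipe, which also prunes attention heads and embedding channels and follows pruning with knowledge distillation; all function names and the `keep_ratio` parameter are made up for the example.

```python
# Toy sketch of activation-based width pruning for one MLP layer (illustrative only).
import torch
import torch.nn as nn


def neuron_importance(up_proj: nn.Linear, calib_inputs: torch.Tensor) -> torch.Tensor:
    """Score each intermediate neuron by its mean absolute activation."""
    with torch.no_grad():
        acts = torch.relu(up_proj(calib_inputs))  # (batch, intermediate_dim)
        return acts.abs().mean(dim=0)             # one importance score per neuron


def prune_mlp(up_proj: nn.Linear, down_proj: nn.Linear,
              calib_inputs: torch.Tensor, keep_ratio: float = 0.5):
    """Keep only the highest-scoring neurons of an up_proj -> down_proj MLP."""
    scores = neuron_importance(up_proj, calib_inputs)
    keep = scores.topk(int(keep_ratio * scores.numel())).indices

    pruned_up = nn.Linear(up_proj.in_features, keep.numel(), bias=False)
    pruned_down = nn.Linear(keep.numel(), down_proj.out_features, bias=False)
    pruned_up.weight.data = up_proj.weight.data[keep]          # select output rows
    pruned_down.weight.data = down_proj.weight.data[:, keep]   # select input columns
    return pruned_up, pruned_down


# Example: shrink a 4096 -> 16384 -> 4096 MLP (Minitron-8B-like dimensions) by half.
up = nn.Linear(4096, 16384, bias=False)
down = nn.Linear(16384, 4096, bias=False)
calib = torch.randn(32, 4096)  # stand-in for real calibration activations
new_up, new_down = prune_mlp(up, down, calib, keep_ratio=0.5)
print(new_up.weight.shape, new_down.weight.shape)
```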
Q: What are the recommended use cases?
The model is designed for research and development purposes, excelling in tasks like language understanding, code generation, and general text generation. However, users should be aware of potential limitations regarding toxic content and societal biases.