Minitron-8B-Base
| Property | Value |
|---|---|
| Model Size | 8B parameters |
| Developer | NVIDIA |
| License | NVIDIA Open Model License |
| Research Paper | arXiv:2407.14679 |
| Training Period | February 2024 - June 2024 |
What is Minitron-8B-Base?
Minitron-8B-Base is a large language model that NVIDIA obtained by pruning the larger Nemotron-4 15B model and then retraining it with knowledge distillation. Its main appeal is training efficiency: it requires 40x fewer training tokens than training a model of the same size from scratch, while remaining competitive with models such as Mistral 7B and Gemma 7B.
Implementation Details
The model pairs a 4096-dimensional embedding with 48 attention heads and a 16384-dimensional MLP intermediate layer, and uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE); a minimal loading sketch follows the list below.
- Architecture: Transformer Decoder (auto-regressive language model)
- Network Base: Nemotron-4
- Training Data: 94 billion tokens
- Input/Output: Text strings in, text strings out
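The sketch below shows one way to load the model for text-in/text-out inference. It assumes the checkpoint is hosted on Hugging Face under the repo id `nvidia/Minitron-8B-Base` and that the installed `transformers` version supports the Nemotron model family; the dtype and device settings are illustrative choices, not requirements from the card above.

```python
# Minimal inference sketch (assumed repo id and settings, not an official recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Minitron-8B-Base"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights to fit typical GPUs
    device_map="auto",           # place layers on available devices automatically
)

prompt = "The main benefit of pruning and distillation is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```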
Core Capabilities
- MMLU Score: 64.5 (5-shot)
- HellaSwag: 81.6 (zero-shot)
- GSM8K: 54.2 (zero-shot)
- Code Generation: 31.6 (HumanEval pass@1, 0-shot)
- Supports multilingual text generation as well as code generation
Frequently Asked Questions
Q: What makes this model unique?
The model's uniqueness lies in its training efficiency: it achieves performance comparable to models trained from scratch while requiring significantly fewer computational resources. The pruning and distillation process yields a 1.8x compute-cost saving for training the entire model family.
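To make the pruning idea concrete, here is a toy sketch of activation-based width pruning for a single MLP layer: score each intermediate neuron by its mean activation magnitude on a small calibration batch, then keep only the top fraction. This is an illustration under simplified assumptions, not the actual Minitron recipe, which also prunes attention heads and embedding channels and follows pruning with knowledge distillation; all function names and the `keep_ratio` parameter are made up for the example.

```python
# Toy sketch of activation-based width pruning for one MLP layer (illustrative only).
import torch
import torch.nn as nn


def neuron_importance(up_proj: nn.Linear, calib_inputs: torch.Tensor) -> torch.Tensor:
    """Score each intermediate neuron by its mean absolute activation."""
    with torch.no_grad():
        acts = torch.relu(up_proj(calib_inputs))  # (batch, intermediate_dim)
        return acts.abs().mean(dim=0)             # one importance score per neuron


def prune_mlp(up_proj: nn.Linear, down_proj: nn.Linear,
              calib_inputs: torch.Tensor, keep_ratio: float = 0.5):
    """Keep only the highest-scoring neurons of an up_proj -> down_proj MLP."""
    scores = neuron_importance(up_proj, calib_inputs)
    keep = scores.topk(int(keep_ratio * scores.numel())).indices

    pruned_up = nn.Linear(up_proj.in_features, keep.numel(), bias=False)
    pruned_down = nn.Linear(keep.numel(), down_proj.out_features, bias=False)
    pruned_up.weight.data = up_proj.weight.data[keep]          # select output rows
    pruned_down.weight.data = down_proj.weight.data[:, keep]   # select input columns
    return pruned_up, pruned_down


# Example: shrink a 4096 -> 16384 -> 4096 MLP (Minitron-8B-like dimensions) by half.
up = nn.Linear(4096, 16384, bias=False)
down = nn.Linear(16384, 4096, bias=False)
calib = torch.randn(32, 4096)  # stand-in for real calibration activations
new_up, new_down = prune_mlp(up, down, calib, keep_ratio=0.5)
print(new_up.weight.shape, new_down.weight.shape)
```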
Q: What are the recommended use cases?
The model is designed for research and development purposes, excelling in tasks like language understanding, code generation, and general text generation. However, users should be aware of potential limitations regarding toxic content and societal biases.