Mistral-NeMo-12B-Instruct
| Property | Value |
| --- | --- |
| Parameter Count | 12 Billion |
| License | Apache 2.0 |
| Context Window | 128k tokens |
| Architecture | Transformer Decoder |
| Training Period | June 2024 - July 2024 |
What is Mistral-NeMo-12B-Instruct?
Mistral-NeMo-12B-Instruct is a large language model (LLM) developed through a collaboration between NVIDIA and Mistral AI. This 12B-parameter model offers multilingual capabilities and supports FP8 quantization without accuracy loss.
Implementation Details
The model uses an architecture of 40 layers with a model dimension of 5,120 and 32 attention heads. It employs Grouped-Query Attention (GQA) with 8 key-value heads and SwiGLU activation functions, along with rotary position embeddings using a theta value of 1M and a vocabulary of approximately 128,000 tokens.
- 40 transformer layers with a model dimension of 5,120
- 32 attention heads with a head dimension of 128
- Feed-forward (hidden) dimension of 14,336
- SwiGLU activation function
- Grouped-Query Attention with 8 KV heads (see the sketch after this list)
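To make the attention layout concrete, here is a minimal PyTorch sketch of Grouped-Query Attention with these dimensions: 32 query heads sharing 8 key/value heads, each with a head dimension of 128. It illustrates the shape bookkeeping only; the projection layers and tensor sizes are illustrative, not the model's actual implementation.

```python
import torch

# Illustrative GQA shapes; not the model's actual code.
hidden_dim = 5120   # model dimension
n_heads = 32        # query heads
n_kv_heads = 8      # key/value heads (GQA)
head_dim = 128      # per-head dimension

# Queries project to all 32 heads, keys/values to only 8.
q_proj = torch.nn.Linear(hidden_dim, n_heads * head_dim, bias=False)
k_proj = torch.nn.Linear(hidden_dim, n_kv_heads * head_dim, bias=False)
v_proj = torch.nn.Linear(hidden_dim, n_kv_heads * head_dim, bias=False)

x = torch.randn(1, 16, hidden_dim)                 # (batch, seq, dim)
q = q_proj(x).view(1, 16, n_heads, head_dim)
k = k_proj(x).view(1, 16, n_kv_heads, head_dim)
v = v_proj(x).view(1, 16, n_kv_heads, head_dim)

# Each group of 4 query heads (32 / 8) shares one KV head.
k = k.repeat_interleave(n_heads // n_kv_heads, dim=2)
v = v.repeat_interleave(n_heads // n_kv_heads, dim=2)

attn = torch.nn.functional.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
)
print(attn.shape)  # torch.Size([1, 32, 16, 128])
```

Sharing KV heads across query-head groups shrinks the key/value cache, which is what makes the 128k context window practical to serve.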
Core Capabilities
- Multilingual support with emphasis on English language tasks
- 128k context window for handling long-form content
- FP8 quantization support for efficient deployment
- Strong performance metrics (MT Bench: 7.84, MixEval Hard: 0.534)
- Customizable through NeMo Framework tools
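As a usage illustration, the snippet below loads the instruct checkpoint with Hugging Face Transformers and runs a short chat-style generation. The checkpoint name `mistralai/Mistral-Nemo-Instruct-2407` is assumed from the Hugging Face Hub; adjust it if your deployment uses a different path or the NeMo-format checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face Hub checkpoint name; change if yours differs.
model_id = "mistralai/Mistral-Nemo-Instruct-2407"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Summarize the benefits of a 128k context window."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```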
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for combining a relatively compact 12B-parameter size with strong benchmark performance for its size class. Its FP8 quantization support and 128k context window make it well suited to production deployments.
Q: What are the recommended use cases?
The model is primarily designed for English language chat applications but supports multilingual tasks. It's particularly well-suited for scenarios requiring long context understanding and can be customized using NVIDIA's NeMo Framework for specific use cases through techniques like P-tuning, Adapters, and LoRA.
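The NeMo Framework provides its own PEFT workflows; as a rough illustration of the same idea, the sketch below attaches LoRA adapters using the Hugging Face PEFT library instead. The rank, alpha, and target module names are assumptions chosen for illustration, not prescribed settings.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed Hugging Face checkpoint name, as in the earlier example.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")

# Low-rank adapters on the attention projections; values are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable
```

Because only the small adapter matrices are updated, this kind of fine-tuning fits on far less GPU memory than full-parameter training of the 12B model.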