Llama-3.2-3B-Instruct-FP8
| Property | Value |
|---|---|
| Parameter Count | 3.61B |
| Model Type | Instruction-tuned Language Model |
| Architecture | Llama 3 |
| License | Llama 3.2 Community License |
| Supported Languages | 8 (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai) |
What is Llama-3.2-3B-Instruct-FP8?
Llama-3.2-3B-Instruct-FP8 is an optimized version of Meta's Llama-3.2-3B-Instruct model in which both weights and activations are quantized to FP8. Halving the bits per value reduces GPU memory requirements by approximately 50% (for example, weights at 8 bits per parameter occupy roughly half the memory of the 16-bit original), while the model recovers 99.7% of the original model's scores across various benchmarks.
Implementation Details
The quantization targets the linear operators within the transformer blocks: weights use symmetric static per-channel quantization, activations use symmetric per-tensor quantization, and both are applied through the llm-compressor library (a sketch of the recipe follows the list below).
- Weight precision reduced from 16 bits to 8 bits per parameter
- Approximately 50% reduction in GPU memory usage
- Roughly 2x matrix-multiply compute throughput on hardware with native FP8 support
- Calibrated using 512 sequences from Neural Magic's LLM compression dataset
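As a rough sketch of this recipe (not the exact production configuration), the following uses llm-compressor's `oneshot` entry point with an FP8 `QuantizationModifier`. The calibration dataset id and its `text` column, the `FP8` scheme string, and the sequence length are assumptions drawn from the description above; the per-channel weight scheme described here may require a custom compressed-tensors config rather than the `FP8` preset.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
NUM_SAMPLES, MAX_LEN = 512, 4096

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: 512 sequences, tokenized to a fixed maximum length.
# Dataset id and "text" column are assumptions based on the card text.
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_SAMPLES))
ds = ds.map(
    lambda batch: tokenizer(batch["text"], max_length=MAX_LEN, truncation=True),
    batched=True,
    remove_columns=ds.column_names,
)

# FP8 quantization of weights and activations for every Linear operator
# inside the transformer blocks; lm_head stays in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

# One-shot calibration pass: computes static activation scales on the
# calibration set, then applies the quantization recipe to the model.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("Llama-3.2-3B-Instruct-FP8", save_compressed=True)
tokenizer.save_pretrained("Llama-3.2-3B-Instruct-FP8")
```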
Core Capabilities
- Multilingual support across the 8 languages listed above
- Assistant-style chat functionality
- Benchmark performance: 62.61% on MMLU (5-shot), 77.86% on GSM-8K
- Efficient deployment through the vLLM backend (see the example below)
- Optimized for commercial and research applications
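As a minimal offline-inference sketch with vLLM (the Hugging Face repo id below is an assumption; substitute the actual checkpoint path):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id for the quantized checkpoint.
MODEL_ID = "neuralmagic/Llama-3.2-3B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
llm = LLM(model=MODEL_ID)  # vLLM picks up the FP8 config from the checkpoint
sampling = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

# Assistant-style chat: format the turn with the Llama 3.2 chat template.
messages = [{"role": "user", "content": "Summarize FP8 quantization in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```

Because the quantization parameters are stored in the checkpoint, no extra flags are needed beyond the model id.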
Frequently Asked Questions
Q: What makes this model unique?
The model's primary distinction lies in its efficient FP8 quantization, which significantly reduces resource requirements while maintaining near-original performance. This makes it particularly valuable for deployment scenarios where computational resources are constrained.
Q: What are the recommended use cases?
The model is well suited to commercial and research applications that require multilingual capabilities. It excels in assistant-style chat and can be served in production with the vLLM backend, for example behind vLLM's OpenAI-compatible server, as sketched below.
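As a sketch of that production path, assuming an OpenAI-compatible vLLM server is already running locally on port 8000 (and again assuming the repo id), any standard OpenAI client can query the model:

```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running vLLM server.
# Assumes a vLLM OpenAI-compatible server is serving the assumed repo id
# neuralmagic/Llama-3.2-3B-Instruct-FP8 on localhost:8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-3B-Instruct-FP8",  # assumed repo id
    messages=[
        {"role": "system", "content": "You are a helpful multilingual assistant."},
        {"role": "user", "content": "En une phrase : qu'est-ce que la quantification FP8 ?"},
    ],
    temperature=0.6,
    max_tokens=128,
)
print(response.choices[0].message.content)
```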