# Meta-Llama-3.1-70B-Instruct-FP8
| Property | Value |
|---|---|
| Parameter Count | 70.6B |
| License | llama3.1 |
| Supported Languages | 8 (en, de, fr, it, pt, hi, es, th) |
| Quantization | FP8 (8-bit weights and activations) |
| Release Date | July 23, 2024 |
## What is Meta-Llama-3.1-70B-Instruct-FP8?
Meta-Llama-3.1-70B-Instruct-FP8 is an optimized version of the original Meta-Llama-3.1-70B-Instruct model that applies FP8 quantization to both weights and activations. This reduces the model's disk size and GPU memory requirements by approximately 50% while recovering 99.88% of the original model's average benchmark score.
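As a back-of-the-envelope check of the 50% figure (weights only; KV cache, activations, and framework overhead are not counted):

```python
# Approximate weight memory for 70.6B parameters at different precisions.
params = 70.6e9

bf16_gb = params * 2 / 1e9  # 16-bit baseline: 2 bytes/param, ~141 GB
fp8_gb = params * 1 / 1e9   # FP8: 1 byte/param, ~71 GB

print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB "
      f"({fp8_gb / bf16_gb:.0%} of baseline)")
```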
## Implementation Details
The model employs symmetric per-tensor quantization on the linear operators within the transformer blocks, produced with LLM Compressor using calibration samples from UltraChat. It achieves an average score of 84.29 on the OpenLLM benchmark, compared to 84.40 for the original model. A sketch of this workflow appears after the list below.
- Weight and activation quantization using FP8 data type
- Optimized for vLLM deployment
- 50% reduction in memory footprint
- Calibrated using 512 sequences from UltraChat
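The exact production recipe is not shown here, but a minimal sketch of a comparable FP8-static workflow with LLM Compressor might look like the following. The dataset split, sequence length, and argument values are assumptions drawn from typical LLM Compressor examples, not the exact script behind this checkpoint:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3.1-70B-Instruct"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 512 calibration conversations from UltraChat, rendered through the chat
# template and tokenized so activation scales see realistic chat inputs.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQUENCE_LENGTH,
                         truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# "FP8" preset: static, symmetric per-tensor scales for the weights and
# activations of Linear layers; lm_head is kept in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir="Meta-Llama-3.1-70B-Instruct-FP8",
)
```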
## Core Capabilities
- Multi-language support across 8 languages
- Assistant-like chat functionality
- High performance on key benchmarks (MMLU, ARC-Challenge, GSM-8K)
- Efficient deployment on the vLLM backend (see the serving example below)
- Suitable for both commercial and research use
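For serving, a minimal offline-inference sketch with vLLM might look like this. The Hugging Face repo id and `tensor_parallel_size` are assumptions; adjust both to your environment:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id for this checkpoint; adjust if yours differs.
MODEL_ID = "neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8"

# Render a chat message through the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
messages = [{"role": "user", "content": "Summarize FP8 quantization in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# tensor_parallel_size=2 assumes two GPUs; tune for your hardware.
llm = LLM(model=MODEL_ID, tensor_parallel_size=2)
outputs = llm.generate([prompt], SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256))
print(outputs[0].outputs[0].text)
```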
## Frequently Asked Questions
**Q: What makes this model unique?**
A: This model stands out for achieving nearly identical quality to its full-precision counterpart while requiring roughly half the GPU memory and disk space, thanks to FP8 quantization. It recovers over 99.8% of the baseline's average score on major benchmarks while being far more deployment-friendly.
**Q: What are the recommended use cases?**
A: The model is designed for commercial and research applications that need assistant-like chat across its eight supported languages. It is particularly suitable for deployments where GPU memory is constrained but near-baseline quality is still required.