Meta-Llama-3-8B-Instruct-FP8
| Property | Value |
|---|---|
| Parameter Count | 8.03B |
| Model Type | Instruction-tuned Language Model |
| License | Llama 3 |
| Quantization | FP8 |
| Language | English |
What is Meta-Llama-3-8B-Instruct-FP8?
Meta-Llama-3-8B-Instruct-FP8 is a quantized version of Meta-Llama-3-8B-Instruct, optimized for efficient deployment while maintaining near-original quality. FP8 quantization cuts the model's memory footprint roughly in half while recovering 99.28% of the unquantized model's accuracy.
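The halving follows directly from the data types: 8.03B parameters at 2 bytes each in FP16/BF16 come to roughly 16.1 GB of weights, versus roughly 8.0 GB at 1 byte each in FP8. This is a back-of-envelope estimate; layers left unquantized and runtime overhead shift the exact figure slightly.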
Implementation Details
The model applies symmetric per-tensor quantization to the linear operators inside the transformer blocks, using AutoFP8 with calibration samples from UltraChat; a reproduction sketch follows the list below. It is designed for deployment with vLLM >= 0.5.0 and achieves an average score of 68.22 on the OpenLLM benchmark.
- Weight and activation quantization using FP8 data type
- 50% reduction in disk size and GPU memory requirements
- Optimized for vLLM deployment
- Calibrated on 512 sequences drawn from the UltraChat dataset
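To make the recipe concrete, here is a minimal sketch of producing such a checkpoint with AutoFP8's published quantize-and-save flow. The HuggingFaceH4/ultrachat_200k dataset id, shuffle seed, sequence length, and output directory are illustrative assumptions, not the exact production settings.

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
save_dir = "Meta-Llama-3-8B-Instruct-FP8"  # illustrative output directory

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# 512 UltraChat conversations rendered through the Llama 3 chat template
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(512))
texts = [tokenizer.apply_chat_template(row["messages"], tokenize=False) for row in ds]
examples = tokenizer(texts, padding=True, truncation=True, max_length=2048,
                     return_tensors="pt").to("cuda")

# Static, symmetric per-tensor FP8 quantization of weights and activations
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
model = AutoFP8ForCausalLM.from_pretrained(model_id, quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized(save_dir)
```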
Core Capabilities
- Assistant-like chat functionality (a vLLM chat sketch follows this list)
- Maintains high performance across various benchmarks (MMLU, ARC Challenge, GSM-8K)
- Efficient inference with reduced resource requirements
- English language processing and generation
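The following sketch shows the chat path end to end with vLLM (>= 0.5.0, per the requirements above). The Hugging Face repo id and sampling settings are assumptions for illustration; substitute the actual checkpoint path if it differs.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Assumed repo id for this checkpoint
model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)  # vLLM reads the FP8 scheme from the checkpoint config

# Render an assistant-style turn through the Llama 3 chat template
messages = [{"role": "user", "content": "Explain FP8 quantization in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```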
Frequently Asked Questions
Q: What makes this model unique?
A: This model stands out for its FP8 quantization, which significantly reduces resource requirements while maintaining 99.28% of the original model's performance. It is specifically optimized for production deployment with vLLM.
Q: What are the recommended use cases?
A: The model is best suited for commercial and research applications requiring English-language processing, particularly assistant-style chat scenarios where resource efficiency matters. It is designed to handle the same range of tasks as the original model while consuming roughly half the memory.