# Meta-Llama-3.1-70B-Instruct-FP8
| Property | Value |
|---|---|
| Parameter Count | 70.6B |
| License | llama3.1 |
| Supported Languages | 8 (en, de, fr, it, pt, hi, es, th) |
| Quantization | FP8 (8-bit weights and activations) |
| Release Date | July 23, 2024 |
## What is Meta-Llama-3.1-70B-Instruct-FP8?
Meta-Llama-3.1-70B-Instruct-FP8 is an optimized version of the original Meta-Llama-3.1-70B-Instruct model that applies FP8 quantization to both weights and activations. This reduces the model's disk size and GPU memory requirements by approximately 50% while recovering 99.88% of the original model's average benchmark score.
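As a back-of-the-envelope check of the 50% figure (weights only; KV cache, activations, and framework overhead are not counted):

```python
# Approximate weight memory for 70.6B parameters at different precisions.
params = 70.6e9

bf16_gb = params * 2 / 1e9  # 16-bit baseline: 2 bytes/param, ~141 GB
fp8_gb = params * 1 / 1e9   # FP8: 1 byte/param, ~71 GB

print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB "
      f"({fp8_gb / bf16_gb:.0%} of baseline)")
```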
## Implementation Details
The model employs symmetric per-tensor quantization on the linear operators within the transformer blocks, produced with LLM Compressor using calibration samples from UltraChat. It achieves an average score of 84.29 on the OpenLLM benchmark, compared to 84.40 for the original model. A sketch of this workflow appears after the list below.
- Weight and activation quantization using FP8 data type
- Optimized for vLLM deployment
- 50% reduction in memory footprint
- Calibrated using 512 sequences from UltraChat
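The exact production recipe is not shown here, but a minimal sketch of a comparable FP8-static workflow with LLM Compressor might look like the following. The dataset split, sequence length, and argument values are assumptions drawn from typical LLM Compressor examples, not the exact script behind this checkpoint:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3.1-70B-Instruct"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 512 calibration conversations from UltraChat, rendered through the chat
# template and tokenized so activation scales see realistic chat inputs.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQUENCE_LENGTH,
                         truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# "FP8" preset: static, symmetric per-tensor scales for the weights and
# activations of Linear layers; lm_head is kept in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir="Meta-Llama-3.1-70B-Instruct-FP8",
)
```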
## Core Capabilities
- Multi-language support across 8 languages
- Assistant-like chat functionality
- High performance on key benchmarks (MMLU, ARC-Challenge, GSM-8K)
- Efficient deployment on the vLLM backend (see the serving example below)
- Suitable for both commercial and research use
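For serving, a minimal offline-inference sketch with vLLM might look like this. The Hugging Face repo id and `tensor_parallel_size` are assumptions; adjust both to your environment:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id for this checkpoint; adjust if yours differs.
MODEL_ID = "neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8"

# Render a chat message through the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
messages = [{"role": "user", "content": "Summarize FP8 quantization in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# tensor_parallel_size=2 assumes two GPUs; tune for your hardware.
llm = LLM(model=MODEL_ID, tensor_parallel_size=2)
outputs = llm.generate([prompt], SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256))
print(outputs[0].outputs[0].text)
```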
## Frequently Asked Questions
**Q: What makes this model unique?**
A: This model stands out for achieving nearly identical quality to its full-precision counterpart while requiring roughly half the GPU memory and disk space, thanks to FP8 quantization. It recovers over 99.8% of the baseline's average score on major benchmarks while being far more deployment-friendly.
**Q: What are the recommended use cases?**
A: The model is designed for commercial and research applications that need assistant-like chat across its eight supported languages. It is particularly suitable for deployments where GPU memory is constrained but near-baseline quality is still required.