Llama-3.1-70B-Instruct-FP8-KV


Property          Value
Parameter Count   70.6B
License           Llama 3.1
Tensor Type       BF16/F8_E4M3
Author            AMD

What is Llama-3.1-70B-Instruct-FP8-KV?

This is an optimized version of Meta's Llama 3.1 70B Instruct model, produced by AMD with its Quark quantization framework. The model applies FP8 quantization to the weights, the activations, and the KV cache, while keeping accuracy very close to that of the original model.

Implementation Details

The model is quantized with AMD's Quark framework: FP8 symmetric per-tensor quantization is applied to the weights and activations of every linear layer except "lm_head", and the KV cache is stored in FP8 as well. The sketch after this paragraph illustrates the per-tensor arithmetic.
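
To make "FP8 symmetric per-tensor quantization" concrete, here is a minimal pure-PyTorch sketch; it is not Quark's API, and the helper names are illustrative. A single scale, derived from the tensor's absolute maximum, maps values into the E4M3 range. In the released model, activation scales come from offline calibration rather than being computed per batch.

```python
import torch

# Largest representable magnitude of FP8 E4M3 (torch.float8_e4m3fn): 448.0.
F8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max

def fp8_quantize_per_tensor(x: torch.Tensor):
    """Symmetric per-tensor FP8 quantization: one scale for the whole tensor."""
    amax = x.abs().max()              # observed absolute maximum (calibration step)
    scale = amax / F8_E4M3_MAX        # single symmetric scale
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)  # cast rounds into FP8
    return x_fp8, scale

def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original tensor."""
    return x_fp8.to(torch.float32) * scale

w = torch.randn(4096, 4096)
w_fp8, w_scale = fp8_quantize_per_tensor(w)
w_hat = fp8_dequantize(w_fp8, w_scale)
print((w - w_hat).abs().mean())  # small quantization error
```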

  • FP8 symmetric per-tensor quantization for weights and activations
  • Activation scales calibrated on 128 samples from the Pile dataset
  • Compatible with the vLLM backend for efficient deployment (see the example after this list)
  • Perplexity stays close to the original model's (3.8561 vs. 3.7797)
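
A minimal offline-inference sketch with vLLM follows. The model ID, sampling settings, and tensor_parallel_size are assumptions to adapt to your setup; quantization="fp8" and kv_cache_dtype="fp8" are standard vLLM arguments for FP8 weights and an FP8 KV cache.

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint. quantization="fp8" tells vLLM to use the FP8 scales
# shipped with the checkpoint; kv_cache_dtype="fp8" stores keys/values in FP8,
# which is what this model's KV-cache calibration targets.
llm = LLM(
    model="amd/Llama-3.1-70B-Instruct-FP8-KV",  # assumed Hugging Face model ID
    quantization="fp8",
    kv_cache_dtype="fp8",
    tensor_parallel_size=2,  # shard the 70B model across GPUs; adjust to your hardware
)

params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(["Summarize FP8 quantization in two sentences."], params)
print(outputs[0].outputs[0].text)
```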

Core Capabilities

  • Efficient deployment with reduced memory footprint
  • vLLM-compatible execution
  • Maintains high accuracy with minimal perplexity degradation
  • Supports multi-GPU deployment for large-scale applications

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for an FP8 quantization approach that preserves accuracy while cutting memory requirements for the weights and, just as importantly, for the KV cache, making it easier to deploy in production environments. The back-of-the-envelope calculation below shows what the FP8 KV cache saves at long context lengths.
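
To put a number on the KV-cache saving, here is a rough calculation using Llama 3.1 70B's published configuration (80 decoder layers, 8 KV heads under grouped-query attention, head dimension 128); moving the cache from 16-bit to 8-bit halves the per-token footprint.

```python
# Back-of-the-envelope KV-cache sizing for Llama 3.1 70B
# (80 decoder layers, 8 KV heads under GQA, head dim 128).
layers, kv_heads, head_dim = 80, 8, 128
elems_per_token = 2 * layers * kv_heads * head_dim   # K and V: 163,840 values

for name, bytes_per_elem in [("FP16 KV cache", 2), ("FP8 KV cache", 1)]:
    per_token_kib = elems_per_token * bytes_per_elem / 1024
    full_ctx_gib = elems_per_token * bytes_per_elem * 131_072 / 1024**3
    print(f"{name}: {per_token_kib:.0f} KiB/token, "
          f"{full_ctx_gib:.0f} GiB at the 128K context limit")
# FP16: 320 KiB/token, 40 GiB; FP8: 160 KiB/token, 20 GiB
```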

Q: What are the recommended use cases?

The model is ideal for production deployments where memory efficiency is critical but accuracy must stay close to that of the original model. It is particularly suitable for applications that need the capabilities of Llama 3.1 70B with a smaller hardware footprint.
