Llama-3.1-70B-Instruct-FP8-KV


Property          Value
Parameter Count   70.6B
License           Llama 3.1
Tensor Type       BF16/F8_E4M3
Author            AMD

What is Llama-3.1-70B-Instruct-FP8-KV?

This is an optimized version of Meta's Llama 3.1 70B Instruct model, produced by AMD with its Quark quantization framework. The model applies FP8 quantization to the weights, the activations, and the KV cache, while keeping accuracy very close to that of the original model.

Implementation Details

The model is quantized with AMD's Quark framework: FP8 symmetric per-tensor quantization is applied to the weights and activations of every linear layer except "lm_head", and the KV cache is stored in FP8 as well. The sketch after this paragraph illustrates the per-tensor arithmetic.
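
To make "FP8 symmetric per-tensor quantization" concrete, here is a minimal pure-PyTorch sketch; it is not Quark's API, and the helper names are illustrative. A single scale, derived from the tensor's absolute maximum, maps values into the E4M3 range. In the released model, activation scales come from offline calibration rather than being computed per batch.

```python
import torch

# Largest representable magnitude of FP8 E4M3 (torch.float8_e4m3fn): 448.0.
F8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max

def fp8_quantize_per_tensor(x: torch.Tensor):
    """Symmetric per-tensor FP8 quantization: one scale for the whole tensor."""
    amax = x.abs().max()              # observed absolute maximum (calibration step)
    scale = amax / F8_E4M3_MAX        # single symmetric scale
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)  # cast rounds into FP8
    return x_fp8, scale

def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original tensor."""
    return x_fp8.to(torch.float32) * scale

w = torch.randn(4096, 4096)
w_fp8, w_scale = fp8_quantize_per_tensor(w)
w_hat = fp8_dequantize(w_fp8, w_scale)
print((w - w_hat).abs().mean())  # small quantization error
```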

  • FP8 symmetric per-tensor quantization for weights and activations
  • Activation scales calibrated on 128 samples from the Pile dataset
  • Compatible with the vLLM backend for efficient deployment (see the example after this list)
  • Perplexity stays close to the original model's (3.8561 vs. 3.7797)
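
A minimal offline-inference sketch with vLLM follows. The model ID, sampling settings, and tensor_parallel_size are assumptions to adapt to your setup; quantization="fp8" and kv_cache_dtype="fp8" are standard vLLM arguments for FP8 weights and an FP8 KV cache.

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint. quantization="fp8" tells vLLM to use the FP8 scales
# shipped with the checkpoint; kv_cache_dtype="fp8" stores keys/values in FP8,
# which is what this model's KV-cache calibration targets.
llm = LLM(
    model="amd/Llama-3.1-70B-Instruct-FP8-KV",  # assumed Hugging Face model ID
    quantization="fp8",
    kv_cache_dtype="fp8",
    tensor_parallel_size=2,  # shard the 70B model across GPUs; adjust to your hardware
)

params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(["Summarize FP8 quantization in two sentences."], params)
print(outputs[0].outputs[0].text)
```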

Core Capabilities

  • Efficient deployment with reduced memory footprint
  • vLLM-compatible execution
  • Maintains high accuracy with minimal perplexity degradation
  • Supports multi-GPU deployment for large-scale applications

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for an FP8 quantization approach that preserves accuracy while cutting memory requirements for the weights and, just as importantly, for the KV cache, making it easier to deploy in production environments. The back-of-the-envelope calculation below shows what the FP8 KV cache saves at long context lengths.
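
To put a number on the KV-cache saving, here is a rough calculation using Llama 3.1 70B's published configuration (80 decoder layers, 8 KV heads under grouped-query attention, head dimension 128); moving the cache from 16-bit to 8-bit halves the per-token footprint.

```python
# Back-of-the-envelope KV-cache sizing for Llama 3.1 70B
# (80 decoder layers, 8 KV heads under GQA, head dim 128).
layers, kv_heads, head_dim = 80, 8, 128
elems_per_token = 2 * layers * kv_heads * head_dim   # K and V: 163,840 values

for name, bytes_per_elem in [("FP16 KV cache", 2), ("FP8 KV cache", 1)]:
    per_token_kib = elems_per_token * bytes_per_elem / 1024
    full_ctx_gib = elems_per_token * bytes_per_elem * 131_072 / 1024**3
    print(f"{name}: {per_token_kib:.0f} KiB/token, "
          f"{full_ctx_gib:.0f} GiB at the 128K context limit")
# FP16: 320 KiB/token, 40 GiB; FP8: 160 KiB/token, 20 GiB
```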

Q: What are the recommended use cases?

The model is ideal for production deployments where memory efficiency is critical but accuracy must stay close to that of the original model. It is particularly suitable for applications that need the capabilities of Llama 3.1 70B with a smaller hardware footprint.
