# Llama-3.1-8B-Instruct-FP8-KV
| Property | Value |
|---|---|
| Parameter Count | 8.03B |
| Model Type | Instruction-tuned LLM |
| License | Llama 3.1 |
| Quantization | FP8 symmetric per-tensor |
| Base Model | Meta-Llama-3.1-8B-Instruct |
## What is Llama-3.1-8B-Instruct-FP8-KV?
This is a quantized version of Meta's Llama 3.1 8B instruction-tuned model, produced with AMD's Quark framework. FP8 quantization reduces the memory footprint while preserving accuracy: perplexity on the wikitext2 benchmark rises only slightly, from 7.2169 to 7.2752.
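As a hedged illustration, the sketch below reproduces a wikitext2 perplexity measurement with Hugging Face transformers. The Hub id, window length, and non-overlapping stride are assumptions rather than the exact Quark evaluation recipe, and loading the FP8 checkpoint this way requires a transformers build with Quark support, so absolute numbers may differ slightly from those above.

```python
# Hypothetical perplexity check on wikitext2; window/stride choices are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amd/Llama-3.1-8B-Instruct-FP8-KV"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

window = 4096  # non-overlapping windows; an assumption, not the Quark recipe
nlls, n_tokens = [], 0
for begin in range(0, encodings.input_ids.size(1), window):
    input_ids = encodings.input_ids[:, begin : begin + window].to(model.device)
    if input_ids.size(1) < 2:
        break
    with torch.no_grad():
        # The causal-LM loss is the mean negative log-likelihood per predicted token.
        loss = model(input_ids, labels=input_ids).loss
    nlls.append(loss * (input_ids.size(1) - 1))  # re-weight by predicted-token count
    n_tokens += input_ids.size(1) - 1

print(f"wikitext2 perplexity: {torch.exp(torch.stack(nlls).sum() / n_tokens).item():.4f}")
```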
## Implementation Details
Quantization is applied through the Quark framework and targets all linear layers except the language model head. The scheme consists of the following (a minimal sketch of per-tensor FP8 quantization appears after the list):
- FP8 symmetric per-tensor quantization for weights
- FP8 symmetric per-tensor quantization for activations
- FP8 symmetric per-tensor quantization for KV cache
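For intuition, here is a minimal sketch of symmetric per-tensor FP8 (e4m3) quantization in plain PyTorch. It illustrates the scheme listed above, not Quark's actual implementation; the scale derivation and clamping bounds are illustrative assumptions.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def quantize_per_tensor_fp8(x: torch.Tensor):
    """Symmetric per-tensor quantization: one scale for the whole tensor, no zero-point."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    q = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

def dequantize_fp8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)  # stand-in for a linear layer's weight
q, s = quantize_per_tensor_fp8(w)
err = (dequantize_fp8(q, s) - w).abs().max()
print(f"stored at 1 byte/element, max abs round-trip error: {err.item():.5f}")
```

The same recipe applies to activations and KV-cache tensors, except that their scales are typically calibrated from sample data rather than computed from static weights.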
## Core Capabilities
- Efficient serving via vLLM backend compatibility (see the deployment sketch after this list)
- Near-baseline accuracy with only a minimal perplexity increase
- Reduced memory footprint from FP8 weights, activations, and KV cache
- Multi-GPU deployment support for larger workloads
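The sketch below shows one plausible way to serve the model with vLLM, combining the FP8 KV cache with tensor parallelism for multi-GPU deployment. The model id and flag values are assumptions, and supported quantization options vary across vLLM releases, so check your version's documentation.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Llama-3.1-8B-Instruct-FP8-KV",  # assumed Hub id
    kv_cache_dtype="fp8",     # store the key-value cache in FP8
    tensor_parallel_size=2,   # shard across 2 GPUs; adjust to your hardware
)

outputs = llm.generate(
    ["Explain FP8 quantization in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

Storing the KV cache at one byte per element roughly halves its footprint relative to FP16, which in turn raises the batch size or context length a given GPU can serve.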
## Frequently Asked Questions
Q: What makes this model unique?
Its FP8 quantization extends beyond weights and activations to the key-value cache, reducing memory requirements while keeping accuracy close to the unquantized baseline.
Q: What are the recommended use cases?
The model is ideal for deployment scenarios where memory efficiency is crucial but accuracy cannot be compromised. It is particularly well suited to production environments using the vLLM backend.