Llama-3.1-8B-Instruct-FP8-KV

Maintained by: amd

Parameter Count: 8.03B
Model Type: Instruction-tuned LLM
License: Llama 3.1
Quantization: FP8 symmetric per-tensor
Base Model: Meta-Llama-3.1-8B-Instruct

What is Llama-3.1-8B-Instruct-FP8-KV?

This is a quantized version of Meta's Llama 3.1 8B Instruct model, produced with AMD's Quark framework. It uses FP8 quantization for improved efficiency while maintaining accuracy, with only a minimal perplexity increase on WikiText-2 (7.2169 for the baseline vs. 7.2752 for the quantized model).
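
A comparison like this can be approximated with EleutherAI's lm-evaluation-harness; the sketch below is illustrative only, since AMD's reported numbers come from their own evaluation setup, and whether the vLLM backend accepts these exact `model_args` depends on your versions.

```python
# Hypothetical WikiText perplexity check with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). The "wikitext" task and
# simple_evaluate() are from the harness's documented API; the model id and
# kv_cache_dtype argument are assumptions for this sketch.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=amd/Llama-3.1-8B-Instruct-FP8-KV,"
        "kv_cache_dtype=fp8"
    ),
    tasks=["wikitext"],
)
print(results["results"]["wikitext"])
```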

Implementation Details

The model is quantized through the Quark framework, targeting all linear layers except the language model head. The scheme, sketched in the example after this list, includes:

  • FP8 symmetric per-tensor quantization for weights
  • FP8 symmetric per-tensor quantization for activations
  • FP8 symmetric per-tensor quantization for KV cache
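
For illustration, the sketch below shows roughly what such a flow looks like with Quark's PyTorch API. The class names (`ModelQuantizer`, `QuantizationSpec`, `QuantizationConfig`) and the `exclude` pattern follow Quark's published examples and may differ between releases; the calibration data is a stand-in, and this is not the exact recipe behind this checkpoint.

```python
# Illustrative sketch of an FP8 symmetric per-tensor quantization flow with
# AMD Quark. Names follow Quark's published PyTorch examples and may differ
# across releases -- treat this as an assumption, not the exact recipe used
# to produce this checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from quark.torch import ModelQuantizer
from quark.torch.quantization.config.config import (
    Config,
    QuantizationConfig,
    QuantizationSpec,
)
from quark.torch.quantization.config.type import Dtype, QSchemeType
from quark.torch.quantization.observer.observer import PerTensorMinMaxObserver

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed HF id of the base model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8 (e4m3), symmetric, one scale per tensor -- the scheme described above.
fp8_per_tensor = QuantizationSpec(
    dtype=Dtype.fp8_e4m3,
    qscheme=QSchemeType.per_tensor,
    observer_cls=PerTensorMinMaxObserver,
    symmetric=True,
    is_dynamic=False,
)

quant_config = Config(
    global_quant_config=QuantizationConfig(
        weight=fp8_per_tensor,         # weights
        input_tensors=fp8_per_tensor,  # activations
    ),
    exclude=["lm_head"],  # language model head stays unquantized
)
# KV-cache quantization: in Quark's LLM examples this is enabled by attaching
# an FP8 output-tensor spec to the k_proj/v_proj layers via a per-layer
# config; omitted here for brevity.

# A handful of tokenized samples serves as stand-in calibration data.
calib_data = [
    tokenizer("The quick brown fox jumps over the lazy dog.",
              return_tensors="pt").input_ids
    for _ in range(8)
]

quantizer = ModelQuantizer(quant_config)
quantized_model = quantizer.quantize_model(model, calib_data)
```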

Core Capabilities

  • Efficient deployment through vLLM backend compatibility (see the example below)
  • Near-baseline accuracy with minimal perplexity degradation
  • Reduced memory footprint from FP8 weights, activations, and KV cache
  • Multi-GPU deployment via tensor parallelism for larger workloads
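
As a concrete example, the following sketch loads the checkpoint with vLLM's offline `LLM` API. The model id `amd/Llama-3.1-8B-Instruct-FP8-KV` is assumed to be the published Hugging Face checkpoint, and `kv_cache_dtype` support and FP8 auto-detection depend on your vLLM version and hardware.

```python
# Minimal deployment sketch using vLLM's offline inference API; model id and
# GPU count are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Llama-3.1-8B-Instruct-FP8-KV",
    kv_cache_dtype="fp8",    # store the KV cache in FP8 as well
    tensor_parallel_size=2,  # shard across 2 GPUs for multi-GPU deployment
)

outputs = llm.generate(
    ["Explain FP8 quantization in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```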

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for extending FP8 quantization beyond weights and activations to the key-value cache, which reduces inference memory requirements, especially at long context lengths, while keeping accuracy close to the unquantized baseline.

Q: What are the recommended use cases?

The model is ideal for deployment scenarios where memory efficiency is crucial but accuracy cannot be compromised. It is particularly well suited to production environments using the vLLM backend, as in the client sketch below.
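
For a served deployment, any standard OpenAI-compatible client can talk to a vLLM server hosting the model. The server command, port, and model id below are assumptions for illustration.

```python
# Sketch of querying this model behind a vLLM OpenAI-compatible server.
# Assumes a server was started along the lines of:
#   vllm serve amd/Llama-3.1-8B-Instruct-FP8-KV --kv-cache-dtype fp8
# (exact flags depend on the vLLM version).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="amd/Llama-3.1-8B-Instruct-FP8-KV",
    messages=[
        {"role": "user",
         "content": "Summarize the benefits of FP8 KV-cache quantization."},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```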
