Meta-Llama-3.1-8B-Instruct-FP8

Maintained By
neuralmagic

  • Parameter Count: 8.03B
  • Model Type: Instruction-tuned LLM
  • Supported Languages: 8 (en, de, fr, it, pt, hi, es, th)
  • License: llama3.1
  • Quantization: FP8 (weights and activations)

What is Meta-Llama-3.1-8B-Instruct-FP8?

Meta-Llama-3.1-8B-Instruct-FP8 is an optimized version of Meta's LLaMA 3.1 model, specifically designed for efficient deployment while maintaining nearly identical performance to its full-precision counterpart. Through FP8 quantization, it achieves a 50% reduction in disk size and GPU memory requirements while retaining 99.52% of the original model's performance.

Implementation Details

The model applies symmetric per-tensor quantization to both the weights and activations of the linear operators inside the transformer blocks. It is optimized for deployment with vLLM and was calibrated on 512 sequences from the UltraChat dataset.
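To make the quantization scheme concrete, here is a minimal NumPy sketch of symmetric per-tensor FP8 quantization. The FP8 E4M3 rounding is only simulated (3-bit mantissa rounding, no subnormal handling), and all function names are illustrative; this is not the actual quantization code used to produce the model.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def simulate_fp8_e4m3(x):
    """Round values to a 3-bit mantissa, approximating FP8 E4M3 precision."""
    m, e = np.frexp(x)                     # x = m * 2**e with 0.5 <= |m| < 1
    return np.ldexp(np.round(m * 16) / 16, e)


def quantize_per_tensor(w):
    """Symmetric per-tensor quantization: a single scale for the whole tensor."""
    scale = np.abs(w).max() / FP8_E4M3_MAX  # map the largest value to FP8 max
    q = simulate_fp8_e4m3(w / scale)
    return q, scale


def dequantize(q, scale):
    return q * scale


# Round-trip a random weight matrix and measure the reconstruction error.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, s = quantize_per_tensor(w)
w_hat = dequantize(q, s)
rel_err = np.abs(w - w_hat).max() / np.abs(w).max()
```

Because a single scale covers the whole tensor, the worst-case relative error is bounded by the mantissa step at the largest magnitude, which is why the model can recover nearly all of the original accuracy.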

  • Achieves 73.44 average score on OpenLLM benchmark (vs 73.79 for original)
  • Optimized for commercial and research applications
  • Compatible with vLLM for efficient inference
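Since the model targets vLLM, an offline-inference call might look like the following sketch. The model ID and sampling settings are illustrative; running it requires vLLM installed, a suitable GPU, and downloading the weights on first use.

```python
from vllm import LLM, SamplingParams

# Loads the FP8 checkpoint; FP8 kernels need recent GPU architectures,
# and vLLM can otherwise fall back to higher-precision execution.
llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8")
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = ["Explain FP8 quantization in one paragraph."]
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```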

Core Capabilities

  • Multi-lingual support across 8 languages
  • Assistant-style chat functionality
  • Strong performance on key benchmarks (MMLU: 67.97%, ARC Challenge: 81.66%, GSM-8K: 81.12%)
  • 50% reduced resource requirements compared to original model

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its efficient FP8 quantization, which halves disk and GPU memory requirements while preserving 99.52% of the original model's average benchmark performance. It is particularly notable for holding that performance across multiple languages and complex reasoning tasks.

Q: What are the recommended use cases?

The model is ideal for commercial and research applications requiring efficient deployment of large language models, particularly in multi-lingual contexts. It's specifically designed for assistant-like chat applications where resource optimization is crucial but performance cannot be compromised.
