Meta-Llama-3-8B-Instruct-FP8

Maintained by: neuralmagic

Property         Value
Parameter Count  8.03B
Model Type       Instruction-tuned Language Model
License          Llama 3
Quantization     FP8
Language         English

What is Meta-Llama-3-8B-Instruct-FP8?

Meta-Llama-3-8B-Instruct-FP8 is an FP8-quantized version of Meta's Llama 3 8B Instruct model, optimized for efficient deployment while maintaining near-original performance. FP8 quantization cuts the model's memory footprint by approximately 50% while preserving 99.28% of the original model's accuracy.
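The 50% figure follows from per-parameter storage alone (a back-of-the-envelope sketch; it ignores the KV cache and other runtime buffers, which do not shrink with weight quantization):

```python
# Weight memory only: FP16 stores 2 bytes/param, FP8 stores 1 byte/param
params = 8.03e9
fp16_gb = params * 2 / 1e9  # ~16.1 GB
fp8_gb = params * 1 / 1e9   # ~8.0 GB
print(f"FP16 ~{fp16_gb:.1f} GB, FP8 ~{fp8_gb:.1f} GB ({1 - fp8_gb / fp16_gb:.0%} smaller)")
```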

Implementation Details

The model applies symmetric per-tensor quantization to the linear operators within the transformer blocks, using AutoFP8 with calibration samples from UltraChat. It is designed for deployment with vLLM >= 0.5.0 and achieves an average score of 68.22 on the OpenLLM benchmark; a quantization sketch follows the list below.

  • Weight and activation quantization using FP8 data type
  • 50% reduction in disk size and GPU memory requirements
  • Optimized for vLLM deployment
  • Calibrated with 512 sequences drawn from UltraChat
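As a concrete illustration, here is a minimal sketch of the calibration-and-quantization flow using Neural Magic's AutoFP8 library (`AutoFP8ForCausalLM`, `BaseQuantizeConfig`). The calibration dataset shown (`HuggingFaceH4/ultrachat_200k`) is an assumption standing in for whichever UltraChat slice the maintainers actually used:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# 512 calibration samples, chat-templated to match the instruct format
# (assumed dataset; the exact UltraChat slice used upstream is not specified here)
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(row["messages"], tokenize=False) for row in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Static scheme: per-tensor scales for weights and activations are computed
# once from the calibration data and stored in the checkpoint
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```

Because the activation scales are static, vLLM can load the saved checkpoint and serve it directly, with no further calibration at inference time.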

Core Capabilities

  • Assistant-like chat functionality
  • Maintains high performance across various benchmarks (MMLU, ARC Challenge, GSM-8K)
  • Efficient inference with reduced resource requirements
  • English language processing and generation

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its FP8 quantization, which significantly reduces resource requirements while preserving 99.28% of the original model's performance. It is specifically optimized for production deployment with vLLM; a deployment sketch follows below.
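A minimal inference sketch with vLLM, assuming the checkpoint is published under the `neuralmagic/Meta-Llama-3-8B-Instruct-FP8` repository ID (substitute the actual ID if it differs):

```python
from vllm import LLM, SamplingParams

# Assumed repository ID; requires vLLM >= 0.5.0 and an FP8-capable GPU
model = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8")

sampling = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128)
outputs = model.generate(["Hello, my name is"], sampling)
print(outputs[0].outputs[0].text)
```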

Q: What are the recommended use cases?

The model is best suited for commercial and research applications requiring English language processing, particularly in assistant-like chat scenarios where resource efficiency is important. It's designed to handle various tasks while consuming less memory than the original model.
