Llama-3.2-3B-Instruct-FP8

Maintained by: neuralmagic

Property             Value
Parameter Count      3.61B
Model Type           Instruction-tuned Language Model
Architecture         LLaMA-3
License              LLaMA 3.2
Supported Languages  8 (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai)

What is Llama-3.2-3B-Instruct-FP8?

Llama-3.2-3B-Instruct-FP8 is an optimized version of Meta's Llama-3.2-3B-Instruct model in which both weights and activations are quantized to the FP8 data type. The quantization reduces GPU memory requirements by approximately 50% while retaining 99.7% of the original model's scores across the evaluated benchmarks.

Implementation Details

The quantization targets the linear operators within transformer blocks: weights use symmetric static per-channel quantization and activations use symmetric per-tensor quantization, both applied through the llm-compressor library (a recipe sketch follows the list below).

  • Weights and activations reduced from 16 bits to 8 bits per value
  • 50% reduction in GPU memory usage
  • 2x increase in matrix-multiply compute throughput
  • Calibrated using 512 sequences from Neural Magic's LLM compression dataset
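
The card does not reproduce the exact recipe, but a minimal one-shot FP8 recipe with llm-compressor looks roughly like the sketch below. The calibration dataset identifier, sequence length, and ignore list are illustrative assumptions, not the published configuration.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

# Static FP8 quantization of every Linear operator in the transformer
# blocks; the lm_head is excluded here so output logits stay in 16 bits.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8",        # symmetric static FP8 for weights and activations
    ignore=["lm_head"],  # assumed exclusion, common in published FP8 recipes
)

oneshot(
    model=MODEL_ID,
    dataset="neuralmagic/LLM_compression_calibration",  # assumed dataset ID
    recipe=recipe,
    num_calibration_samples=512,  # 512 calibration sequences, per the card
    max_seq_length=4096,          # assumed
    output_dir="Llama-3.2-3B-Instruct-FP8",
)
```

Static (rather than dynamic) activation quantization is what makes the calibration pass necessary: the per-tensor activation scales are fixed from the 512 calibration sequences instead of being recomputed at inference time.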

Core Capabilities

  • Multilingual support across 8 languages
  • Assistant-style chat functionality
  • Benchmark performance: 62.61% on MMLU (5-shot), 77.86% on GSM-8K
  • Efficient deployment through the vLLM backend (see the example after this list)
  • Optimized for commercial and research applications
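
A minimal offline-inference sketch with vLLM, assuming the checkpoint is published under the Hugging Face ID neuralmagic/Llama-3.2-3B-Instruct-FP8:

```python
from vllm import LLM, SamplingParams

# vLLM reads the compressed-tensors quantization config stored with the
# checkpoint and selects FP8 kernels automatically; no extra flags needed.
llm = LLM(model="neuralmagic/Llama-3.2-3B-Instruct-FP8")

params = SamplingParams(temperature=0.7, max_tokens=128)
prompts = ["Explain FP8 quantization in one paragraph."]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```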

Frequently Asked Questions

Q: What makes this model unique?

The model's primary distinction lies in its efficient FP8 quantization, which significantly reduces resource requirements while maintaining near-original performance. This makes it particularly valuable for deployment scenarios where computational resources are constrained.

Q: What are the recommended use cases?

The model is well suited to commercial and research applications that need multilingual capabilities. It excels in assistant-style chat scenarios and can be deployed in production through the vLLM backend, for example behind vLLM's OpenAI-compatible server as sketched below.
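
As an illustration of that production path, the sketch below queries a vLLM OpenAI-compatible server; the repo ID and localhost endpoint are assumptions for the example.

```python
# Assumes a server was started with:
#   vllm serve neuralmagic/Llama-3.2-3B-Instruct-FP8
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-3B-Instruct-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful multilingual assistant."},
        {"role": "user", "content": "¿Qué ventajas tiene la cuantización FP8?"},
    ],
)
print(response.choices[0].message.content)
```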
