Llama-3.2-1B-Instruct-quantized.w8a8

Maintained By: neuralmagic

  • Parameter Count: 1.5B
  • License: Llama 3.2
  • Supported Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
  • Research Papers: SmoothQuant, GPTQ

What is Llama-3.2-1B-Instruct-quantized.w8a8?

This is an optimized version of the Llama-3.2-1B-Instruct model with 8-bit (INT8) quantization applied to both weights and activations (W8A8). The quantized model scores within 5% of the original model across the evaluated benchmarks while reducing GPU memory requirements by approximately 50%.
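
A rough back-of-the-envelope check of the memory claim, assuming the listed 1.5B parameter count and a 16-bit (BF16) unquantized baseline:

```python
# Rough estimate of weight memory before and after W8A8 quantization.
# Assumes the listed 1.5B parameter count and a BF16 (2 bytes/param) baseline;
# real footprints also include activations, KV cache, and framework overhead.
params = 1.5e9

bf16_gib = params * 2 / 2**30   # 16-bit weights: 2 bytes per parameter
int8_gib = params * 1 / 2**30   # INT8 weights:   1 byte per parameter

print(f"BF16 weights: ~{bf16_gib:.1f} GiB")              # ~2.8 GiB
print(f"INT8 weights: ~{int8_gib:.1f} GiB")              # ~1.4 GiB
print(f"Reduction:     {1 - int8_gib / bf16_gib:.0%}")   # 50%
```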

Implementation Details

The model is quantized with a combination of SmoothQuant, which migrates activation outliers into the weights so that activations quantize more cleanly, and GPTQ, which quantizes the weights using calibration data. Weights use symmetric static per-channel INT8 quantization and activations use symmetric dynamic per-token INT8 quantization, balancing accuracy against inference efficiency (see the sketch after the list below).

  • 50% reduction in GPU memory requirements
  • 2x increase in matrix-multiply compute throughput
  • 8-bit precision for weights and activations
  • Maintains 98.7% average performance compared to base model
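
The following is an illustrative sketch of how such a SmoothQuant + GPTQ W8A8 recipe might be expressed with the llm-compressor library; the import paths, calibration dataset, sample count, and smoothing strength are assumptions for illustration, not the exact configuration used to produce this checkpoint.

```python
# Illustrative W8A8 recipe sketch (SmoothQuant + GPTQ) using llm-compressor.
# Import paths and hyperparameters may differ across library versions.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    # Shift activation outliers into the weights so activations quantize more easily.
    SmoothQuantModifier(smoothing_strength=0.8),
    # Quantize Linear layers to INT8 weights/activations, keeping lm_head in full precision.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",   # unquantized base model
    dataset="open_platypus",                    # illustrative calibration dataset
    recipe=recipe,
    output_dir="Llama-3.2-1B-Instruct-quantized.w8a8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```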

Core Capabilities

  • Multilingual support across 8 languages
  • Assistant-like chat functionality
  • Benchmark performance: 47.95% on MMLU (5-shot), 46.70% on GSM-8K
  • Efficient deployment using vLLM backend
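
For the vLLM-based deployment noted above, offline inference looks roughly like the following minimal sketch; the Hugging Face repository ID neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 and the sampling settings are assumptions for illustration.

```python
# Minimal offline-inference sketch with vLLM; repository ID and sampling
# parameters are illustrative assumptions.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)

# Build a chat-formatted prompt; a German question exercises the multilingual support.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Erkläre kurz, was INT8-Quantisierung ist."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

sampling = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```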

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its efficient quantization approach that dramatically reduces resource requirements while maintaining near-original performance. It's particularly notable for achieving this balance across multiple languages and tasks.

Q: What are the recommended use cases?

The model is ideal for commercial and research applications requiring multilingual assistant-like chat capabilities, particularly in resource-constrained environments where efficient memory usage is crucial. It's specifically optimized for deployment scenarios requiring high throughput with limited resources.
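
For high-throughput serving in particular, a common pattern is to run the model behind vLLM's OpenAI-compatible server and query it with a standard OpenAI-style client; the host, port, and repository ID below are illustrative assumptions rather than values from this model card.

```python
# Assumes a vLLM OpenAI-compatible server was started separately, e.g.:
#   vllm serve neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8
# Host, port, and repository ID are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8",
    messages=[
        {"role": "system", "content": "You are a concise multilingual assistant."},
        {"role": "user", "content": "Resume en una frase qué es la cuantización W8A8."},
    ],
    max_tokens=128,
    temperature=0.7,
)
print(response.choices[0].message.content)
```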
