Llama-2-7B-Chat-GPTQ

Maintained by: TheBloke

Property          Value
Base Model        Llama 2 7B Chat
Parameter Count   7 billion
License           Llama 2
Paper             arXiv:2307.09288
Quantization      4-bit GPTQ

What is Llama-2-7B-Chat-GPTQ?

Llama-2-7B-Chat-GPTQ is a 4-bit GPTQ quantization of Meta's Llama 2 7B chat model. GPTQ compresses the weights after training, shrinking the model's size and memory requirements while preserving its capabilities for dialogue-based applications.

Implementation Details

The model is offered in multiple quantization variants with different group sizes (32g, 64g, and 128g) and act-order settings. At 4 bits, the quantized weights come to approximately 4 GB, making the model deployable on consumer hardware.

  • Multiple GPTQ parameter configurations (4-bit with varying group sizes)
  • ExLlama compatibility for supported configurations
  • Quantized against the wikitext dataset with a sequence length of 4096
  • Works with both AutoGPTQ and text-generation-webui (see the loading sketch below)
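
A minimal loading sketch, assuming the Transformers GPTQ integration (the optimum and auto-gptq packages installed alongside transformers); the repo id follows TheBloke's Hugging Face naming:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face repo id for this quantized model.
model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Transformers reads the GPTQ quantization config shipped in the repo
# and loads the 4-bit weights; device_map="auto" places layers on
# whatever GPUs are available.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```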

Core Capabilities

  • Optimized for dialogue and chat applications (prompt template sketch below)
  • Maintains the safety and helpfulness alignment of the base Llama 2 chat model
  • Supports a context length of 4096 tokens
  • Compatible with multiple inference frameworks
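
Llama 2 chat models were trained against a specific prompt template, and the quantized model expects the same format. A small sketch of assembling that prompt (the system prompt text is only an example):

```python
def build_prompt(system_prompt: str, user_message: str) -> str:
    # Llama 2 chat format: the system prompt sits inside <<SYS>> tags
    # within the first [INST] ... [/INST] instruction block.
    return (
        f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

prompt = build_prompt(
    "You are a helpful, respectful and honest assistant.",
    "Summarize what GPTQ quantization does in one sentence.",
)
```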

Frequently Asked Questions

Q: What makes this model unique?

This model combines the capabilities of Llama 2 with efficient quantization, offering multiple compression variants that let you balance quality against resource usage. It is notable for retaining high output quality while cutting the model's size substantially through 4-bit quantization. A sketch of selecting a specific variant follows below.
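
On TheBloke's repositories each GPTQ variant is typically published on its own branch, so a particular group-size / act-order combination can be selected with the revision argument. A sketch under that assumption; the branch name is illustrative, so check the repository's branch list for the exact names:

```python
from transformers import AutoModelForCausalLM

# Branch name is an assumption based on TheBloke's usual naming
# convention; verify it against the repository before use.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
    device_map="auto",
)
```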

Q: What are the recommended use cases?

The model is best suited to dialogue applications, chatbots, and interactive text generation where resource efficiency matters. It is particularly useful for deployment on consumer-grade hardware while retaining most of the original Llama 2 chat model's quality; a self-contained example follows below.
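
As a concrete end-to-end example, a single chat turn through the Transformers text-generation pipeline (the sampling parameters are illustrative defaults, not recommendations from the model card):

```python
from transformers import pipeline

# Requires transformers with GPTQ support (optimum + auto-gptq installed).
chat = pipeline(
    "text-generation",
    model="TheBloke/Llama-2-7B-Chat-GPTQ",
    device_map="auto",
)

# Wrap the question in the Llama 2 chat template.
prompt = "[INST] What makes 4-bit quantization useful for chatbots? [/INST]"
result = chat(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```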
