Llama-2-7B-Chat-GPTQ

Maintained by: TheBloke

Property          Value
Base Model        Llama 2 7B Chat
Parameter Count   7 billion
License           Llama 2
Paper             arXiv:2307.09288
Quantization      4-bit GPTQ

What is Llama-2-7B-Chat-GPTQ?

Llama-2-7B-Chat-GPTQ is a 4-bit GPTQ quantization of Meta's Llama 2 7B chat model. GPTQ compresses the weights after training, shrinking the model's size and memory requirements while preserving its capabilities for dialogue-based applications.

Implementation Details

The model is offered in multiple quantization variants with different group sizes (32g, 64g, and 128g) and act-order settings. At 4 bits, the quantized weights come to approximately 4 GB, making the model deployable on consumer hardware.

  • Multiple GPTQ parameter configurations (4-bit with varying group sizes)
  • ExLlama compatibility for supported configurations
  • Quantized against the wikitext dataset with a sequence length of 4096
  • Works with both AutoGPTQ and text-generation-webui (see the loading sketch below)
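
A minimal loading sketch, assuming the Transformers GPTQ integration (the optimum and auto-gptq packages installed alongside transformers); the repo id follows TheBloke's Hugging Face naming:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face repo id for this quantized model.
model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Transformers reads the GPTQ quantization config shipped in the repo
# and loads the 4-bit weights; device_map="auto" places layers on
# whatever GPUs are available.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```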

Core Capabilities

  • Optimized for dialogue and chat applications (prompt template sketch below)
  • Maintains the safety and helpfulness alignment of the base Llama 2 chat model
  • Supports a context length of 4096 tokens
  • Compatible with multiple inference frameworks
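
Llama 2 chat models were trained against a specific prompt template, and the quantized model expects the same format. A small sketch of assembling that prompt (the system prompt text is only an example):

```python
def build_prompt(system_prompt: str, user_message: str) -> str:
    # Llama 2 chat format: the system prompt sits inside <<SYS>> tags
    # within the first [INST] ... [/INST] instruction block.
    return (
        f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

prompt = build_prompt(
    "You are a helpful, respectful and honest assistant.",
    "Summarize what GPTQ quantization does in one sentence.",
)
```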

Frequently Asked Questions

Q: What makes this model unique?

This model combines the capabilities of Llama 2 with efficient quantization, offering multiple compression variants that let you balance quality against resource usage. It is notable for retaining high output quality while cutting the model's size substantially through 4-bit quantization. A sketch of selecting a specific variant follows below.
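
On TheBloke's repositories each GPTQ variant is typically published on its own branch, so a particular group-size / act-order combination can be selected with the revision argument. A sketch under that assumption; the branch name is illustrative, so check the repository's branch list for the exact names:

```python
from transformers import AutoModelForCausalLM

# Branch name is an assumption based on TheBloke's usual naming
# convention; verify it against the repository before use.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
    device_map="auto",
)
```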

Q: What are the recommended use cases?

The model is best suited to dialogue applications, chatbots, and interactive text generation where resource efficiency matters. It is particularly useful for deployment on consumer-grade hardware while retaining most of the original Llama 2 chat model's quality; a self-contained example follows below.
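
As a concrete end-to-end example, a single chat turn through the Transformers text-generation pipeline (the sampling parameters are illustrative defaults, not recommendations from the model card):

```python
from transformers import pipeline

# Requires transformers with GPTQ support (optimum + auto-gptq installed).
chat = pipeline(
    "text-generation",
    model="TheBloke/Llama-2-7B-Chat-GPTQ",
    device_map="auto",
)

# Wrap the question in the Llama 2 chat template.
prompt = "[INST] What makes 4-bit quantization useful for chatbots? [/INST]"
result = chat(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```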
