Llama-2-13B-chat-GPTQ
| Property | Value |
|---|---|
| Base Model | Meta's Llama-2-13B-chat |
| Parameter Count | 13 billion |
| Quantization | 4-bit GPTQ (main branch; other variants available) |
| License | Llama 2 Community License |
| Paper | Llama 2: Open Foundation and Fine-Tuned Chat Models (arXiv:2307.09288) |
What is Llama-2-13B-chat-GPTQ?
Llama-2-13B-chat-GPTQ is a GPTQ-quantized version of Meta's Llama 2 chat model, optimized for efficient deployment. Quantizing the weights to 4-bit precision reduces the model's size and memory footprint while preserving the core capabilities of the original dialogue-tuned model.
Implementation Details
The repository provides several quantized variants, with the main branch offering 4-bit precision at a group size of 128. Quantization was calibrated on the wikitext dataset with a sequence length of 4096 tokens. The model is compatible with multiple frameworks, including AutoGPTQ, Transformers, and ExLlama.
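As a rough illustration (not the exact recipe used for this checkpoint), these settings map onto the Transformers GPTQ integration roughly as follows. The base-model id `meta-llama/Llama-2-13b-chat-hf` and the output path are assumptions; the numeric parameters mirror the description above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_id)
gptq_config = GPTQConfig(
    bits=4,               # 4-bit precision (the main-branch variant)
    group_size=128,       # 128g; smaller groups like 32g trade memory for accuracy
    dataset="wikitext2",  # calibration dataset, per the description above
    model_seqlen=4096,    # calibration sequence length
    tokenizer=tokenizer,
)

# Passing a GPTQConfig quantizes the fp16 weights on the fly during loading.
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("llama-2-13b-chat-gptq-4bit-128g")  # illustrative path
```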
- Multiple quantization options (4-bit and 8-bit variants)
- Group sizes from 32g to 128g for different accuracy/memory trade-offs
- Compatible with major frameworks and inference engines (see the loading sketch after this list)
- Optimized for dialogue use cases via the built-in Llama 2 chat template
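For inference, a GPTQ checkpoint like this one can typically be loaded directly through Transformers (with the AutoGPTQ/optimum backend installed). The repository id and branch name below are illustrative placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative repo id -- substitute the actual GPTQ repository you are using.
model_id = "TheBloke/Llama-2-13B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # spread layers across the available GPU(s)
    revision="main",     # select another branch for other bit/group-size variants
)
```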
Core Capabilities
- Chat-optimized responses with the base model's safety tuning (see the chat-template sketch after this list)
- Context window of 4096 tokens
- Supports multiple inference frameworks
- Quantization variants for different hardware configurations
- Retains most of the base model's quality while substantially reducing resource requirements
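Continuing from the loading sketch above, the Llama 2 chat format can be applied through the tokenizer's chat template rather than hand-building prompt strings; the messages below are illustrative:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what GPTQ quantization does."},
]

# apply_chat_template wraps the conversation in the Llama 2 chat format
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```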
Frequently Asked Questions
Q: What makes this model unique?
A: It offers a practical balance between output quality and resource usage through GPTQ quantization, making it feasible to deploy on consumer hardware while largely maintaining the quality of the original Llama 2 model.
Q: What are the recommended use cases?
A: The model is best suited for dialogue applications, chatbots, and interactive AI assistants where efficient deployment is crucial. It is particularly valuable when good performance is needed on limited hardware.