Llama-2-13B-chat-GPTQ
| Property | Value |
|---|---|
| Base Model | Meta's Llama-2-13B-chat |
| Parameter Count | 13 billion |
| Quantization | 4-bit GPTQ (main branch; other variants available) |
| License | Llama 2 Community License |
| Paper | Llama 2: Open Foundation and Fine-Tuned Chat Models (arXiv:2307.09288) |
What is Llama-2-13B-chat-GPTQ?
Llama-2-13B-chat-GPTQ is a GPTQ-quantized version of Meta's Llama 2 chat model, optimized for efficient deployment. Quantizing the weights to 4-bit precision reduces the model's size and memory footprint while preserving the core capabilities of the original dialogue-tuned model.
Implementation Details
The repository provides several quantized variants, with the main branch offering 4-bit precision at a group size of 128. Quantization was calibrated on the wikitext dataset with a sequence length of 4096 tokens. The model is compatible with multiple frameworks, including AutoGPTQ, Transformers, and ExLlama.
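As a rough illustration (not the exact recipe used for this checkpoint), these settings map onto the Transformers GPTQ integration roughly as follows. The base-model id `meta-llama/Llama-2-13b-chat-hf` and the output path are assumptions; the numeric parameters mirror the description above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_id)
gptq_config = GPTQConfig(
    bits=4,               # 4-bit precision (the main-branch variant)
    group_size=128,       # 128g; smaller groups like 32g trade memory for accuracy
    dataset="wikitext2",  # calibration dataset, per the description above
    model_seqlen=4096,    # calibration sequence length
    tokenizer=tokenizer,
)

# Passing a GPTQConfig quantizes the fp16 weights on the fly during loading.
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("llama-2-13b-chat-gptq-4bit-128g")  # illustrative path
```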
- Multiple quantization options (4-bit and 8-bit variants)
- Group sizes from 32g to 128g for different accuracy/memory trade-offs
- Compatible with major frameworks and inference engines (see the loading sketch after this list)
- Optimized for dialogue use cases via the built-in Llama 2 chat template
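For inference, a GPTQ checkpoint like this one can typically be loaded directly through Transformers (with the AutoGPTQ/optimum backend installed). The repository id and branch name below are illustrative placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative repo id -- substitute the actual GPTQ repository you are using.
model_id = "TheBloke/Llama-2-13B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # spread layers across the available GPU(s)
    revision="main",     # select another branch for other bit/group-size variants
)
```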
Core Capabilities
- Chat-optimized responses with the base model's safety tuning (see the chat-template sketch after this list)
- Context window of 4096 tokens
- Supports multiple inference frameworks
- Quantization variants for different hardware configurations
- Retains most of the base model's quality while substantially reducing resource requirements
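Continuing from the loading sketch above, the Llama 2 chat format can be applied through the tokenizer's chat template rather than hand-building prompt strings; the messages below are illustrative:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what GPTQ quantization does."},
]

# apply_chat_template wraps the conversation in the Llama 2 chat format
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```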
Frequently Asked Questions
Q: What makes this model unique?
A: It offers a practical balance between output quality and resource usage through GPTQ quantization, making it feasible to deploy on consumer hardware while largely maintaining the quality of the original Llama 2 model.
Q: What are the recommended use cases?
A: The model is best suited for dialogue applications, chatbots, and interactive AI assistants where efficient deployment is crucial. It is particularly valuable when good performance is needed on limited hardware.