llama2_7b_chat_uncensored-GPTQ

Maintained by TheBloke

Llama2 7B Chat Uncensored GPTQ

Property         Value
---------------  ----------------------------
Parameter Count  7 Billion
Model Type       Llama2
License          Other + Meta Llama 2 License
Quantization     4-bit GPTQ

What is llama2_7b_chat_uncensored-GPTQ?

This is a quantized version of George Sung's Llama2 7B Chat Uncensored model, optimized by TheBloke for efficient deployment. The model was fine-tuned using QLoRA on the uncensored Wizard-Vicuna conversation dataset, making it suitable for open-ended dialogue generation without traditional content restrictions.

Implementation Details

The model offers multiple GPTQ quantization variants, each optimized for different hardware configurations and performance requirements. It uses a 4-bit precision base with varying group sizes (32g, 64g, 128g) and includes Act Order optimization options.

  • Multiple branch options with different quantization parameters
  • Compatible with AutoGPTQ, Transformers, and ExLlama (see the loading sketch after this list)
  • Supports sequence lengths up to 4096 tokens
  • Includes optimized configurations for different VRAM requirements
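
Since each quantization variant lives on its own repository branch, selecting one is a matter of passing the branch name as the revision. Below is a minimal loading sketch, assuming a recent transformers version with the optimum and auto-gptq packages installed; the branch name shown follows TheBloke's usual naming scheme and should be checked against the repo's actual branch list.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/llama2_7b_chat_uncensored-GPTQ"

# Each GPTQ variant lives on its own branch; pass it as `revision`.
# "main" is the default; a name like "gptq-4bit-32g-actorder_True"
# follows TheBloke's usual branch-naming scheme (verify against the
# repo's branch list for the exact variants available).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place layers on the available GPU(s)
    revision="main",
)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
```

Smaller group sizes (32g) generally trade a little extra VRAM for slightly higher quantization accuracy, while 128g is the more memory-frugal default.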

Core Capabilities

  • Efficient GPU inference with reduced memory footprint
  • Maintains model quality while reducing size to ~4 GB
  • Supports both direct Transformers integration and pipeline usage
  • Optimized for chat-based applications via a specific prompt template (see the usage sketch after this list)
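
As a quick illustration of pipeline usage, here is a minimal sketch using the transformers text-generation pipeline and the Human/Response prompt template documented for the base model; the sampling parameters are illustrative defaults, not tuned values.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "TheBloke/llama2_7b_chat_uncensored-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The base model was fine-tuned on "### HUMAN:" / "### RESPONSE:" turns,
# so prompts should follow that template.
prompt = "Tell me about GPTQ quantization."
prompt_template = f"""### HUMAN:
{prompt}

### RESPONSE:
"""

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,  # illustrative generation settings
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)
print(pipe(prompt_template)[0]["generated_text"])
```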

Frequently Asked Questions

Q: What makes this model unique?

This model combines Llama 2's capabilities with uncensored fine-tuning data while offering multiple quantization options for efficient deployment. GPTQ quantization preserves output quality while significantly reducing resource requirements.
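
To sanity-check the reduced-footprint claim on your own hardware, a minimal sketch follows; it assumes optimum and auto-gptq are installed and a CUDA device is available. The ~4 GB figure covers the 4-bit weights only, with the KV cache and activations adding runtime overhead on top.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/llama2_7b_chat_uncensored-GPTQ",
    device_map="auto",
)

# Weight memory of the quantized model (bytes -> GB).
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")

# Actual GPU allocation, which also includes runtime buffers.
if torch.cuda.is_available():
    print(f"CUDA allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
```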

Q: What are the recommended use cases?

The model is best suited for applications requiring open-ended dialogue generation, particularly where traditional content restrictions might be limiting. It's especially useful in scenarios where GPU memory is constrained but model quality needs to be maintained.
