Llama-2-70B-GPTQ

Maintained by: TheBloke

Property              Value
Base Model            Meta Llama-2-70B
Parameter Count       70 billion
License               Llama 2
Paper                 Research Paper
Quantization Options  4-bit and 3-bit

What is Llama-2-70B-GPTQ?

Llama-2-70B-GPTQ is a quantized version of Meta's Llama-2-70B model, optimized by TheBloke for efficient deployment while maintaining performance. This implementation uses GPTQ quantization to reduce the model's size and memory requirements, making it more accessible for practical applications.
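To put the quantization options in perspective, here is a back-of-the-envelope sketch (plain Python arithmetic, not part of any library) of approximate weight storage for a 70-billion-parameter model at different bit widths; the on-disk sizes listed under Implementation Details below are slightly larger because group-wise scales and other metadata add overhead.

```python
# Rough arithmetic only: why GPTQ quantization matters for a 70B-parameter model.
PARAMS = 70e9  # approximate parameter count of Llama-2-70B

def approx_weight_gb(bits_per_weight: float) -> float:
    """Approximate weight storage in GB, ignoring group-wise scales and other overhead."""
    return PARAMS * bits_per_weight / 8 / 1e9

print(f"fp16 : ~{approx_weight_gb(16):.0f} GB")  # ~140 GB -- unquantized half precision
print(f"4-bit: ~{approx_weight_gb(4):.0f} GB")   # ~35 GB  (listed 4-bit variants: 37.99-40.66 GB)
print(f"3-bit: ~{approx_weight_gb(3):.0f} GB")   # ~26 GB  (listed 3-bit-128g variant: 28.03 GB)
```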

Implementation Details

The model offers multiple quantization options, including 4-bit and 3-bit versions with various group sizes (32g, 64g, 128g). Each variant provides different trade-offs between VRAM usage and model accuracy. The 4-bit versions are compatible with ExLlama, while 3-bit versions offer maximum VRAM efficiency.

  • 4-bit-32g variant: Highest inference quality (40.66 GB)
  • 4-bit-64g variant: Balanced performance (37.99 GB)
  • 3-bit-128g variant: Minimum VRAM usage (28.03 GB)
  • Compatible with the AutoGPTQ and Transformers libraries (see the loading sketch below)
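
The following is a minimal loading sketch using the Transformers library, assuming the optimum and auto-gptq packages are installed so GPTQ checkpoints can be loaded directly. The revision argument selects a quantization branch; only "main" is shown here, and the exact branch names for each variant should be taken from the repository.

```python
# Minimal GPTQ loading sketch (assumes: pip install transformers optimum auto-gptq accelerate)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-70B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # shard layers across available GPUs (requires accelerate)
    revision="main",    # replace with the branch of the 4-bit or 3-bit variant you want
)
```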

Core Capabilities

  • Text generation with a 4096-token context window (a short generation sketch follows this list)
  • Supports multiple inference frameworks, including text-generation-webui
  • Achieves 68.9% accuracy on the MMLU benchmark
  • Optimized for English-language tasks
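
Continuing from the loading sketch above, a generation call looks like standard Transformers usage; the prompt and sampling settings here are purely illustrative.

```python
# Illustrative generation call; `model` and `tokenizer` come from the loading sketch above.
prompt = "Explain GPTQ quantization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=256,  # prompt plus generated tokens must fit in the 4096-token window
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```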

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its flexible quantization options, allowing users to choose between maximum quality (4-bit) and maximum efficiency (3-bit) based on their hardware constraints and use case requirements.
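As a rough illustration of that trade-off, the hypothetical helper below (not part of any library) picks a variant from the on-disk sizes listed under Implementation Details, given the VRAM available; the headroom factor is an assumed margin for activations and the KV cache.

```python
# Hypothetical helper: choose a quantization variant from the sizes listed above.
VARIANT_SIZES_GB = {
    "4-bit-32g": 40.66,   # highest inference quality
    "4-bit-64g": 37.99,   # balanced performance
    "3-bit-128g": 28.03,  # minimum VRAM usage
}

def pick_variant(vram_gb: float, headroom: float = 1.15) -> str | None:
    """Return the largest (highest-quality) variant whose weights fit with some headroom."""
    for name, size_gb in sorted(VARIANT_SIZES_GB.items(), key=lambda kv: -kv[1]):
        if size_gb * headroom <= vram_gb:
            return name
    return None  # nothing fits on a single device; shard across GPUs instead

print(pick_variant(48))  # -> "4-bit-32g" on a single 48 GB card
print(pick_variant(40))  # -> "3-bit-128g"
```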

Q: What are the recommended use cases?

The model is suitable for commercial and research applications in English, particularly for tasks requiring complex language understanding and generation. It's optimized for deployment in resource-constrained environments while maintaining high performance.
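
For a resource-constrained multi-GPU setup, one hedged approach is to cap per-device memory so the sharded weights leave headroom for activations and the KV cache. The figures below assume two 24 GB GPUs and are illustrative only, and the branch name is a placeholder for the 3-bit variant listed in the repository.

```python
# Memory-capped multi-GPU loading sketch (assumed hardware: two 24 GB GPUs).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-70B-GPTQ"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision="main",                      # substitute the 3-bit branch for lowest VRAM usage
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB"},  # cap each GPU to leave room for the KV cache
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```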
