OpenAssistant-Llama-30b-4bit
| Property | Value |
|---|---|
| Author | MetaIX |
| Framework | PyTorch |
| Base Model | Llama 30B |
| Quantization Types | GPTQ & GGML |
What is OpenAssistant-Llama-30b-4bit?
OpenAssistant-Llama-30b-4bit is a quantized version of OpenAssistant's native fine-tuned Llama 30B model. It offers multiple quantization options to accommodate different hardware setups and performance requirements. The model includes both GPTQ and GGML variants, making it suitable for both GPU and CPU deployment.
Implementation Details
The model comes in five quantized versions: two GPTQ variants (with different optimization parameters) and three GGML variants. The GPTQ versions are quantized with either the --true-sequential --act-order or the --true-sequential --groupsize 128 configuration. The GGML versions are quantized with the q4_1, q5_0, and q5_1 methods, as summarized below (a loading sketch follows the list).
- GPTQ variant with --true-sequential and --act-order optimizations (about 24GB VRAM usage)
- GPTQ variant with --true-sequential and --groupsize 128 optimizations (higher VRAM usage, lower perplexity)
- Three GGML variants (q4_1, q5_0, q5_1) for CPU usage
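As a rough illustration, the GPTQ weights can be loaded with the AutoGPTQ library along the lines below. This is a minimal sketch, not the author's documented workflow: the Hugging Face repo id, device placement, and safetensors format are assumptions based on this card, and the actual files may additionally require a model_basename argument.

```python
# Minimal sketch: loading one of the GPTQ variants with AutoGPTQ.
# The repo id and weight format are assumptions, not verified values.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo_id = "MetaIX/OpenAssistant-Llama-30b-4bit"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)
model = AutoGPTQForCausalLM.from_quantized(
    repo_id,
    device="cuda:0",       # the act-order variant uses roughly 24GB of VRAM
    use_safetensors=True,  # assumption: weights are shipped as .safetensors
)

prompt = "Explain 4-bit quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The same loaded model plugs into the usual transformers generate() loop, so existing pipelines need few changes beyond the loading call.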
Core Capabilities
- Efficient resource utilization with 4-bit quantization
- Supports both GPU and CPU deployment (see the CPU inference sketch after this list)
- Benchmark perplexities for the two GPTQ variants: Wikitext2 4.96/4.64, PTB-New 9.64/9.12, C4-New 7.20/6.87 (lower is better)
- Compatible with Oobabooga's Text Generation WebUI and KoboldAI
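For CPU inference with the GGML variants, a sketch along the following lines should work with llama-cpp-python, assuming a release old enough to read legacy GGML files (current releases expect GGUF). The local filename here is hypothetical.

```python
# Minimal CPU-inference sketch with llama-cpp-python (legacy GGML support assumed).
from llama_cpp import Llama

llm = Llama(
    model_path="./openassistant-llama-30b-q5_1.bin",  # hypothetical local filename
    n_ctx=2048,    # Llama's native context window
    n_threads=8,   # tune to your physical core count
)

result = llm("Summarize what GGML quantization does.", max_tokens=128)
print(result["choices"][0]["text"])
```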
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its versatile quantization options and optimized performance across different hardware configurations. It provides multiple variants that trade VRAM usage against output quality.
Q: What are the recommended use cases?
The model is ideal for text generation tasks where resource efficiency is crucial. The GPTQ variants are recommended for GPU users with at least 24GB VRAM, while GGML variants are perfect for CPU-based deployments.
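As a hypothetical illustration of that guidance, the helper below suggests a variant based on the detected hardware. The 24GB threshold comes from this card; the function itself is invented for illustration.

```python
# Hypothetical helper: suggest a variant per this card's hardware guidance.
import torch

def suggest_variant() -> str:
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        if vram_gb >= 24:  # card recommends GPTQ for GPUs with at least 24GB VRAM
            return "GPTQ (--act-order or --groupsize 128)"
    return "GGML (q4_1 / q5_0 / q5_1) on CPU"

print(f"Suggested variant: {suggest_variant()}")
```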