OpenAssistant-Llama-30b-4bit
| Property | Value |
|---|---|
| Author | MetaIX |
| Framework | PyTorch |
| Base Model | Llama 30B |
| Quantization Types | GPTQ & GGML |
What is OpenAssistant-Llama-30b-4bit?
OpenAssistant-Llama-30b-4bit is a quantized version of OpenAssistant's native fine-tuned Llama 30B model. It offers multiple quantization options to accommodate different hardware setups and performance requirements. The model includes both GPTQ and GGML variants, making it suitable for both GPU and CPU deployment.
Implementation Details
The model comes in five quantized versions: two GPTQ variants (with different optimization parameters) and three GGML variants. The GPTQ versions are quantized with either the --true-sequential --act-order or the --true-sequential --groupsize 128 configuration. The GGML versions are quantized with the q4_1, q5_0, and q5_1 methods, as summarized below (a loading sketch follows the list).
- GPTQ variant with --true-sequential and --act-order optimizations (about 24GB VRAM usage)
- GPTQ variant with --true-sequential and --groupsize 128 optimizations (higher VRAM usage, lower perplexity)
- Three GGML variants (q4_1, q5_0, q5_1) for CPU usage
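As a rough illustration, the GPTQ weights can be loaded with the AutoGPTQ library along the lines below. This is a minimal sketch, not the author's documented workflow: the Hugging Face repo id, device placement, and safetensors format are assumptions based on this card, and the actual files may additionally require a model_basename argument.

```python
# Minimal sketch: loading one of the GPTQ variants with AutoGPTQ.
# The repo id and weight format are assumptions, not verified values.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo_id = "MetaIX/OpenAssistant-Llama-30b-4bit"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)
model = AutoGPTQForCausalLM.from_quantized(
    repo_id,
    device="cuda:0",       # the act-order variant uses roughly 24GB of VRAM
    use_safetensors=True,  # assumption: weights are shipped as .safetensors
)

prompt = "Explain 4-bit quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The same loaded model plugs into the usual transformers generate() loop, so existing pipelines need few changes beyond the loading call.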
Core Capabilities
- Efficient resource utilization with 4-bit quantization
- Supports both GPU and CPU deployment (see the CPU inference sketch after this list)
- Benchmark perplexities for the two GPTQ variants: Wikitext2 4.96/4.64, PTB-New 9.64/9.12, C4-New 7.20/6.87 (lower is better)
- Compatible with Oobabooga's Text Generation WebUI and KoboldAI
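For CPU inference with the GGML variants, a sketch along the following lines should work with llama-cpp-python, assuming a release old enough to read legacy GGML files (current releases expect GGUF). The local filename here is hypothetical.

```python
# Minimal CPU-inference sketch with llama-cpp-python (legacy GGML support assumed).
from llama_cpp import Llama

llm = Llama(
    model_path="./openassistant-llama-30b-q5_1.bin",  # hypothetical local filename
    n_ctx=2048,    # Llama's native context window
    n_threads=8,   # tune to your physical core count
)

result = llm("Summarize what GGML quantization does.", max_tokens=128)
print(result["choices"][0]["text"])
```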
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its versatile quantization options and optimized performance across different hardware configurations. It provides multiple variants that trade VRAM usage against output quality.
Q: What are the recommended use cases?
The model is ideal for text generation tasks where resource efficiency is crucial. The GPTQ variants are recommended for GPU users with at least 24GB VRAM, while GGML variants are perfect for CPU-based deployments.
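As a hypothetical illustration of that guidance, the helper below suggests a variant based on the detected hardware. The 24GB threshold comes from this card; the function itself is invented for illustration.

```python
# Hypothetical helper: suggest a variant per this card's hardware guidance.
import torch

def suggest_variant() -> str:
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        if vram_gb >= 24:  # card recommends GPTQ for GPUs with at least 24GB VRAM
            return "GPTQ (--act-order or --groupsize 128)"
    return "GGML (q4_1 / q5_0 / q5_1) on CPU"

print(f"Suggested variant: {suggest_variant()}")
```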