TinyLlama-1.1B-Chat-v0.3-GPTQ

Maintained by: TheBloke

Property         Value
Parameter Count  1.1B
License          Apache 2.0
Quantization     GPTQ 4-bit
Model Size       262M params

What is TinyLlama-1.1B-Chat-v0.3-GPTQ?

TinyLlama-1.1B-Chat-v0.3-GPTQ is a GPTQ-quantized version of the TinyLlama chat model, created by TheBloke from Zhang Peiyuan's original work. It is published in several quantization variants, letting users trade output quality against VRAM and disk footprint for efficient deployment.

Implementation Details

The model uses the same architecture and tokenizer as Llama 2, so it works as a drop-in replacement in many existing Llama-based projects. The TinyLlama base model is pretrained toward a 3-trillion-token budget, and this chat variant was fine-tuned on the OpenAssistant dataset using the ChatML prompt format.

  • Multiple GPTQ quantization options (4-bit and 8-bit variants)
  • Supports various group sizes (32g, 64g, 128g) for different VRAM requirements
  • Compatible with AutoGPTQ and text-generation-inference
  • Uses the ChatML prompt template for structured interactions (see the loading sketch below)
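
As a concrete starting point, here is a minimal loading-and-generation sketch using the transformers GPTQ integration. It assumes a recent transformers release with the optimum and auto-gptq packages installed; the repo ID is TheBloke's Hugging Face identifier, the prompt string follows the ChatML template, and the generation settings are illustrative rather than recommended values.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/TinyLlama-1.1B-Chat-v0.3-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place the packed 4-bit weights on the available GPU
)

# ChatML prompt template used by the v0.3 chat fine-tune
prompt = (
    "<|im_start|>user\n"
    "Explain GPTQ quantization in one sentence.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```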

Core Capabilities

  • Efficient resource usage with multiple quantization options (branch selection is sketched after this list)
  • Plug-and-play compatibility with Llama ecosystem
  • Support for both GPU and CPU inference
  • Optimized for chat-based applications
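
The alternative group-size quantizations live on separate branches of the same repository and can be selected with the revision argument when loading. The branch name below is a hypothetical example following TheBloke's usual naming scheme; check the repository's branch list for the variants actually published.

```python
from transformers import AutoModelForCausalLM

# Hypothetical branch name in TheBloke's usual convention; verify against the repo.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/TinyLlama-1.1B-Chat-v0.3-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # smaller group size: better accuracy, more VRAM
    device_map="auto",
)
```

Smaller group sizes (32g) quantize weights in finer-grained blocks, which generally improves accuracy at the cost of higher VRAM use than the 128g default.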

Frequently Asked Questions

Q: What makes this model unique?

Its compact size and multiple quantization options make it ideal for resource-constrained environments while maintaining compatibility with the Llama ecosystem.

Q: What are the recommended use cases?

This model is well suited to chat applications where computational resources are limited, such as edge devices and other deployments that need a small memory footprint and efficient inference.
