TinyLlama-1.1B-Chat-v0.3-GPTQ

Maintained by: TheBloke

Property         Value
Parameter Count  1.1B
License          Apache 2.0
Quantization     GPTQ 4-bit
Model Size       262M params

What is TinyLlama-1.1B-Chat-v0.3-GPTQ?

TinyLlama-1.1B-Chat-v0.3-GPTQ is a GPTQ-quantized version of the TinyLlama chat model, created by TheBloke from Zhang Peiyuan's original work. It is published in several quantization variants, letting users trade output quality against VRAM and disk footprint for efficient deployment.

Implementation Details

The model uses the same architecture and tokenizer as Llama 2, so it works as a drop-in replacement in many existing Llama-based projects. The TinyLlama base model is pretrained toward a 3-trillion-token budget, and this chat variant was fine-tuned on the OpenAssistant dataset using the ChatML prompt format.

  • Multiple GPTQ quantization options (4-bit and 8-bit variants)
  • Supports various group sizes (32g, 64g, 128g) for different VRAM requirements
  • Compatible with AutoGPTQ and text-generation-inference
  • Uses the ChatML prompt template for structured interactions (see the loading sketch below)
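
As a concrete starting point, here is a minimal loading-and-generation sketch using the transformers GPTQ integration. It assumes a recent transformers release with the optimum and auto-gptq packages installed; the repo ID is TheBloke's Hugging Face identifier, the prompt string follows the ChatML template, and the generation settings are illustrative rather than recommended values.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/TinyLlama-1.1B-Chat-v0.3-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place the packed 4-bit weights on the available GPU
)

# ChatML prompt template used by the v0.3 chat fine-tune
prompt = (
    "<|im_start|>user\n"
    "Explain GPTQ quantization in one sentence.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```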

Core Capabilities

  • Efficient resource usage with multiple quantization options (branch selection is sketched after this list)
  • Plug-and-play compatibility with Llama ecosystem
  • Support for both GPU and CPU inference
  • Optimized for chat-based applications
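
The alternative group-size quantizations live on separate branches of the same repository and can be selected with the revision argument when loading. The branch name below is a hypothetical example following TheBloke's usual naming scheme; check the repository's branch list for the variants actually published.

```python
from transformers import AutoModelForCausalLM

# Hypothetical branch name in TheBloke's usual convention; verify against the repo.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/TinyLlama-1.1B-Chat-v0.3-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # smaller group size: better accuracy, more VRAM
    device_map="auto",
)
```

Smaller group sizes (32g) quantize weights in finer-grained blocks, which generally improves accuracy at the cost of higher VRAM use than the 128g default.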

Frequently Asked Questions

Q: What makes this model unique?

Its compact size and multiple quantization options make it ideal for resource-constrained environments while maintaining compatibility with the Llama ecosystem.

Q: What are the recommended use cases?

This model is well suited to chat applications where computational resources are limited, such as edge devices and other deployments that need a small memory footprint and efficient inference.
