Qwen-7B-Chat-Int4
| Property | Value |
|---|---|
| Parameter Count | 2.11B |
| Model Type | Quantized Chat Model |
| Architecture | 32 layers, 32 heads, 4096 d_model |
| License | Tongyi Qianwen License Agreement |
| Supported Languages | Chinese, English, Multi-lingual |
What is Qwen-7B-Chat-Int4?
Qwen-7B-Chat-Int4 is a 4-bit quantized version of the Qwen-7B-Chat model, designed for memory-efficient deployment. It is built on a Transformer architecture and was trained on diverse data including web text, professional books, and code repositories; the Int4 quantization substantially reduces memory usage while preserving most of the original model's capabilities.
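As a minimal usage sketch, the checkpoint can be loaded through Hugging Face transformers (the quantization is GPTQ-based, so the auto-gptq and optimum packages are expected to be installed); `model.chat` is a helper shipped with Qwen's remote code, and the prompt here is purely illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and 4-bit model; trust_remote_code pulls in Qwen's
# custom modeling and tokenizer code from the Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",          # place layers on available GPU(s)
    trust_remote_code=True,
).eval()

# model.chat() is Qwen's chat helper (provided by the remote code);
# it applies the chat template and tracks conversation history.
response, history = model.chat(tokenizer, "Hello, who are you?", history=None)
print(response)
```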
Implementation Details
The model implements RoPE (rotary position embedding), the SwiGLU activation function, and RMSNorm. It uses a vocabulary of approximately 150K tokens optimized for Chinese, English, and code, built on top of the cl100k_base BPE vocabulary used by GPT-4.
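As a quick check of that figure, the vocabulary size can be read straight off the tokenizer; the exact count is on the order of 151K entries (a sketch, assuming the standard transformers API):

```python
from transformers import AutoTokenizer

# Qwen ships a custom tiktoken-based tokenizer, hence trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True
)
print(len(tokenizer))  # roughly 151K tokens, matching the ~150K figure above
```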
- Architecture: 32 layers, 32 attention heads, 4096 hidden dimension (d_model)
- Context Length: 8192 tokens
- Memory Usage: 8.21 GB peak GPU memory when encoding 2048 tokens
- Inference Speed: 50.09 tokens/s when generating 2048 tokens with Flash Attention v2 (a measurement sketch follows this list)
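The figures above depend on the GPU, driver, and whether flash-attn is installed, so they are best treated as reference points. A minimal sketch for reproducing a peak-memory measurement with standard PyTorch utilities follows; the prompt and token budget are arbitrary choices for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", device_map="cuda:0", trust_remote_code=True
).eval()

# Reset the peak-memory counter, run a generation, then read the peak.
torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("Explain RMSNorm in one sentence.", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```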
Core Capabilities
- Strong performance in Chinese (59.7% on C-Eval) and English (55.8% on MMLU) evaluations
- Code generation capabilities with 37.2% Pass@1 on HumanEval
- Mathematical reasoning with 50.3% accuracy on GSM8K
- Tool usage and ReAct prompting support (see the prompt sketch after this list)
- Efficient inference with reduced memory footprint
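To make the tool-use point concrete, here is a hedged sketch of a ReAct-style prompt in the Thought/Action/Observation format that Qwen's examples follow; the tool name, description, and query are made up for illustration and this is not Qwen's exact template:

```python
# Illustrative ReAct-style prompt builder; the wording loosely follows
# Qwen's ReAct examples, but the tool and query here are hypothetical.
REACT_TEMPLATE = """Answer the following questions as best you can. You have access to the following tools:

{tool_name}: {tool_description} Parameters: {tool_params}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_name}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the final answer to the original question

Begin!

Question: {query}"""


def build_react_prompt(query: str) -> str:
    # Hypothetical single-tool setup for demonstration purposes.
    return REACT_TEMPLATE.format(
        tool_name="web_search",
        tool_description="Call this tool to search the web for up-to-date information.",
        tool_params='[{"name": "query", "type": "string"}]',
        query=query,
    )


print(build_react_prompt("What is the tallest mountain in Japan?"))
```

In practice, the model's emitted Action and Action Input are parsed, the tool is executed, and its result is appended as the Observation before the model is queried again.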
Frequently Asked Questions
Q: What makes this model unique?
The model combines efficient 4-bit quantization with strong multi-lingual and tool-use capabilities, making it well suited to deployment in resource-constrained environments with little loss in quality.
Q: What are the recommended use cases?
The model excels in multi-lingual chat applications, code generation, mathematical problem-solving, and tool-augmented tasks. It's particularly suitable for deployment scenarios where memory efficiency is crucial.