distilrubert-base-cased-conversational
Property | Value |
---|---|
Parameters | 135.4M |
Model Size | 517 MB |
Architecture | 6-layer, 768-hidden, 12-heads |
Research Paper | Knowledge Distillation of Russian Language Models |
Author | DeepPavlov |
What is distilrubert-base-cased-conversational?
This is a distilled version of the Russian BERT model, optimized for conversational tasks. It was trained on OpenSubtitles, the Dirty and Pikabu social media sites, and the social media segment of the Taiga corpus. The model is a more efficient version of its teacher, rubert-base-cased-conversational, while retaining most of the teacher's performance.
Implementation Details
The model was trained with three loss terms: a KL-divergence loss between teacher and student output distributions, a standard MLM loss for masked-token prediction, and a cosine embedding loss that aligns student hidden states with the teacher's. Training ran on 8 NVIDIA Tesla P100-SXM2.0 16 GB GPUs for approximately 100 hours. A sketch of the combined objective follows the list below.
- Achieves 2x faster CPU inference compared to the teacher model
- 24% smaller model size (517 MB vs. 679 MB)
- Maintains high performance while reducing computational requirements
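As a rough illustration of how the three loss terms can be combined, here is a minimal PyTorch sketch. It assumes teacher and student share the 768-dimensional hidden size; the function name and the loss weights (`kl_weight`, `mlm_weight`, `cos_weight`) are illustrative and are not taken from the DeepPavlov training code.

```python
# Minimal sketch of a KL + MLM + cosine-embedding distillation objective.
# All weights and names here are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden,
                      temperature=2.0,
                      kl_weight=1.0, mlm_weight=1.0, cos_weight=1.0):
    """Combine the three loss terms for one batch."""
    # 1) KL divergence between softened teacher and student distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # 2) Standard masked-language-modelling loss on the student's own logits.
    mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # positions that were not masked
    )

    # 3) Cosine embedding loss pulling student hidden states toward the teacher's.
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1),
        device=student_hidden.device,
    )
    cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    return kl_weight * kl + mlm_weight * mlm + cos_weight * cos
```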
Core Capabilities
- Conversational text understanding in Russian
- Efficient inference with reduced latency (0.3285s CPU, 0.0212s GPU)
- Higher throughput than the original model (52.25 samples/sec on GPU)
- Suitable for production deployment with a smaller resource footprint (see the loading sketch below)
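A minimal usage sketch follows, loading the model from the Hugging Face Hub with `transformers` and timing a single CPU forward pass. The Hub ID is assumed from the card's name and author, and measured latency will vary with hardware and input length, so it will not exactly match the figures above.

```python
# Load the model and time one CPU forward pass (timings are hardware-dependent).
import time
import torch
from transformers import AutoModel, AutoTokenizer

# Hub ID assumed from the model name and author on this card.
name = "DeepPavlov/distilrubert-base-cased-conversational"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

inputs = tokenizer("Привет! Как дела?", return_tensors="pt")

with torch.no_grad():
    start = time.perf_counter()
    outputs = model(**inputs)
    elapsed = time.perf_counter() - start

# last_hidden_state: (batch, seq_len, 768) contextual embeddings
print(outputs.last_hidden_state.shape, f"{elapsed:.4f}s")
```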
Frequently Asked Questions
Q: What makes this model unique?
This model pairs the BERT architecture with the efficiency gains of knowledge distillation and is optimized specifically for Russian conversational text. It roughly doubles CPU inference speed and cuts model size by about 24% relative to its teacher while retaining strong performance.
Q: What are the recommended use cases?
The model is particularly well-suited for Russian language processing tasks in conversational contexts, such as chatbots, dialogue systems, and social media analysis. Its reduced size and faster inference make it ideal for production environments with resource constraints.