distilrubert-base-cased-conversational
Property | Value |
---|---|
Parameters | 135.4M |
Model Size | 517 MB |
Architecture | 6-layer, 768-hidden, 12-heads |
Research Paper | Knowledge Distillation of Russian Language Models |
Author | DeepPavlov |
What is distilrubert-base-cased-conversational?
This is a distilled version of the Russian BERT model, optimized for conversational tasks. It was trained on OpenSubtitles, the Dirty and Pikabu social media sites, and the social media segment of the Taiga corpus. The model is a more efficient version of its teacher, rubert-base-cased-conversational, while retaining most of the teacher's performance.
Implementation Details
The model was trained with three loss terms: a KL-divergence loss between teacher and student output distributions, a standard MLM loss for masked-token prediction, and a cosine embedding loss that aligns student hidden states with the teacher's. Training ran on 8 NVIDIA Tesla P100-SXM2.0 16 GB GPUs for approximately 100 hours. A sketch of the combined objective follows the list below.
- Achieves 2x faster CPU inference compared to the teacher model
- 24% smaller model size (517 MB vs. 679 MB)
- Maintains high performance while reducing computational requirements
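As a rough illustration of how the three loss terms can be combined, here is a minimal PyTorch sketch. It assumes teacher and student share the 768-dimensional hidden size; the function name and the loss weights (`kl_weight`, `mlm_weight`, `cos_weight`) are illustrative and are not taken from the DeepPavlov training code.

```python
# Minimal sketch of a KL + MLM + cosine-embedding distillation objective.
# All weights and names here are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden,
                      temperature=2.0,
                      kl_weight=1.0, mlm_weight=1.0, cos_weight=1.0):
    """Combine the three loss terms for one batch."""
    # 1) KL divergence between softened teacher and student distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # 2) Standard masked-language-modelling loss on the student's own logits.
    mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # positions that were not masked
    )

    # 3) Cosine embedding loss pulling student hidden states toward the teacher's.
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1),
        device=student_hidden.device,
    )
    cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    return kl_weight * kl + mlm_weight * mlm + cos_weight * cos
```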
Core Capabilities
- Conversational text understanding in Russian
- Efficient inference with reduced latency (0.3285s CPU, 0.0212s GPU)
- Higher throughput than the original model (52.25 samples/sec on GPU)
- Suitable for production deployment with a smaller resource footprint (see the loading sketch below)
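A minimal usage sketch follows, loading the model from the Hugging Face Hub with `transformers` and timing a single CPU forward pass. The Hub ID is assumed from the card's name and author, and measured latency will vary with hardware and input length, so it will not exactly match the figures above.

```python
# Load the model and time one CPU forward pass (timings are hardware-dependent).
import time
import torch
from transformers import AutoModel, AutoTokenizer

# Hub ID assumed from the model name and author on this card.
name = "DeepPavlov/distilrubert-base-cased-conversational"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

inputs = tokenizer("Привет! Как дела?", return_tensors="pt")

with torch.no_grad():
    start = time.perf_counter()
    outputs = model(**inputs)
    elapsed = time.perf_counter() - start

# last_hidden_state: (batch, seq_len, 768) contextual embeddings
print(outputs.last_hidden_state.shape, f"{elapsed:.4f}s")
```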
Frequently Asked Questions
Q: What makes this model unique?
This model pairs the BERT architecture with the efficiency gains of knowledge distillation and is optimized specifically for Russian conversational text. It roughly doubles CPU inference speed and cuts model size by about 24% relative to its teacher while retaining strong performance.
Q: What are the recommended use cases?
The model is particularly well-suited for Russian language processing tasks in conversational contexts, such as chatbots, dialogue systems, and social media analysis. Its reduced size and faster inference make it ideal for production environments with resource constraints.