DistilBERT Base Cased
| Property | Value |
|---|---|
| Parameter Count | 65.8M |
| License | Apache 2.0 |
| Paper | DistilBERT, a distilled version of BERT (arXiv:1910.01108) |
| Training Data | BookCorpus + English Wikipedia |
| Framework Support | PyTorch, TensorFlow, ONNX |
What is distilbert-base-cased?
DistilBERT base cased is a compact and efficient transformer model that serves as a distilled version of BERT. Created through knowledge distillation, it retains 97% of BERT's language understanding capabilities while being 40% smaller and 60% faster. This cased version maintains sensitivity to capitalization, distinguishing between words like "english" and "English".
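As a quick illustration of the model's masked language modeling behavior, the snippet below uses the Hugging Face transformers fill-mask pipeline; this is a minimal usage sketch rather than part of the original card, and the example prompt is arbitrary.

```python
from transformers import pipeline

# Load the pretrained checkpoint for masked language modeling.
# "distilbert-base-cased" is the public Hugging Face model ID.
unmasker = pipeline("fill-mask", model="distilbert-base-cased")

# The cased tokenizer treats "English" and "english" as different tokens,
# so capitalization in the prompt influences the predictions.
predictions = unmasker("Hello, I'm a [MASK] model.")
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```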
Implementation Details
The model was trained with a triple objective: a distillation loss (matching BERT's output probabilities), a masked language modeling loss, and a cosine embedding loss that aligns the student's hidden states with the teacher's (a minimal sketch of this combined objective follows the list below). It uses the same general architecture as BERT but with half the number of layers, making it more suitable for production environments with resource constraints.
- Architecture: 6-layer Transformer encoder distilled from 12-layer BERT base
- Training Duration: 90 hours on 8× 16GB V100 GPUs
- Vocabulary Size: 28,996 WordPiece tokens (shared with bert-base-cased)
- Maximum Sequence Length: 512 tokens
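To make the triple objective concrete, here is a minimal PyTorch sketch of how the distillation, masked language modeling, and cosine embedding losses can be combined. The function name, loss weights, and temperature are illustrative assumptions, not the published training configuration.

```python
import torch
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, labels,
                           student_hidden, teacher_hidden,
                           temperature=2.0, alpha_ce=1.0, alpha_mlm=1.0, alpha_cos=1.0):
    """Illustrative combination of the three DistilBERT training losses.
    Weights and temperature are placeholders, not the original values."""
    # 1) Distillation loss: KL divergence between softened student and
    #    teacher probability distributions over the vocabulary.
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # 2) Masked language modeling loss: standard cross-entropy against the
    #    masked-token labels (-100 marks positions that are not masked).
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # 3) Cosine embedding loss: align the directions of student and teacher
    #    hidden states.
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1),
        device=student_hidden.device,
    )
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```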
Core Capabilities
- Masked Language Modeling
- Sequence Classification
- Token Classification
- Question Answering (when fine-tuned)
- Bidirectional Context Understanding
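Each of these capabilities maps to a task-specific head in the transformers library. The sketch below shows how the same pretrained encoder could be loaded under different heads for fine-tuning; the label counts are arbitrary examples, and the heads themselves are randomly initialized until trained on a downstream dataset.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoModelForQuestionAnswering,
)

model_id = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Each head below sits on top of the same pretrained encoder and is meant
# to be fine-tuned on task-specific data.
classifier = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
tagger = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=9)
qa_model = AutoModelForQuestionAnswering.from_pretrained(model_id)

inputs = tokenizer("DistilBERT keeps case information.", return_tensors="pt")
logits = classifier(**inputs).logits  # shape: (1, num_labels)
```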
Frequently Asked Questions
Q: What makes this model unique?
DistilBERT stands out for its knowledge distillation training, which produces a smaller, faster model while retaining over 95% of BERT's performance on the GLUE benchmark with only 65.8M parameters.
Q: What are the recommended use cases?
The model excels in tasks requiring whole-sentence understanding, such as text classification, named entity recognition, and question answering. It's particularly suitable for production environments where computational resources are limited but high-quality language understanding is required.
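For question answering, one option is the publicly available distilbert-base-cased-distilled-squad checkpoint, which builds on this model and is fine-tuned on SQuAD; the snippet below is a usage sketch, and you would swap in your own checkpoint if you fine-tuned distilbert-base-cased yourself.

```python
from transformers import pipeline

# A SQuAD-fine-tuned DistilBERT checkpoint for extractive question answering.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="How much smaller is DistilBERT than BERT?",
    context="DistilBERT is 40% smaller and 60% faster than BERT while "
            "retaining 97% of its language understanding capabilities.",
)
print(result["answer"], result["score"])
```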