distilbert-base-cased

Maintained by: distilbert

DistilBERT Base Cased

  • Parameter Count: 65.8M
  • License: Apache 2.0
  • Paper: DistilBERT, a distilled version of BERT (arXiv:1910.01108)
  • Training Data: BookCorpus + Wikipedia
  • Framework Support: PyTorch, TensorFlow, ONNX

What is distilbert-base-cased?

DistilBERT base cased is a compact and efficient transformer model that serves as a distilled version of BERT. Created through knowledge distillation, it retains 97% of BERT's language understanding capabilities while being 40% smaller and 60% faster. This cased version maintains sensitivity to capitalization, distinguishing between words like "english" and "English".
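A quick way to try the checkpoint is the Hugging Face transformers fill-mask pipeline. The snippet below is a minimal sketch; the example sentence is purely illustrative.

```python
from transformers import pipeline

# Load the cased checkpoint; the tokenizer preserves capitalization,
# so "English" and "english" map to different tokens.
fill_mask = pipeline("fill-mask", model="distilbert-base-cased")

# [MASK] is the mask placeholder used by this checkpoint's tokenizer.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```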

Implementation Details

The model was trained with a triple loss: a distillation loss that matches the teacher BERT's output probabilities, a masked language modeling loss, and a cosine embedding loss that aligns the student's hidden states with the teacher's (see the sketch after the list below). It keeps BERT's general architecture but with half the transformer layers, making it more suitable for production environments with resource constraints.

  • Architecture: 6-layer Transformer encoder distilled from BERT base
  • Training Duration: roughly 90 hours on 8× 16 GB V100 GPUs
  • Vocabulary Size: 30,000 tokens
  • Maximum Sequence Length: 512 tokens
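The sketch below shows how such a triple loss could be combined in PyTorch. The function name, loss weights, and temperature are illustrative assumptions, not the exact recipe used to train this checkpoint.

```python
import torch
import torch.nn.functional as F


def distillation_objective(student_logits, teacher_logits,
                           student_hidden, teacher_hidden,
                           labels, temperature=2.0,
                           alpha_ce=5.0, alpha_mlm=2.0, alpha_cos=1.0):
    """Illustrative combination of the three training losses described above."""
    # 1. Distillation loss: soften both distributions with a temperature
    #    and pull the student toward the teacher (KL divergence).
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # 2. Masked language modeling loss: cross-entropy on masked positions
    #    (positions labeled -100 are ignored, as in standard MLM training).
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # 3. Cosine embedding loss: align student and teacher hidden states
    #    (padding positions are ignored here for brevity).
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1),
        device=student_hidden.device,
    )
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```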

Core Capabilities

  • Masked Language Modeling
  • Sequence Classification
  • Token Classification
  • Question Answering (when fine-tuned)
  • Bidirectional Context Understanding
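These capabilities map onto the standard transformers auto classes. The sketch below loads the pretrained masked-language-modeling head plus two task heads that would still need fine-tuning; the num_labels values are illustrative assumptions for hypothetical datasets.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
)

checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Pretrained masked-language-modeling head (usable out of the box).
mlm_model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Task-specific heads are randomly initialized and require fine-tuning;
# num_labels is a placeholder choice, not a property of the checkpoint.
clf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
ner_model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=9)
```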

Frequently Asked Questions

Q: What makes this model unique?

DistilBERT's distinguishing feature is its knowledge distillation training, which produces a smaller, faster model that retains most of BERT's accuracy: it reaches 95%+ of BERT's score on the GLUE benchmark with only 65.8M parameters.

Q: What are the recommended use cases?

The model excels in tasks requiring whole-sentence understanding, such as text classification, named entity recognition, and question answering. It's particularly suitable for production environments where computational resources are limited but high-quality language understanding is required.
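For question answering, a sketch assuming the distilbert-base-cased-distilled-squad checkpoint (a SQuAD-fine-tuned variant of this model published on the Hugging Face Hub) might look like this; the question and context are illustrative.

```python
from transformers import pipeline

# SQuAD-fine-tuned variant of the cased checkpoint.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What does knowledge distillation reduce?",
    context=(
        "Knowledge distillation compresses a large teacher model into a "
        "smaller student model, reducing parameter count and inference latency."
    ),
)
print(result["answer"], result["score"])
```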
