DeBERTa V3 XSmall
| Property | Value |
|---|---|
| Author | Microsoft |
| Parameter Count | 22M (backbone) + 48M (embedding) |
| License | MIT |
| Paper | DeBERTaV3 Paper |
What is deberta-v3-xsmall?
DeBERTa-v3-xsmall is a compact yet powerful language model that implements Microsoft's DeBERTa architecture with ELECTRA-style pre-training. It features 12 layers with a hidden size of 384 and contains 22M backbone parameters, plus a 128K token vocabulary contributing 48M parameters in the embedding layer. Despite its smaller size, it achieves impressive performance on key NLU benchmarks.
Implementation Details
The model utilizes gradient-disentangled embedding sharing and an enhanced mask decoder, building on previous DeBERTa versions. It was trained on 160GB of data, the same corpus size as DeBERTa V2, and gets strong benchmark results out of a small parameter budget. The key figures are listed below; the configuration check after the list verifies the architectural numbers.
- 12-layer architecture with 384 hidden size
- 128K token vocabulary
- ELECTRA-style pre-training with gradient-disentangled embedding sharing
- Achieves 84.8/82.0 F1/EM on the SQuAD 2.0 dev set
- 88.1/88.3 accuracy on MNLI-m/mm (dev)
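The architectural figures above can be double-checked directly from the published configuration. A minimal sketch using the Hugging Face `transformers` library (assumes `transformers` is installed and the checkpoint id is `microsoft/deberta-v3-xsmall`):

```python
from transformers import AutoConfig

# Fetch only the model configuration (no weights) and print the
# architecture details quoted in the list above.
config = AutoConfig.from_pretrained("microsoft/deberta-v3-xsmall")
print(config.num_hidden_layers)  # 12 layers
print(config.hidden_size)        # 384 hidden size
print(config.vocab_size)         # ~128K token vocabulary
```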
Core Capabilities
- Fill-mask prediction tasks
- Natural Language Understanding (NLU)
- Question Answering
- Natural Language Inference
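For plain feature extraction, the backbone can be loaded without any task head. A minimal sketch (assumes `transformers`, `torch`, and `sentencepiece` are installed; the example sentence is arbitrary):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Tokenizer (SentencePiece, 128K vocabulary) and backbone encoder.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")

# Encode one sentence and pull out contextual token embeddings.
inputs = tokenizer("DeBERTa-v3-xsmall is a compact encoder.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 384)
```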
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for delivering strong NLU results from only 22M backbone parameters: on MNLI and SQuAD 2.0 it matches or exceeds much larger base-size encoders such as RoBERTa-base, which makes it highly efficient for deployment.
Q: What are the recommended use cases?
The model is particularly well-suited for resource-constrained environments that require strong NLU capabilities, including text classification, question answering, and natural language inference tasks.
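As a concrete starting point for such tasks, the sketch below attaches an untrained classification head for a three-way NLI-style problem; the head must still be fine-tuned on labeled data before the logits mean anything (the label count and example sentences are illustrative assumptions):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "microsoft/deberta-v3-xsmall"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Adds a randomly initialized 3-way classification head on top of the backbone.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Sentence-pair input, as used in natural language inference.
batch = tokenizer(
    "A man is playing a guitar.",
    "A person is making music.",
    return_tensors="pt",
)
logits = model(**batch).logits  # meaningless until the head is fine-tuned
print(logits.shape)  # torch.Size([1, 3])
```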