DeBERTa V3 XSmall
| Property | Value |
|---|---|
| Author | Microsoft |
| Parameter Count | 22M (backbone) + 48M (embedding) |
| License | MIT |
| Paper | DeBERTaV3 Paper |
What is deberta-v3-xsmall?
DeBERTa-v3-xsmall is a compact yet powerful language model that implements Microsoft's DeBERTa architecture with ELECTRA-style pre-training. It features 12 layers with a hidden size of 384 and contains 22M backbone parameters, plus a 128K token vocabulary contributing 48M parameters in the embedding layer. Despite its smaller size, it achieves impressive performance on key NLU benchmarks.
Implementation Details
The model utilizes gradient-disentangled embedding sharing and an enhanced mask decoder, building on previous DeBERTa versions. It was trained on 160GB of data, the same corpus size as DeBERTa V2, and gets strong benchmark results out of a small parameter budget. The key figures are listed below; the configuration check after the list verifies the architectural numbers.
- 12-layer architecture with 384 hidden size
- 128K token vocabulary
- ELECTRA-style pre-training with gradient-disentangled embedding sharing
- Achieves 84.8/82.0 F1/EM on the SQuAD 2.0 dev set
- 88.1/88.3 accuracy on MNLI-m/mm (dev)
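The architectural figures above can be double-checked directly from the published configuration. A minimal sketch using the Hugging Face `transformers` library (assumes `transformers` is installed and the checkpoint id is `microsoft/deberta-v3-xsmall`):

```python
from transformers import AutoConfig

# Fetch only the model configuration (no weights) and print the
# architecture details quoted in the list above.
config = AutoConfig.from_pretrained("microsoft/deberta-v3-xsmall")
print(config.num_hidden_layers)  # 12 layers
print(config.hidden_size)        # 384 hidden size
print(config.vocab_size)         # ~128K token vocabulary
```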
Core Capabilities
- Fill-mask prediction tasks
- Natural Language Understanding (NLU)
- Question Answering
- Natural Language Inference
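For plain feature extraction, the backbone can be loaded without any task head. A minimal sketch (assumes `transformers`, `torch`, and `sentencepiece` are installed; the example sentence is arbitrary):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Tokenizer (SentencePiece, 128K vocabulary) and backbone encoder.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")

# Encode one sentence and pull out contextual token embeddings.
inputs = tokenizer("DeBERTa-v3-xsmall is a compact encoder.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 384)
```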
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for delivering strong NLU results from only 22M backbone parameters: on MNLI and SQuAD 2.0 it matches or exceeds much larger base-size encoders such as RoBERTa-base, which makes it highly efficient for deployment.
Q: What are the recommended use cases?
The model is particularly well-suited for resource-constrained environments that require strong NLU capabilities, including text classification, question answering, and natural language inference tasks.
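As a concrete starting point for such tasks, the sketch below attaches an untrained classification head for a three-way NLI-style problem; the head must still be fine-tuned on labeled data before the logits mean anything (the label count and example sentences are illustrative assumptions):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "microsoft/deberta-v3-xsmall"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Adds a randomly initialized 3-way classification head on top of the backbone.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Sentence-pair input, as used in natural language inference.
batch = tokenizer(
    "A man is playing a guitar.",
    "A person is making music.",
    return_tensors="pt",
)
logits = model(**batch).logits  # meaningless until the head is fine-tuned
print(logits.shape)  # torch.Size([1, 3])
```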