# vietnamese-embedding
| Property | Value |
|---|---|
| Parameter Count | 135M |
| License | Apache 2.0 |
| Primary Paper | SimCSE Paper |
| Tensor Type | F32 |
## What is vietnamese-embedding?
vietnamese-embedding is a state-of-the-art sentence embedding model designed specifically for Vietnamese. Built on PhoBERT's RoBERTa architecture, it maps Vietnamese text to 768-dimensional vectors that capture its semantic meaning. The model achieves 84.87% on the STSB benchmark, outperforming other Vietnamese embedding models.
## Implementation Details
The model follows a four-stage training process based on the SimCSE approach with supervised contrastive learning. It uses a Transformer architecture with mean pooling and has been fine-tuned on multiple Vietnamese datasets, including ViNLI-SimCSE-supervised and XNLI-vn.
- Pre-trained base: PhoBERT (RoBERTa architecture)
- Embedding dimension: 768
- Maximum sequence length: 512
- Training methodology: Multi-stage fine-tuning with triplet loss
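The mean-pooling step mentioned above can be sketched in isolation: the encoder's per-token vectors are averaged, with padding positions masked out, to yield a single 768-dimensional sentence vector. A minimal numpy sketch; the token embeddings and mask below are toy data, not real PhoBERT outputs.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions.

    token_embeddings: (seq_len, dim) encoder outputs
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)       # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)     # sum over real tokens only
    count = np.clip(mask.sum(), 1e-9, None)            # avoid division by zero
    return summed / count

# Toy example: 4 token vectors (last one is padding), dim=768 as in the model
tokens = np.random.rand(4, 768)
mask = np.array([1, 1, 1, 0])
sentence_vec = mean_pool(tokens, mask)  # shape (768,)
```

In the actual model this pooling is applied to the encoder's final hidden states; the toy arrays here just make the masking behavior explicit.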
## Core Capabilities
- Semantic sentence similarity computation
- High-quality Vietnamese text embeddings
- Supports various NLP tasks including clustering and semantic search
- Demonstrated superior performance across multiple STS benchmarks
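Sentence similarity with embeddings of this kind is conventionally computed as the cosine similarity between the two vectors. A small self-contained sketch, with short toy vectors standing in for real 768-dimensional model outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for sentence embeddings
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([1.0, 0.0, 1.0])
v3 = np.array([0.0, 1.0, 0.0])

print(cosine_similarity(v1, v2))  # identical vectors -> 1.0
print(cosine_similarity(v1, v3))  # orthogonal vectors -> 0.0
```

Scores near 1.0 indicate semantically similar sentences; scores near 0.0 indicate unrelated ones.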
## Frequently Asked Questions
**Q: What makes this model unique?**
The model's distinctive feature is its four-stage training process specifically optimized for Vietnamese language understanding, resulting in state-of-the-art performance across multiple semantic textual similarity benchmarks.
**Q: What are the recommended use cases?**
The model is ideal for semantic search, text clustering, sentence similarity comparison, and other NLP tasks requiring deep semantic understanding of Vietnamese text.
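In a semantic-search setup, each document is embedded once and a query embedding is scored against the whole corpus; the highest cosine scores are the best matches. A minimal numpy sketch of that retrieval step, using toy embeddings in place of real model outputs:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k documents most similar to the query (cosine)."""
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q                 # cosine similarity per document
    return np.argsort(-scores)[:k]   # indices sorted by descending score

# Toy corpus: 4 "document" embeddings, dim=3 for readability
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])
hits = top_k(query, corpus, k=2)  # documents 0 and 1 are closest to the query
```

With real usage, the corpus matrix would hold 768-dimensional vectors produced by the model, and the same ranking logic applies unchanged.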