vietnamese-embedding

Maintained by: dangvantuan


Property         Value
Parameter Count  135M
License          Apache 2.0
Primary Paper    SimCSE Paper
Tensor Type      F32

What is vietnamese-embedding?

vietnamese-embedding is a state-of-the-art sentence embedding model designed specifically for the Vietnamese language. Built on PhoBERT's RoBERTa architecture, it generates 768-dimensional vectors that capture the semantic meaning of Vietnamese text. The model outperforms other Vietnamese embedding models, scoring 84.87% on the STSB benchmark.

Implementation Details

The model is trained with a four-stage process built on the SimCSE approach with supervised contrastive learning. It uses a Transformer encoder with mean pooling and has been fine-tuned on multiple Vietnamese datasets, including ViNLI-SimCSE-supervised and XNLI-vn.

  • Pre-trained base: PhoBERT (RoBERTa architecture)
  • Embedding dimension: 768
  • Maximum sequence length: 512
  • Training methodology: Multi-stage fine-tuning with triplet loss
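The mean pooling mentioned above can be sketched as follows. This is a minimal illustration, not the library's internal code; the token embeddings and attention mask are synthetic stand-ins for actual PhoBERT encoder outputs:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings into one sentence vector, ignoring padding.

    token_embeddings: (seq_len, dim) per-token vectors from the encoder
    attention_mask:   (seq_len,) 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum over real tokens only
    count = mask.sum()                              # number of real tokens
    return summed / np.maximum(count, 1e-9)         # (dim,) sentence embedding

# Synthetic example: 4 tokens (last one padding), 768-dim embeddings
tokens = np.random.rand(4, 768)
mask = np.array([1, 1, 1, 0])
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec.shape)  # (768,)
```

Masking before averaging ensures padding positions do not dilute the sentence representation.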

Core Capabilities

  • Semantic sentence similarity computation
  • High-quality Vietnamese text embeddings
  • Supports various NLP tasks including clustering and semantic search
  • Demonstrated superior performance across multiple STS benchmarks
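Once sentences are embedded, similarity and search reduce to vector comparisons. The sketch below uses cosine similarity over toy 3-dimensional vectors standing in for the model's 768-dimensional outputs (the vectors are illustrative, not real model output):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec: np.ndarray, doc_vecs: list) -> list:
    """Rank document embeddings by similarity to the query embedding."""
    scores = [cosine_sim(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for 768-d sentence embeddings
query = np.array([1.0, 0.0, 0.0])
docs = [np.array([0.0, 1.0, 0.0]),   # orthogonal -> low similarity
        np.array([0.9, 0.1, 0.0])]   # nearly parallel -> high similarity
print(search(query, docs))  # [1, 0]
```

The same pattern scales to clustering: feed the embedding matrix to any vector-based clustering algorithm (e.g. k-means).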

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its four-stage training process specifically optimized for Vietnamese language understanding, resulting in state-of-the-art performance across multiple semantic textual similarity benchmarks.

Q: What are the recommended use cases?

The model is ideal for semantic search, text clustering, sentence similarity comparison, and other NLP tasks requiring deep semantic understanding of Vietnamese text.
