vietnamese-embedding

Maintained by: dangvantuan


Property         Value
Parameter Count  135M
License          Apache 2.0
Primary Paper    SimCSE Paper
Tensor Type      F32

What is vietnamese-embedding?

vietnamese-embedding is a state-of-the-art sentence embedding model designed specifically for the Vietnamese language. Built on PhoBERT's RoBERTa architecture, it generates 768-dimensional vectors that capture the semantic meaning of Vietnamese text. The model outperforms other Vietnamese embedding models, scoring 84.87% on the STSB benchmark.

Implementation Details

The model is trained with a four-stage process built on the SimCSE approach with supervised contrastive learning. It uses a Transformer encoder with mean pooling and has been fine-tuned on multiple Vietnamese datasets, including ViNLI-SimCSE-supervised and XNLI-vn.

  • Pre-trained base: PhoBERT (RoBERTa architecture)
  • Embedding dimension: 768
  • Maximum sequence length: 512
  • Training methodology: Multi-stage fine-tuning with triplet loss
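The mean pooling mentioned above can be sketched as follows. This is a minimal illustration, not the library's internal code; the token embeddings and attention mask are synthetic stand-ins for actual PhoBERT encoder outputs:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings into one sentence vector, ignoring padding.

    token_embeddings: (seq_len, dim) per-token vectors from the encoder
    attention_mask:   (seq_len,) 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum over real tokens only
    count = mask.sum()                              # number of real tokens
    return summed / np.maximum(count, 1e-9)         # (dim,) sentence embedding

# Synthetic example: 4 tokens (last one padding), 768-dim embeddings
tokens = np.random.rand(4, 768)
mask = np.array([1, 1, 1, 0])
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec.shape)  # (768,)
```

Masking before averaging ensures padding positions do not dilute the sentence representation.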

Core Capabilities

  • Semantic sentence similarity computation
  • High-quality Vietnamese text embeddings
  • Supports various NLP tasks including clustering and semantic search
  • Demonstrated superior performance across multiple STS benchmarks
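Once sentences are embedded, similarity and search reduce to vector comparisons. The sketch below uses cosine similarity over toy 3-dimensional vectors standing in for the model's 768-dimensional outputs (the vectors are illustrative, not real model output):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec: np.ndarray, doc_vecs: list) -> list:
    """Rank document embeddings by similarity to the query embedding."""
    scores = [cosine_sim(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for 768-d sentence embeddings
query = np.array([1.0, 0.0, 0.0])
docs = [np.array([0.0, 1.0, 0.0]),   # orthogonal -> low similarity
        np.array([0.9, 0.1, 0.0])]   # nearly parallel -> high similarity
print(search(query, docs))  # [1, 0]
```

The same pattern scales to clustering: feed the embedding matrix to any vector-based clustering algorithm (e.g. k-means).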

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its four-stage training process specifically optimized for Vietnamese language understanding, resulting in state-of-the-art performance across multiple semantic textual similarity benchmarks.

Q: What are the recommended use cases?

The model is ideal for semantic search, text clustering, sentence similarity comparison, and other NLP tasks requiring deep semantic understanding of Vietnamese text.
