vietnamese-bi-encoder

Maintained By
bkai-foundation-models

Property          Value
Parameter Count   135M
License           Apache 2.0
Architecture      PhoBERT-base-v2 backbone
Tensor Type       F32

What is vietnamese-bi-encoder?

vietnamese-bi-encoder is a sentence-transformer model for Vietnamese. It maps sentences and paragraphs to a 768-dimensional dense vector space, which supports semantic search and clustering. Built on the PhoBERT-base-v2 backbone, it was trained on Vietnamese translations of MS MARCO and SQuAD v2, together with legal text data.

Implementation Details

The model combines a RoBERTa-based transformer (PhoBERT-base-v2) with mean pooling over token embeddings. It was trained with Multiple Negatives Ranking Loss, using cosine similarity as the similarity function and a scale of 20.0. Training ran for 15 epochs with the AdamW optimizer, a learning rate of 2e-05, and 1000 warmup steps.
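To make the training objective concrete, here is a minimal NumPy sketch of a Multiple Negatives Ranking Loss with cosine similarity and a scale of 20.0. It is an illustration under assumptions, not the training code used for this model: the function name, the random toy vectors, and the batch size are all hypothetical. For each query, the matching passage is treated as the positive and every other passage in the batch as a negative.

```python
import numpy as np


def multiple_negatives_ranking_loss(queries, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives.

    queries, positives: arrays of shape (batch, dim); row i of `positives`
    is the matching passage for row i of `queries`.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    # Entry [i, j] compares query i with passage j, multiplied by the scale.
    logits = scale * (q @ p.T)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The correct passage for query i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Hypothetical toy batch: random vectors stand in for real embeddings.
rng = np.random.default_rng(0)
passages = rng.normal(size=(4, 768))
aligned_queries = passages + 0.01 * rng.normal(size=(4, 768))
shuffled_queries = aligned_queries[::-1]  # every query paired with a wrong passage

loss_aligned = multiple_negatives_ranking_loss(aligned_queries, passages)
loss_shuffled = multiple_negatives_ranking_loss(shuffled_queries, passages)
```

The loss is small when each query is closest to its own passage and large when the pairing is wrong, which is the signal that pulls matching pairs together during training.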

  • Maximum sequence length: 256 tokens
  • Word embedding dimension: 768
  • Pooling: mean over token embeddings
  • Requires word-segmented input for optimal performance
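The word-segmentation requirement can be sketched as follows. This example assumes the model is published on the Hugging Face Hub as `bkai-foundation-models/vietnamese-bi-encoder` and that the `sentence-transformers` and `pyvi` packages are installed; the `embed` helper and its parameters are hypothetical names introduced for illustration. Any Vietnamese word segmenter could be substituted for `pyvi`.

```python
# Assumed Hub ID, built from the maintainer and model name given above.
MODEL_ID = "bkai-foundation-models/vietnamese-bi-encoder"


def embed(sentences, model=None, segment=None):
    """Word-segment Vietnamese sentences, then encode them to dense vectors.

    `model` and `segment` can be passed in to reuse an already loaded model
    or to swap in a different word segmenter.
    """
    if segment is None:
        # pyvi joins the syllables of multi-syllable words with underscores.
        from pyvi.ViTokenizer import tokenize as segment
    if model is None:
        # Loaded lazily; the first call downloads the weights.
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer(MODEL_ID)

    segmented = [segment(s) for s in sentences]
    return model.encode(segmented)
```

With both packages installed, a call such as `embed(["Cô ấy là một người vui tính."])` would be expected to return one 768-dimensional vector per input sentence. Loading the model once and passing it in through `model` avoids reloading the weights on every call.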

Core Capabilities

  • Sentence similarity computation
  • Semantic search functionality
  • Text clustering
  • Dense vector representation generation
  • Achieves 73.28% Acc@1 on legal text retrieval tasks
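
The page does not define Acc@1. The sketch below assumes the common retrieval definition of accuracy at k: the share of queries for which at least one relevant document appears among the top k results. The function name and the toy document IDs are hypothetical.

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant document in the top k.

    ranked_ids: one ranked list of document IDs per query.
    relevant_ids: one set of relevant document IDs per query.
    """
    hits = sum(
        1
        for ranking, relevant in zip(ranked_ids, relevant_ids)
        if any(doc_id in relevant for doc_id in ranking[:k])
    )
    return hits / len(ranked_ids)


# Toy data: three queries, each with a ranked list of retrieved documents.
rankings = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant = [{"d3"}, {"d4"}, {"d0"}]

acc_at_1 = accuracy_at_k(rankings, relevant, 1)  # only the first query hits
acc_at_3 = accuracy_at_k(rankings, relevant, 3)  # the first two queries hit
```

Under this definition, Acc@k can only stay the same or rise as k grows, which is consistent with Acc@10 being higher than Acc@1.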

Frequently Asked Questions

Q: What makes this model unique?

The model is specialized for Vietnamese and reports 73.28% Acc@1 and 93.59% Acc@10 on legal text retrieval. Its training data combines Vietnamese translations of international datasets (MS MARCO, SQuAD v2) with Vietnamese legal texts.

Q: What are the recommended use cases?

The model suits Vietnamese applications that need semantic similarity, such as document retrieval systems, semantic search engines, and text clustering. It is particularly effective for legal text processing, and it also applies to general semantic analysis tasks in Vietnamese.
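
As a rough illustration of the retrieval use case, the following sketch ranks documents by cosine similarity to a query. It assumes the embeddings have already been computed; the three-dimensional toy vectors stand in for the model's 768-dimensional output, and `top_k` is a hypothetical helper name.

```python
import numpy as np


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, cosine similarity) pairs for the k most similar documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                    # cosine similarity of each document to the query
    order = np.argsort(-scores)[:k]   # indices of the highest scores first
    return [(int(i), float(scores[i])) for i in order]


# Toy 3-dimensional stand-ins for real sentence embeddings.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query = np.array([1.0, 0.1, 0.0])

results = top_k(query, docs, k=2)
```

For larger collections, the same scoring is usually delegated to a vector index, but a brute-force matrix product like this is often enough for a few thousand documents.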
