roberta-ko-small-tsdae
| Property | Value |
|---|---|
| License | MIT |
| Language | Korean |
| Paper | TSDAE Paper |
| Vector Dimension | 256 |
What is roberta-ko-small-tsdae?
roberta-ko-small-tsdae is a Korean language model based on the RoBERTa architecture and pre-trained with the TSDAE (Transformer-based Sequential Denoising Auto-Encoder) approach. It maps Korean sentences and paragraphs into a 256-dimensional dense vector space, making it well suited to semantic search and clustering tasks.
Implementation Details
The model is built with the sentence-transformers framework and shares its architecture with lassl/roberta-ko-small, while using a different tokenizer. It applies CLS pooling to produce sentence embeddings and reaches strong performance on the KLUE STS benchmark without any fine-tuning.
- Achieves a 0.8735 cosine Pearson correlation on the KLUE STS train set without fine-tuning
- Supports both the sentence-transformers and Hugging Face Transformers APIs (see the usage sketch after this list)
- Handles padding and truncation out of the box
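A minimal usage sketch of both routes follows. The Hub repository id (smartmind/roberta-ko-small-tsdae), the example sentences, and the variable names are assumptions; only the two loading paths and the CLS pooling step come from the description above.

```python
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer

model_id = "smartmind/roberta-ko-small-tsdae"  # assumed Hub id; adjust if the repository path differs
sentences = ["오늘 날씨가 정말 좋네요.", "한국어 문장 임베딩을 계산합니다."]

# Route 1: sentence-transformers handles tokenization and CLS pooling internally
st_model = SentenceTransformer(model_id)
embeddings = st_model.encode(sentences)
print(embeddings.shape)  # expected: (2, 256)

# Route 2: plain Hugging Face Transformers with manual CLS pooling
tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModel.from_pretrained(model_id)

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = hf_model(**inputs)

# CLS pooling: use the hidden state of the first token as the sentence embedding
cls_embeddings = outputs.last_hidden_state[:, 0]
print(cls_embeddings.shape)  # expected: (2, 256)
```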
Core Capabilities
- Sentence similarity computation (illustrated after this list)
- Semantic text embedding generation
- Paraphrase mining
- Clustering of similar sentences
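The similarity and paraphrase-mining capabilities can be exercised with the utility functions shipped with sentence-transformers. The sketch below uses assumed example sentences and the same assumed repository id as above; the resulting embeddings can likewise be passed to any standard clustering algorithm for the clustering use case.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("smartmind/roberta-ko-small-tsdae")  # assumed Hub id

sentences = [
    "오늘 날씨가 정말 좋네요.",
    "오늘은 날씨가 참 좋습니다.",
    "내일 회의는 오후 세 시에 시작합니다.",
]

# Pairwise cosine similarity between the 256-dimensional sentence embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)
print(util.cos_sim(embeddings, embeddings))

# Paraphrase mining: returns [score, i, j] entries sorted by similarity
pairs = util.paraphrase_mining(model, sentences)
print(pairs[0])  # the near-duplicate first two sentences should rank highest
```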
Frequently Asked Questions
Q: What makes this model unique?
The model combines TSDAE pre-training with the RoBERTa architecture specifically for Korean, delivering strong performance on sentence similarity tasks without requiring fine-tuning. It does so while keeping the embedding space relatively compact at 256 dimensions.
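For readers curious how TSDAE pre-training looks in practice, the sketch below follows the generic recipe supported by sentence-transformers, where token-deleted inputs are reconstructed by a decoder tied to the encoder. It is not the authors' exact training setup; the base checkpoint, corpus, and hyperparameters are placeholders.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

# Encoder with CLS pooling; the base checkpoint is illustrative, not the authors' exact starting point
word_embedding = models.Transformer("lassl/roberta-ko-small")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="cls")
model = SentenceTransformer(modules=[word_embedding, pooling])

# Unlabeled Korean sentences; the dataset wrapper creates noisy (token-deleted) inputs on the fly
train_sentences = ["예시 문장입니다.", "또 다른 예시 문장입니다."]  # placeholder corpus
train_loader = DataLoader(DenoisingAutoEncoderDataset(train_sentences), batch_size=8, shuffle=True)

# The denoising loss ties a decoder to the encoder and reconstructs the original sentence
train_loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, show_progress_bar=True)
```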
Q: What are the recommended use cases?
The model is well suited to Korean applications that need semantic similarity analysis, including document clustering, semantic search, and paraphrase detection. It can be used out of the box for sentence similarity tasks or fine-tuned for specific downstream applications, as sketched below.
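If a downstream task calls for fine-tuning, one common pattern is to train on labelled sentence pairs with a cosine-similarity loss. The sketch below is a generic example of that pattern, not a recipe from the model card; the pairs, labels, and hyperparameters are placeholders.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("smartmind/roberta-ko-small-tsdae")  # assumed Hub id

# Toy STS-style pairs with similarity labels in [0, 1]; replace with a real Korean dataset
train_examples = [
    InputExample(texts=["주문한 상품이 아직 도착하지 않았어요.", "배송이 아직 안 왔습니다."], label=0.9),
    InputExample(texts=["환불은 어떻게 하나요?", "내일 날씨가 어떤가요?"], label=0.1),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Regression on cosine similarity between the two sentence embeddings
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=10)
```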