roberta-ko-small-tsdae

Maintained by: smartmind


  • License: MIT
  • Language: Korean
  • Paper: TSDAE Paper
  • Vector Dimension: 256

What is roberta-ko-small-tsdae?

roberta-ko-small-tsdae is a specialized Korean language model based on the RoBERTa architecture, pre-trained with the TSDAE (Transformer-based Sequential Denoising Auto-Encoder) approach. It maps Korean sentences and paragraphs into a 256-dimensional dense vector space, making it particularly effective for semantic search and clustering tasks.

Implementation Details

The model uses the sentence-transformers framework and shares its architecture with lassl/roberta-ko-small, but with a different tokenizer. It uses CLS pooling to produce sentence embeddings and performs strongly on the KLUE STS dataset without any fine-tuning.

  • Achieves a 0.8735 cosine Pearson correlation on the KLUE STS training set
  • Supports both sentence-transformers and HuggingFace Transformers usage (see the examples below)
  • Includes built-in padding and truncation capabilities
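
As a quick illustration of the sentence-transformers path, the sketch below encodes a few Korean sentences and checks the embedding dimensionality. The repository id smartmind/roberta-ko-small-tsdae is an assumption inferred from the maintainer and model name above; adjust it if the model is hosted under a different name.

```python
from sentence_transformers import SentenceTransformer

# Assumed repository id, inferred from the maintainer ("smartmind") and model name.
model = SentenceTransformer("smartmind/roberta-ko-small-tsdae")

sentences = [
    "이 모델은 한국어 문장을 밀집 벡터로 변환합니다.",
    "의미가 비슷한 문장은 벡터 공간에서 가까이 위치합니다.",
]

# Each sentence is mapped to a 256-dimensional dense vector.
embeddings = model.encode(sentences)
print(embeddings.shape)  # expected: (2, 256)
```

For the plain HuggingFace Transformers route, CLS pooling means taking the hidden state of the first token as the sentence embedding; a minimal sketch under the same assumed repository id:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Same assumed repository id as above.
tokenizer = AutoTokenizer.from_pretrained("smartmind/roberta-ko-small-tsdae")
model = AutoModel.from_pretrained("smartmind/roberta-ko-small-tsdae")

sentences = ["한국어 문장 임베딩 예시입니다."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# CLS pooling: use the first token's hidden state as the sentence embedding.
cls_embeddings = outputs.last_hidden_state[:, 0]
print(cls_embeddings.shape)  # expected: (1, 256)
```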

Core Capabilities

  • Sentence similarity computation (see the example after this list)
  • Semantic text embedding generation
  • Paraphrase mining
  • Clustering of similar sentences
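
To illustrate the sentence-similarity capability listed above, the sketch below scores Korean sentences against each other with cosine similarity via sentence_transformers.util; the sentences are illustrative and the repository id is the same assumption as earlier.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("smartmind/roberta-ko-small-tsdae")  # assumed repository id

sentences = [
    "오늘 날씨가 정말 좋네요.",
    "오늘은 날씨가 맑고 화창합니다.",
    "주식 시장이 큰 폭으로 하락했습니다.",
]

embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarity; semantically close sentences score higher.
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```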

Frequently Asked Questions

Q: What makes this model unique?

The model combines TSDAE pre-training with RoBERTa architecture specifically for Korean language processing, offering strong performance on sentence similarity tasks without requiring fine-tuning. It achieves this while maintaining a relatively compact 256-dimensional embedding space.

Q: What are the recommended use cases?

The model is ideal for Korean language applications requiring semantic similarity analysis, including document clustering, semantic search, and paraphrase detection. It can be used either out-of-the-box for sentence similarity tasks or fine-tuned for specific downstream applications.
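
If fine-tuning for a specific downstream application is desired, one option is the classic sentence-transformers training loop with a cosine-similarity objective. The sketch below is illustrative only: the sentence pairs, labels, and hyperparameters are placeholders, and the repository id is assumed as above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("smartmind/roberta-ko-small-tsdae")  # assumed repository id

# Placeholder STS-style pairs with similarity labels in [0, 1].
train_examples = [
    InputExample(texts=["오늘 날씨가 좋다.", "오늘은 날씨가 화창하다."], label=0.9),
    InputExample(texts=["오늘 날씨가 좋다.", "주식 시장이 하락했다."], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Illustrative hyperparameters; tune for the actual dataset.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```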
