# Sarashina-Embedding-v1-1B
| Property | Value |
|---|---|
| Base Model | Sarashina2.1-1B |
| Embedding Dimensions | 1,792 |
| Max Sequence Length | 8,192 tokens |
| Language | Japanese |
| License | Sarashina Model NonCommercial License Agreement |
| Model URL | https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b |
## What is sarashina-embedding-v1-1b?
Sarashina-embedding-v1-1b is a state-of-the-art Japanese text embedding model that maps Japanese text to 1,792-dimensional dense vectors. Built on the 1.2B-parameter Japanese LLM Sarashina2.1-1B, it achieves the highest average score across the 16 datasets of JMTEB (Japanese Massive Text Embedding Benchmark).
## Implementation Details
The model is trained in two stages: weakly supervised learning on over 126 million examples drawn from diverse sources, followed by supervised fine-tuning on 233,072 carefully selected examples. The architecture pairs a SentenceTransformer wrapper with a LlamaModel backbone and a specialized pooling layer.
- Last-token (`lasttoken`) pooling over the final hidden states
- Cosine similarity as the primary comparison metric
- Extensive training on multiple Japanese datasets including Wiki QA, MQA, and custom web-crawled data
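The `lasttoken` pooling mode takes the final hidden state of each sequence's last non-padding token as the sentence embedding. A minimal NumPy sketch of that selection logic (shapes, names, and values are illustrative, not the model's actual code):

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of the last non-padding token per sequence.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    # Index of the last real (non-padding) token in each sequence.
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]

# Toy batch: 2 sequences, length 4, hidden dimension 3.
h = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # 3 real tokens -> pool index 2
                 [1, 1, 1, 1]])  # 4 real tokens -> pool index 3
emb = last_token_pool(h, mask)   # shape (2, 3)
```

The key point is that padded positions are skipped: the embedding comes from the last token the model actually attended to, not simply position `seq_len - 1`.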
## Core Capabilities
- Semantic textual similarity analysis
- Semantic search implementation
- Paraphrase mining
- Text classification and clustering
- Top JMTEB performance with a 75.50 average score
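Because cosine similarity is the comparison metric, semantic search over precomputed embeddings reduces to a normalized dot product followed by a sort. A self-contained NumPy sketch, with toy 3-dimensional vectors standing in for real 1,792-dimensional sarashina embeddings:

```python
import numpy as np

def cosine_search(query: np.ndarray, corpus: np.ndarray, top_k: int = 2):
    """Return indices and scores of the top_k corpus rows most similar to query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity per corpus row
    top = np.argsort(-scores)[:top_k]    # highest similarity first
    return top, scores[top]

# Toy "embeddings": rows 0 and 1 are close, row 2 points elsewhere.
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0])
idx, scores = cosine_search(query, corpus)  # idx -> [0, 1]
```

For large corpora the same normalized-dot-product scoring is typically delegated to an approximate nearest-neighbor index rather than a full scan.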
## Frequently Asked Questions
Q: What makes this model unique?

A: The model stands out for its superior performance on Japanese text embedding tasks, outperforming other models including OpenAI's text-embedding-3-large on JMTEB. Its unique training process and large context window of 8,192 tokens make it particularly powerful for Japanese language applications.
Q: What are the recommended use cases?

A: The model is ideal for Japanese text analysis tasks including semantic search, document similarity comparison, clustering, and classification. It's particularly suited for non-commercial applications requiring deep understanding of Japanese text semantics.
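One of the listed capabilities, paraphrase mining, can be done naively by comparing every pair of embeddings and keeping those above a cosine-similarity threshold. A hypothetical NumPy sketch (the 0.9 threshold and the 2-dimensional toy vectors are illustrative assumptions, not values from the model card):

```python
import numpy as np

def mine_paraphrases(embeddings: np.ndarray, threshold: float = 0.9):
    """Return (i, j, score) for all pairs i < j with cosine similarity >= threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    pairs = [(i, j, float(sims[i, j]))
             for i in range(n) for j in range(i + 1, n)
             if sims[i, j] >= threshold]
    return sorted(pairs, key=lambda p: -p[2])  # most similar pairs first

# Toy vectors: rows 0 and 1 are near-duplicates, row 2 is unrelated.
vecs = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
pairs = mine_paraphrases(vecs)  # only the (0, 1) pair survives the threshold
```

This O(n²) scan is fine for thousands of documents; larger collections would again call for an approximate nearest-neighbor index.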