# Sarashina-Embedding-v1-1B
| Property | Value |
|---|---|
| Base Model | Sarashina2.1-1B |
| Embedding Dimensions | 1,792 |
| Max Sequence Length | 8,192 tokens |
| Language | Japanese |
| License | Sarashina Model NonCommercial License Agreement |
| Model URL | https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b |
## What is sarashina-embedding-v1-1b?
Sarashina-embedding-v1-1b is a state-of-the-art Japanese text embedding model that maps Japanese text to 1,792-dimensional dense vectors. Built on the 1.2B-parameter Japanese LLM Sarashina2.1-1B, it achieves the highest average score across the 16 datasets of JMTEB (Japanese Massive Text Embedding Benchmark).
## Implementation Details
The model is trained in two stages: weakly supervised learning on over 126 million examples drawn from diverse sources, followed by supervised fine-tuning on 233,072 carefully selected examples. The architecture pairs a SentenceTransformer wrapper with a LlamaModel backbone and a specialized pooling layer.
- Last-token (`lasttoken`) pooling over the final hidden states
- Cosine similarity as the primary comparison metric
- Extensive training on multiple Japanese datasets including Wiki QA, MQA, and custom web-crawled data
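The `lasttoken` pooling mode takes the final hidden state of each sequence's last non-padding token as the sentence embedding. A minimal NumPy sketch of that selection logic (shapes, names, and values are illustrative, not the model's actual code):

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of the last non-padding token per sequence.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    # Index of the last real (non-padding) token in each sequence.
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]

# Toy batch: 2 sequences, length 4, hidden dimension 3.
h = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # 3 real tokens -> pool index 2
                 [1, 1, 1, 1]])  # 4 real tokens -> pool index 3
emb = last_token_pool(h, mask)   # shape (2, 3)
```

The key point is that padded positions are skipped: the embedding comes from the last token the model actually attended to, not simply position `seq_len - 1`.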
## Core Capabilities
- Semantic textual similarity analysis
- Semantic search implementation
- Paraphrase mining
- Text classification and clustering
- Top JMTEB performance with a 75.50 average score
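Because cosine similarity is the comparison metric, semantic search over precomputed embeddings reduces to a normalized dot product followed by a sort. A self-contained NumPy sketch, with toy 3-dimensional vectors standing in for real 1,792-dimensional sarashina embeddings:

```python
import numpy as np

def cosine_search(query: np.ndarray, corpus: np.ndarray, top_k: int = 2):
    """Return indices and scores of the top_k corpus rows most similar to query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity per corpus row
    top = np.argsort(-scores)[:top_k]    # highest similarity first
    return top, scores[top]

# Toy "embeddings": rows 0 and 1 are close, row 2 points elsewhere.
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0])
idx, scores = cosine_search(query, corpus)  # idx -> [0, 1]
```

For large corpora the same normalized-dot-product scoring is typically delegated to an approximate nearest-neighbor index rather than a full scan.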
## Frequently Asked Questions
Q: What makes this model unique?

A: The model stands out for its superior performance on Japanese text embedding tasks, outperforming other models including OpenAI's text-embedding-3-large on JMTEB. Its unique training process and large context window of 8,192 tokens make it particularly powerful for Japanese language applications.
Q: What are the recommended use cases?

A: The model is ideal for Japanese text analysis tasks including semantic search, document similarity comparison, clustering, and classification. It's particularly suited for non-commercial applications requiring deep understanding of Japanese text semantics.
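One of the listed capabilities, paraphrase mining, can be done naively by comparing every pair of embeddings and keeping those above a cosine-similarity threshold. A hypothetical NumPy sketch (the 0.9 threshold and the 2-dimensional toy vectors are illustrative assumptions, not values from the model card):

```python
import numpy as np

def mine_paraphrases(embeddings: np.ndarray, threshold: float = 0.9):
    """Return (i, j, score) for all pairs i < j with cosine similarity >= threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    pairs = [(i, j, float(sims[i, j]))
             for i in range(n) for j in range(i + 1, n)
             if sims[i, j] >= threshold]
    return sorted(pairs, key=lambda p: -p[2])  # most similar pairs first

# Toy vectors: rows 0 and 1 are near-duplicates, row 2 is unrelated.
vecs = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
pairs = mine_paraphrases(vecs)  # only the (0, 1) pair survives the threshold
```

This O(n²) scan is fine for thousands of documents; larger collections would again call for an approximate nearest-neighbor index.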