sarashina-embedding-v1-1b

Maintained By
sbintuitions

Sarashina-Embedding-v1-1B

  • Base Model: Sarashina2.1-1B
  • Embedding Dimensions: 1,792
  • Max Sequence Length: 8,192 tokens
  • Language: Japanese
  • License: Sarashina Model NonCommercial License Agreement
  • Model URL: https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b

What is sarashina-embedding-v1-1b?

Sarashina-embedding-v1-1b is a state-of-the-art Japanese text embedding model that maps Japanese text to 1,792-dimensional dense vectors. Built on the 1.2B-parameter Japanese LLM Sarashina2.1-1B, it achieves the highest average score across the 16 datasets of JMTEB (the Japanese Massive Text Embedding Benchmark).

Implementation Details

The model employs a two-stage training process: initial weakly-supervised learning using over 126 million training examples from diverse sources, followed by supervised fine-tuning on 233,072 carefully selected examples. The architecture uses SentenceTransformer with a LlamaModel backbone and specialized pooling mechanisms.

  • Last-token pooling (lasttoken mode) to derive a single embedding vector from the decoder's hidden states
  • Cosine similarity as the primary comparison metric
  • Extensive training on multiple Japanese datasets including Wiki QA, MQA, and custom web-crawled data
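The two mechanisms above can be illustrated with a minimal numpy sketch. This is not the model's actual code: the function names are illustrative, and the small random matrices stand in for the real 1,792-dimensional hidden states the model produces.

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Last-token pooling: use the final token's hidden state as the
    sequence embedding (shape (seq_len, dim) -> (dim,))."""
    return hidden_states[-1]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, the model's primary comparison metric."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy hidden states standing in for the real 1,792-dim model outputs.
rng = np.random.default_rng(0)
states_a = rng.normal(size=(5, 8))  # 5 tokens, 8 dims for illustration
states_b = rng.normal(size=(7, 8))

emb_a = last_token_pool(states_a)
emb_b = last_token_pool(states_b)
sim = cosine_similarity(emb_a, emb_b)
```

In practice the SentenceTransformer wrapper mentioned above handles pooling internally, so users only see the final sentence vectors.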

Core Capabilities

  • Semantic textual similarity analysis
  • Semantic search implementation
  • Paraphrase mining
  • Text classification and clustering
  • Strong JMTEB performance, with a 75.50 average score
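The semantic search capability listed above reduces to ranking corpus embeddings by cosine similarity to a query embedding. The following is a minimal sketch under that assumption; the 2-dimensional toy vectors stand in for embeddings the model would produce, and `top_k` is an illustrative helper, not part of the model's API.

```python
import numpy as np

def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 2):
    """Return (index, score) pairs for the k corpus vectors most
    similar to the query, by cosine similarity."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarities
    order = np.argsort(-scores)[:k]    # indices of best matches first
    return [(int(i), float(scores[i])) for i in order]

# Toy embeddings: documents 0 and 1 point in nearly the same direction.
corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])
hits = top_k(query, corpus)  # documents 0 and 1 rank above document 2
```

With the real model, the corpus and query vectors would each be 1,792-dimensional outputs of the encoder.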

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its superior performance on Japanese text embedding tasks, outperforming other models including OpenAI's text-embedding-3-large on JMTEB. Its unique training process and large context window of 8,192 tokens make it particularly powerful for Japanese language applications.

Q: What are the recommended use cases?

The model is ideal for Japanese text analysis tasks including semantic search, document similarity comparison, clustering, and classification. It's particularly suited for non-commercial applications requiring deep understanding of Japanese text semantics.
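One of the use cases above, paraphrase mining, can be sketched as thresholding pairwise cosine similarities between embeddings. This is a hypothetical helper for illustration only, again using toy 2-dimensional vectors in place of the model's real output.

```python
import numpy as np

def paraphrase_pairs(embeddings: np.ndarray, threshold: float = 0.9):
    """Return (i, j, score) for all pairs whose cosine similarity
    meets or exceeds the threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T  # full pairwise cosine similarity matrix
    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return pairs

# Toy embeddings: sentences 0 and 1 are near-duplicates.
embs = np.array([[1.0, 0.0], [0.98, 0.05], [0.0, 1.0]])
pairs = paraphrase_pairs(embs)  # only the (0, 1) pair clears 0.9
```

The threshold would be tuned per task; real pipelines typically also avoid materializing the full similarity matrix for large corpora.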
