Ruri-large-v2

Property	Value
Parameter Count	337M
Model Type	Sentence Transformer
Output Dimensions	1024
Max Sequence Length	512 tokens
License	Apache 2.0
Paper	arXiv:2409.07737

What is ruri-large-v2?

Ruri-large-v2 is a Japanese text embedding model developed by cl-nagoya. It is the latest iteration of the Ruri model series, which is designed specifically for Japanese language understanding. The model achieves a 74.55% average score on the JMTEB benchmark, improving on its predecessors and on competing models.

Implementation Details

Ruri-large-v2 is built on the sentence-transformers framework and combines a BERT-based transformer with a pooling layer. Every input must carry a task prefix: "クエリ: " for queries and "文章: " for passages. The model processes sequences of up to 512 tokens and outputs 1024-dimensional embeddings.
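A minimal sketch of how the two prefixes might be applied before encoding. It assumes the `sentence-transformers` package is installed and that the model is available on the Hugging Face Hub under the ID `cl-nagoya/ruri-large-v2`; the helper names `add_prefix` and `embed` are illustrative and not part of any library.

```python
QUERY_PREFIX = "クエリ: "    # prefix for search queries
PASSAGE_PREFIX = "文章: "    # prefix for passages / documents


def add_prefix(texts, kind):
    """Prepend the prefix the model expects for queries or passages."""
    if kind == "query":
        prefix = QUERY_PREFIX
    elif kind == "passage":
        prefix = PASSAGE_PREFIX
    else:
        raise ValueError("kind must be 'query' or 'passage'")
    return [prefix + text for text in texts]


def embed(model, texts, kind):
    """Encode prefixed texts with an already loaded SentenceTransformer.

    `model` is assumed to be a sentence_transformers.SentenceTransformer
    instance; its encode() method returns one vector per input text.
    """
    return model.encode(add_prefix(texts, kind))


# Usage (downloads the model weights on first run; the Hub ID is an assumption):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("cl-nagoya/ruri-large-v2")
# query_vecs = embed(model, ["瑠璃色はどんな色?"], "query")
# passage_vecs = embed(model, ["瑠璃色は紫みを帯びた濃い青。"], "passage")
```

Keeping the prefix logic in one helper makes it harder to mix up query and passage inputs when the same code path serves both indexing and search.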

Pooling layer using the mean-tokens strategy
Cosine similarity-based text comparison
Optimized for Japanese language processing
Needs little preprocessing; depends on the fugashi and sentencepiece packages
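The mean-tokens pooling and cosine comparison listed above can be illustrated in plain Python. This is a sketch of the general technique on toy 2-dimensional vectors, not the model's actual code; the function names and sample values are made up for illustration.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy example: three token vectors, the last one is padding.
tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)  # averages only the first two tokens
similarity = cosine_similarity(sentence_vec, [2.0, 1.0])
```

In the real model the same averaging runs over 1024-dimensional token outputs, and the attention mask keeps padding from diluting the sentence vector.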

Core Capabilities

Strong performance in retrieval tasks (76.34% on JMTEB retrieval benchmark)
Excellent semantic textual similarity (83.17% on STS tasks)
Robust classification capabilities (77.18% accuracy)
High-quality reranking performance (93.21%)

Frequently Asked Questions

Q: What makes this model unique?

Ruri-large-v2 is designed specifically for Japanese and performs strongly across multiple JMTEB task categories. It does particularly well on retrieval and semantic textual similarity, outperforming many multilingual alternatives.

Q: What are the recommended use cases?

The model suits Japanese text embedding tasks such as semantic search, document similarity analysis, text classification, and information retrieval. It is a particularly good fit for applications that need high-precision matching of Japanese text.
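For the semantic search use case, ranking usually comes down to scoring each passage embedding against the query embedding and keeping the best matches. The sketch below shows that step in plain Python on toy vectors; `top_k` and `normalize` are hypothetical helper names, and in practice the vectors would be the model's 1024-dimensional outputs.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec, passage_vecs, k=3):
    """Return (index, score) pairs for the k passages most similar to the query."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(passage_vecs):
        p = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, p))))
    # nlargest returns the pairs sorted from highest to lowest score.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-dimensional "embeddings" standing in for real model outputs.
passages = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
ranked = top_k([1.0, 0.0], passages, k=2)
```

For large collections a vector index would normally replace the linear scan, but the scoring rule stays the same.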
