# Multilingual-E5-small Text Embeddings
| Property | Value |
|---|---|
| Parameter Count | 118M |
| License | MIT |
| Paper | Multilingual E5 Text Embeddings: A Technical Report |
| Languages Supported | 100+ |
## What is sentence-transformers-multilingual-e5-small?
Multilingual-E5-small is a compact yet capable text embedding model designed for cross-lingual applications. Built on a 12-layer Transformer that produces 384-dimensional embeddings, it supports over 100 languages and performs well on semantic search, retrieval, and similarity tasks. The model distinguishes input types with instruction-style prefixes: "query:" for queries and "passage:" for documents.
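As a minimal sketch of how the prefixes are used (assuming the Hugging Face checkpoint name `intfloat/multilingual-e5-small` and the `sentence-transformers` library, neither of which this card states explicitly):

```python
from sentence_transformers import SentenceTransformer

# Assumed checkpoint name; this card refers to the model as
# "sentence-transformers-multilingual-e5-small".
model = SentenceTransformer("intfloat/multilingual-e5-small")

# E5 models expect a task prefix on every input:
#   "query: "   -> search queries (and both sides of symmetric tasks)
#   "passage: " -> documents/passages to be retrieved
sentences = [
    "query: how much protein should a female eat",
    "passage: As a general guideline, the average protein requirement "
    "for women ages 19 to 70 is 46 grams per day.",
]

# normalize_embeddings=True yields unit vectors, so dot product == cosine.
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384): two 384-dimensional vectors
```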
## Implementation Details
The model was developed in two stages: contrastive pre-training with weak supervision on large multilingual corpora (including mC4, CC-News, and NLLB), followed by supervised fine-tuning on high-quality labeled datasets. The same "query:"/"passage:" prefixes are applied at both training and inference time, so each input is encoded in a task-appropriate way (a sketch of the contrastive objective follows the list below).
- Initialized from microsoft/Multilingual-MiniLM-L12-H384
- Trained on 1B+ text pairs for contrastive learning
- Fine-tuned on labeled datasets such as MS MARCO, Natural Questions (NQ), and multilingual retrieval resources
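The card does not spell out the loss, but contrastive pre-training of this kind standardly minimizes an InfoNCE objective over in-batch negatives, roughly (a sketch, where $s(\cdot,\cdot)$ is a similarity score and $\tau$ a temperature; both are assumptions, not details from this card):

$$
\mathcal{L} = -\log \frac{\exp\left(s(q, p^{+}) / \tau\right)}{\exp\left(s(q, p^{+}) / \tau\right) + \sum_{p^{-} \in \mathcal{N}} \exp\left(s(q, p^{-}) / \tau\right)}
$$

where $q$ is a query embedding, $p^{+}$ its paired positive, and $\mathcal{N}$ the other passages in the batch serving as negatives.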
## Core Capabilities
- Cross-lingual semantic search and retrieval (see the sketch after this list)
- Text similarity assessment across 100+ languages
- Bitext mining and parallel text alignment
- Document classification and clustering
- Semantic textual similarity (STS) tasks
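To illustrate the cross-lingual case, here is a sketch of scoring English against German sentences (again assuming the `intfloat/multilingual-e5-small` checkpoint; the sentence pairs are invented for illustration). For symmetric tasks like STS and bitext mining, the "query:" prefix goes on both sides:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-small")  # assumed checkpoint

# Symmetric task (bitext mining / STS): "query: " prefix on both sides.
english = [
    "query: The cat sits on the mat.",
    "query: Stock markets fell sharply today.",
]
german = [
    "query: Die Katze sitzt auf der Matte.",
    "query: Die Aktienmärkte fielen heute stark.",
]

emb_en = model.encode(english, normalize_embeddings=True)
emb_de = model.encode(german, normalize_embeddings=True)

# 2x2 cosine similarity matrix; aligned pairs should dominate the diagonal.
print(util.cos_sim(emb_en, emb_de))
```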
## Frequently Asked Questions
**Q: What makes this model unique?**
A: The model combines a compact footprint (118M parameters) with strong multilingual coverage, using prefix-based encoding and achieving state-of-the-art results on multilingual retrieval benchmarks such as Mr. TyDi.
**Q: What are the recommended use cases?**
A: The model excels at cross-lingual information retrieval, semantic search, and text similarity. Use the "query:" prefix on both inputs for symmetric tasks (e.g., semantic similarity, bitext mining), and pair "query:" (for queries) with "passage:" (for documents) in asymmetric tasks like passage retrieval.
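A short retrieval sketch of the asymmetric case (same assumed checkpoint as above; the query and passages are invented for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-small")  # assumed checkpoint

# Asymmetric retrieval: "query: " for the query, "passage: " for candidates.
query = "query: ¿Cuál es la capital de Francia?"  # Spanish query
passages = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is the capital and largest city of Germany.",
]

q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# Rank candidates by cosine similarity; the France passage should score
# highest even though the query is Spanish and the passages are English.
scores = util.cos_sim(q_emb, p_emb)[0]
best = int(scores.argmax())
print(passages[best], float(scores[best]))
```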