Multilingual-E5-Base
| Property | Value |
|---|---|
| Parameter Count | 278M |
| License | MIT |
| Paper | Multilingual E5 Text Embeddings: A Technical Report |
| Languages Supported | 94 languages |
What is multilingual-e5-base?
Multilingual-E5-Base is a powerful text embedding model designed for cross-lingual understanding and retrieval tasks. Built on the XLM-RoBERTa architecture, it features 12 transformer layers and produces 768-dimensional embeddings. The model was trained on over 1 billion text pairs across multiple languages and fine-tuned on diverse supervised datasets.
Implementation Details
The model follows a two-stage training process: first, contrastive pre-training with weak supervision on massive multilingual datasets including mC4, CC News, and NLLB, followed by supervised fine-tuning on high-quality datasets like MS MARCO, NQ, and multilingual retrieval datasets.
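The technical report describes the contrastive stage as an InfoNCE-style objective with in-batch negatives. As a sketch (the exact formulation, e.g. whether mined hard negatives are added during fine-tuning, is specified in the paper), with query embedding q_i, its paired passage p_i, cosine similarity s(·,·), and temperature τ:

```latex
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}
\log \frac{\exp\!\big(s(q_i, p_i)/\tau\big)}
          {\sum_{j=1}^{N} \exp\!\big(s(q_i, p_j)/\tau\big)}
```

Each query is pulled toward its paired passage and pushed away from the other passages in the same batch.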
- Architecture: 12-layer transformer with 768-dimensional embeddings
- Training Data: 1B+ text pairs from diverse sources
- Input Format: Requires "query:" or "passage:" prefixes for optimal performance (see the usage sketch after this list)
- Supported Tasks: Retrieval, semantic similarity, clustering, classification
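The snippet below is a minimal sketch of the prefix convention using the Hugging Face transformers library with mean pooling and L2 normalization, the usual E5 recipe; check the model card for the exact pooling details before relying on it.

```python
# Minimal sketch: embedding texts with the E5 "query:" / "passage:" prefixes.
# Assumes torch and transformers are installed.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states, attention_mask):
    # Zero out padding tokens, then average the remaining token embeddings.
    hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-base")

# Queries take the "query:" prefix, documents the "passage:" prefix.
texts = [
    "query: how much protein should a female eat",
    "passage: The recommended daily protein intake for women is about 46 grams.",
]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)   # unit-length vectors
scores = embeddings[:1] @ embeddings[1:].T         # dot product = cosine similarity
print(scores)
```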
Core Capabilities
- Strong performance on cross-lingual retrieval (70.5% MRR@10 on Mr. TyDi)
- Effective text embeddings for 94 languages
- State-of-the-art results on the MTEB benchmark
- Flexible integration with popular frameworks like sentence-transformers
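For the sentence-transformers integration mentioned above, usage reduces to a few lines; this is a sketch assuming the sentence-transformers package is installed, and the prefix convention still applies to the raw input strings.

```python
# Minimal sketch: encoding with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")

sentences = [
    "query: where was Marie Curie born",
    "passage: Maria Sklodowska-Curie was born in Warsaw in 1867.",
    "passage: Paris is the capital of France.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# With normalized vectors, the dot product equals cosine similarity.
scores = embeddings[:1] @ embeddings[1:].T
print(scores)  # the Warsaw passage should score higher than the Paris one
```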
Frequently Asked Questions
Q: What makes this model unique?
The model combines extensive multilingual pre-training with targeted supervised fine-tuning, achieving strong performance across languages while remaining relatively compact at 278M parameters.
Q: What are the recommended use cases?
The model excels at cross-lingual information retrieval, semantic similarity computation, and document clustering. It's particularly effective for applications requiring multilingual understanding.
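As an illustration of the clustering use case, the hypothetical sketch below embeds a few short documents in different languages and groups them with k-means; the documents, cluster count, and use of scikit-learn are assumptions for the example, not part of the model card.

```python
# Hypothetical sketch: cross-lingual document clustering with k-means.
# Assumes sentence-transformers and scikit-learn are installed.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("intfloat/multilingual-e5-base")

docs = [
    "passage: The stock market fell sharply on Monday.",        # finance (en)
    "passage: Die Aktienkurse sind am Montag stark gefallen.",  # finance (de)
    "passage: The team won the championship final.",            # sports (en)
    "passage: L'équipe a remporté la finale du championnat.",   # sports (fr)
]
embeddings = model.encode(docs, normalize_embeddings=True)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # documents on the same topic should share a cluster label, regardless of language
```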