multilingual-e5-base

Maintained By: intfloat

  • Parameter Count: 278M
  • License: MIT
  • Paper: Multilingual E5 Text Embeddings: A Technical Report
  • Languages Supported: 94 languages

What is multilingual-e5-base?

Multilingual-E5-Base is a powerful text embedding model designed for cross-lingual understanding and retrieval tasks. Built on the XLM-RoBERTa architecture, it features 12 transformer layers and produces 768-dimensional embeddings. The model was trained on over 1 billion text pairs across multiple languages and fine-tuned on diverse supervised datasets.

Implementation Details

The model follows a two-stage training process: first, contrastive pre-training with weak supervision on massive multilingual corpora including mC4, CC News, and NLLB, followed by supervised fine-tuning on labeled datasets such as MS MARCO, Natural Questions (NQ), and multilingual retrieval collections.

  • Architecture: 12-layer transformer with 768-dimensional embeddings
  • Training Data: 1B+ text pairs from diverse sources
  • Input Format: Requires "query:" or "passage:" prefixes for optimal performance (see the encoding sketch after this list)
  • Supported Tasks: Retrieval, semantic similarity, clustering, classification
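
To make the input format concrete, here is a minimal encoding sketch using Hugging Face transformers, with mean pooling over non-padding tokens and L2 normalization as commonly used with E5-style models; the example sentences are illustrative assumptions, not taken from the model card.

```python
# Minimal encoding sketch for multilingual-e5-base with the required prefixes.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: torch.Tensor,
                 attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average over the sequence dimension.
    hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


texts = [
    "query: how to improve sleep quality",                      # queries get the "query: " prefix
    "passage: Keeping a regular bedtime helps improve sleep.",  # documents get the "passage: " prefix
    "passage: Der Eiffelturm steht in Paris.",
]

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-base")

batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)  # unit-length vectors
scores = embeddings[:1] @ embeddings[1:].T        # cosine similarity of query vs. passages
print(scores)
```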

Core Capabilities

  • Strong performance on cross-lingual retrieval (70.5% MRR@10 on Mr. TyDi)
  • Effective text embeddings for 94 languages
  • State-of-the-art results on the MTEB benchmark
  • Flexible integration with popular frameworks such as sentence-transformers (see the sketch below)
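
As a sketch of the sentence-transformers integration mentioned above, the snippet below scores one English query against passages in two languages; the prefixes are added by hand, and the example texts are assumptions for illustration.

```python
# Cross-lingual similarity sketch using sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")

queries = ["query: What is the capital of France?"]
passages = [
    "passage: Paris is the capital and most populous city of France.",
    "passage: Berlin ist die Hauptstadt Deutschlands.",
]

# normalize_embeddings=True makes the dot product equal to cosine similarity.
query_emb = model.encode(queries, normalize_embeddings=True)
passage_emb = model.encode(passages, normalize_embeddings=True)

scores = query_emb @ passage_emb.T
print(scores)  # the French passage should score higher than the German one
```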

Frequently Asked Questions

Q: What makes this model unique?

The model combines extensive multilingual pre-training with targeted supervised fine-tuning, achieving strong performance across languages while remaining compact at 278M parameters.

Q: What are the recommended use cases?

The model excels at cross-lingual information retrieval, semantic similarity computation, and document clustering. It's particularly effective for applications requiring multilingual understanding.
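
As an illustrative, hypothetical sketch of the clustering use case, the snippet below groups a handful of multilingual sentences by topic using scikit-learn's KMeans on normalized embeddings; the sample sentences and cluster count are assumptions, not part of the model card.

```python
# Multilingual document clustering sketch on top of the embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("intfloat/multilingual-e5-base")

docs = [
    "passage: The stock market fell sharply on Monday.",
    "passage: Los mercados bursátiles cayeron con fuerza el lunes.",
    "passage: The new smartphone features a faster processor.",
    "passage: Das neue Smartphone hat einen schnelleren Prozessor.",
]

embeddings = model.encode(docs, normalize_embeddings=True)

# Two clusters expected: finance news vs. smartphone news, across languages.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)
```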
