E5-base Text Embedding Model
| Property | Value |
|---|---|
| Parameters | 109M |
| Architecture | 12 layers, 768d embeddings |
| License | MIT |
| Paper | Text Embeddings by Weakly-Supervised Contrastive Pre-training |
What is e5-base?
E5-base is a text embedding model trained with weakly-supervised contrastive pre-training. It produces semantic representations of text for tasks such as semantic similarity, information retrieval, and text classification. Every input should start with one of two prefixes, "query: " or "passage: ", because the model was trained with them; omitting the prefix degrades performance.
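The prefix requirement can be handled with a small helper before tokenization. The sketch below is a minimal illustration, not part of any E5 library: the function name `with_prefix` and the `role` argument are hypothetical, and it assumes the prefix is the role name followed by a colon and a single space.

```python
def with_prefix(texts, role):
    """Prepend the E5 role prefix ("query: " or "passage: ") to each text.

    `with_prefix` and `role` are illustrative names, not an official API.
    """
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    # Strip surrounding whitespace so the prefix is followed by exactly one space.
    return [f"{role}: {text.strip()}" for text in texts]


queries = with_prefix(["how does contrastive pre-training work"], "query")
passages = with_prefix(["  Contrastive learning pulls matching pairs together.  "], "passage")
```

The prefixed strings are what should be passed to the tokenizer or encoder; the raw text is never embedded directly.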
Implementation Details
The model architecture consists of 12 transformer layers with an embedding dimension of 768. It applies average pooling over the last-layer hidden states to produce one embedding per input, which is then L2-normalized. The model was trained with a contrastive objective, using a low temperature of 0.01 for the InfoNCE loss.
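The pooling and normalization steps can be sketched on toy data. This is a minimal illustration in plain Python, assuming that padding positions are excluded from the average through an attention mask; in practice the hidden states would be tensors produced by the transformer, and the names `average_pool` and `l2_normalize` are hypothetical.

```python
import math


def average_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in summed]


def l2_normalize(vector):
    """Scale the vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


# Toy input: three token positions with 2-d hidden states; the last one is padding.
hidden = [[1.0, 2.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = average_pool(hidden, mask)   # mean of the first two rows
embedding = l2_normalize(pooled)      # unit-length embedding
```

Because the result has unit length, the dot product of two such embeddings equals their cosine similarity.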
- Supports both PyTorch and Sentence-Transformers frameworks
- Maximum sequence length of 512 tokens
- Optimized for English language content
- Achieves strong performance on MTEB benchmark tasks
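To see what the low InfoNCE temperature mentioned in this section does, the sketch below computes the loss for a single query. It assumes the common formulation in which the loss is a cross-entropy over similarity scores divided by the temperature, with one positive and the rest treated as negatives; the function name `info_nce_loss` and the example scores are illustrative and not taken from the paper's code.

```python
import math


def info_nce_loss(similarities, positive_index, temperature=0.01):
    """InfoNCE loss for one query: cross-entropy over similarities / temperature."""
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)  # subtract the max before exponentiating, for stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[positive_index]


# Hypothetical cosine similarities: index 0 is the positive passage.
sims = [0.80, 0.75, 0.60]
sharp = info_nce_loss(sims, positive_index=0, temperature=0.01)
soft = info_nce_loss(sims, positive_index=0, temperature=1.0)
```

With the same scores, the low temperature amplifies small similarity gaps, so `sharp` is much smaller than `soft` when the positive is ranked first, and the penalty grows quickly when a negative scores higher.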
Core Capabilities
- Text Retrieval and Semantic Search
- Semantic Similarity Assessment
- Document Classification
- Clustering and Information Organization
- Paraphrase Detection
Frequently Asked Questions
Q: What makes this model unique?
E5-base delivers strong performance across a range of embedding tasks with a relatively small parameter count of 109M. Its input prefixes ("query: " and "passage: ") mark the role each text plays, which lets a single model serve different use cases.
Q: What are the recommended use cases?
The model is best suited to semantic search, document retrieval, and similarity matching. It handles both symmetric tasks (comparing two texts of the same kind) and asymmetric tasks (matching a query against documents).
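The symmetric and asymmetric cases differ mainly in which prefix each side receives. The sketch below assumes the guidance from the upstream model card, as best understood here: "query: " on both sides for symmetric tasks, and "query: " versus "passage: " for asymmetric retrieval. The helper names `prefixes_for` and `rank_passages` are hypothetical, and the embeddings are toy unit vectors rather than real model output.

```python
def prefixes_for(task):
    """Return (left_prefix, right_prefix) for a task type (assumed convention)."""
    if task == "symmetric":
        return ("query: ", "query: ")
    if task == "asymmetric":
        return ("query: ", "passage: ")
    raise ValueError(f"unknown task type: {task!r}")


def rank_passages(query_embedding, passage_embeddings):
    """Rank passages by dot product with the query (cosine similarity for unit vectors)."""
    scores = [
        sum(q * p for q, p in zip(query_embedding, passage))
        for passage in passage_embeddings
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy unit-length embeddings standing in for model output.
query = [0.6, 0.8]
passages = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
order, scores = rank_passages(query, passages)
```

Here `order` lists passage indices from most to least similar, so the first entry is the best match for the query.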