# nomic-embed-text-v1.5
| Property | Value |
|---|---|
| Parameters | 137M |
| License | Apache 2.0 |
| Context Length | 8192 tokens |
| Paper | Nomic Embed: Training a Reproducible Long Context Text Embedder |
## What is nomic-embed-text-v1.5?
nomic-embed-text-v1.5 is a text embedding model that implements Matryoshka Representation Learning, allowing embedding dimensionality to be reduced from 768 to as low as 64 dimensions with minimal loss in quality. The model supports long-context inputs of up to 8192 tokens and is designed for production deployment in search, clustering, and classification tasks.
## Implementation Details
The model is trained in a multi-stage pipeline: starting from a long-context BERT model, it undergoes unsupervised contrastive pretraining followed by supervised contrastive fine-tuning. For best results, each input must carry a task instruction prefix (`search_document`, `search_query`, `clustering`, or `classification`).
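As an illustration of the prefixing requirement, the helper below prepends a task prefix to each input string before it is sent to the model. The `add_prefix` helper name is hypothetical; the prefix strings themselves are the four tasks listed above, each followed by a colon and a space as in the model's published usage examples.

```python
# Task instruction prefixes expected by nomic-embed-text-v1.5.
# Each input text is prepended with its task prefix before embedding.
VALID_TASKS = {"search_document", "search_query", "clustering", "classification"}

def add_prefix(texts, task):
    """Prepend a task instruction prefix to each input string (hypothetical helper)."""
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task: {task}")
    return [f"{task}: {t}" for t in texts]

docs = add_prefix(["The Eiffel Tower is in Paris."], "search_document")
query = add_prefix(["Where is the Eiffel Tower?"], "search_query")
print(docs[0])   # search_document: The Eiffel Tower is in Paris.
print(query[0])  # search_query: Where is the Eiffel Tower?
```

Note the asymmetry: documents and queries get different prefixes, so the model can embed them into matching regions of the space.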
- Supports dynamic dimensionality scaling (768, 512, 256, 128, 64)
- Achieves 62.28 MTEB score at full dimensionality
- Includes built-in sequence length scaling
- Multimodal compatibility with nomic-embed-vision-v1
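The dimensional scaling above works by keeping a prefix of the full 768-dimensional vector and re-normalizing it. A minimal NumPy sketch, using a random vector as a stand-in for a real model output (the released model's recommended pipeline also applies a layer-normalization step before truncation; this sketch shows only the truncate-and-renormalize core):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Matryoshka-style reduction: keep the first `dim` components,
    then L2-normalize so cosine similarity remains well defined."""
    if dim > vec.shape[-1]:
        raise ValueError("target dim exceeds embedding size")
    v = vec[..., :dim]
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=768)  # stand-in for a 768-dim model embedding
for d in (768, 512, 256, 128, 64):
    small = truncate_embedding(full, d)
    print(d, small.shape, round(float(np.linalg.norm(small)), 6))  # unit norm at every size
```

Because Matryoshka training packs the most important information into the leading dimensions, the truncated vectors stay usable without any retraining.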
## Core Capabilities
- Document and query embedding for search applications
- Semantic clustering and duplicate detection
- Text classification
- Cross-modal alignment with vision embeddings
- Efficient resource usage through dimensional flexibility
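As a sketch of the search capability, the snippet below ranks documents by cosine similarity to a query embedding. Random vectors stand in for real embeddings, which would come from the model using the `search_document`/`search_query` prefixes; the `top_k` helper name is an assumption for illustration.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    order = np.argsort(-scores)[:k]      # indices of the k best matches
    return [(int(i), float(scores[i])) for i in order]

rng = np.random.default_rng(1)
docs = rng.normal(size=(5, 64))              # stand-in 64-dim document embeddings
query = docs[3] + 0.05 * rng.normal(size=64) # query vector close to document 3
print(top_k(query, docs))
```

The same ranking code works at any Matryoshka dimension, which is what makes the reduced sizes attractive for memory-constrained vector stores.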
## Frequently Asked Questions
### Q: What makes this model unique?
The model's Matryoshka architecture allows users to dynamically adjust embedding dimensions without retraining, making it highly flexible for different deployment scenarios while maintaining strong performance. Additionally, its long context support and task-specific instruction prefixes enhance its versatility.
### Q: What are the recommended use cases?
The model excels in production deployments for RAG applications, semantic search, document clustering, and classification tasks. Its dimensional flexibility makes it particularly suitable for scenarios with varying resource constraints, since embedding size can be reduced to trade a small amount of accuracy for lower memory use and faster search.