GIST-Embedding-v0
| Property | Value |
|---|---|
| Model Size | 109M parameters |
| Base Model | BAAI/bge-base-en-v1.5 |
| License | MIT |
| Paper | GISTEmbed Paper |
| Training Data | MEDI dataset + MTEB Classification |
What is GIST-Embedding-v0?
GIST-Embedding-v0 is a specialized text embedding model developed using a novel approach called Guided In-sample Selection of Training Negatives (GIST). Built on top of the BGE-base-en-v1.5 architecture, this model has been fine-tuned using a combination of the MEDI dataset and carefully selected triplets from MTEB Classification training data. A key advantage is its ability to generate high-quality embeddings without requiring specific instructions or prompts.
Implementation Details
The model was trained for 80 epochs with a warmup ratio of 0.1 and a learning rate of 5e-6. Training uses a contrastive loss with a temperature of 0.01 and a batch size of 32; the released checkpoint corresponds to training step 103,500.
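To make the role of the temperature concrete, here is an in-batch contrastive (InfoNCE-style) loss sketch in plain numpy. It is illustrative only: the actual GISTEmbed objective additionally applies guided negative selection with a guide model, which is omitted here.

```python
import numpy as np

def info_nce_loss(queries, positives, temperature=0.01):
    """In-batch contrastive (InfoNCE-style) loss with a temperature.

    Each query's positive is the same-index row of `positives`; all
    other rows in the batch act as negatives. Guided negative
    selection (the GIST part) is intentionally omitted.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature              # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))                       # matched pair on the diagonal
    return -log_probs[idx, idx].mean()

# Perfectly aligned pairs in orthogonal directions give near-zero loss.
loss = info_nce_loss(np.eye(4), np.eye(4))
print(loss)
```

A small temperature such as 0.01 sharpens the softmax, so even modest similarity gaps between the positive and the in-batch negatives produce strong gradients.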
Key Features
- No instruction requirement for embedding generation
- Built on proven BERT architecture
- Optimized for semantic search and similarity tasks
- Trained on diverse classification datasets
Core Capabilities
- Text similarity computation
- Semantic search implementation
- Document classification
- Cross-lingual text matching
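Most of these capabilities reduce to ranking by cosine similarity in embedding space. A model-agnostic sketch of semantic search; the toy 2-dimensional vectors below stand in for real GIST-Embedding-v0 outputs, which would be 768-dimensional:

```python
import numpy as np

def semantic_search(query_emb, doc_embs, top_k=3):
    """Rank documents by cosine similarity to a query embedding.

    query_emb: (d,) vector; doc_embs: (n, d) matrix.
    Returns (index, score) pairs sorted by descending similarity.
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in top]

# Toy stand-in vectors (not real model outputs).
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])
print(semantic_search(query, docs, top_k=2))
```

For large corpora the same scoring is typically delegated to an approximate nearest-neighbor index rather than a dense matrix product.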
Frequently Asked Questions
Q: What makes this model unique?
The model's unique feature is its ability to generate high-quality embeddings without requiring instructions, while utilizing a novel guided negative selection approach during training. This makes it particularly efficient for production deployments.
Q: What are the recommended use cases?
The model excels in semantic search, document similarity matching, and classification tasks. It's particularly well-suited for applications requiring efficient text embedding without the overhead of instruction engineering.
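For the classification use case, a common lightweight pattern is nearest-centroid classification over embeddings, which needs no fine-tuning. A hedged sketch with toy vectors standing in for real GIST-Embedding-v0 outputs:

```python
import numpy as np

def centroid_classify(doc_emb, labeled_embs):
    """Assign a document to the class whose centroid is most similar.

    labeled_embs maps label -> (n, d) array of example embeddings.
    Toy sketch: in practice the vectors would come from the model.
    """
    doc = doc_emb / np.linalg.norm(doc_emb)
    best_label, best_score = None, -np.inf
    for label, embs in labeled_embs.items():
        centroid = embs.mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        score = float(centroid @ doc)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy 2-D stand-ins for per-class example embeddings.
examples = {
    "sports": np.array([[1.0, 0.1], [0.9, 0.0]]),
    "finance": np.array([[0.0, 1.0], [0.1, 0.9]]),
}
print(centroid_classify(np.array([0.95, 0.05]), examples))  # -> "sports"
```

With more labeled data per class, a k-nearest-neighbor vote or a linear probe trained on the frozen embeddings are natural next steps.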