GIST-Embedding-v0

Property	Value
Model Size	109M parameters
Base Model	BAAI/bge-base-en-v1.5
License	MIT
Paper	GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
Training Data	MEDI dataset + MTEB Classification

What is GIST-Embedding-v0?

GIST-Embedding-v0 is a text embedding model trained with Guided In-sample Selection of Training Negatives (GIST), a method for choosing which in-batch examples are used as negatives during contrastive fine-tuning. It is fine-tuned from BAAI/bge-base-en-v1.5 on the MEDI dataset combined with selected triplets from MTEB Classification training data. A key advantage is that it produces embeddings from raw text, with no task-specific instruction or prompt prefix required.
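A minimal sketch of instruction-free encoding follows. It assumes the model is available on the Hugging Face Hub under the identifier `avsolatorio/GIST-Embedding-v0` and that the `sentence-transformers` package is installed; neither detail is stated on this page, so treat both as assumptions. The `embed` and `cosine_similarity` helpers are hypothetical names introduced for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(texts, model_id="avsolatorio/GIST-Embedding-v0"):
    """Encode raw texts; no instruction or prompt prefix is prepended.

    model_id is an assumed Hub identifier, not taken from this page.
    """
    # Imported lazily so cosine_similarity works without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    # encode() returns a NumPy array of shape (len(texts), dim).
    return model.encode(texts).tolist()
```

Calling `embed(["How do I reset my password?", "Steps to recover a forgotten password"])` and passing the two resulting vectors to `cosine_similarity` would give a similarity score for the pair. The first call downloads the model weights, so it needs network access.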

Implementation Details

The model was trained for 80 epochs with a learning rate of 5e-6, a warmup ratio of 0.1, and a batch size of 32, using a contrastive loss temperature of 0.01. The reported checkpoint step is 103,500.
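To show how the temperature and the guided negative selection interact, here is a simplified sketch of a contrastive loss with guide-filtered in-batch negatives. It is not the authors' implementation: it assumes a separate frozen guide model supplies its own embeddings of the same batch, it filters only query-to-positive candidates, and the function and argument names are hypothetical.

```python
import math

TEMPERATURE = 0.01  # contrastive loss temperature reported above


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def guided_contrastive_loss(queries, positives, guide_queries, guide_positives,
                            temperature=TEMPERATURE):
    """Mean InfoNCE-style loss with guide-filtered in-batch negatives.

    All vectors are assumed L2-normalized, so dot products are cosine
    similarities. queries/positives come from the model being trained;
    guide_* come from the (assumed) frozen guide model.
    """
    n = len(queries)
    total = 0.0
    for i in range(n):
        guide_pos_sim = dot(guide_queries[i], guide_positives[i])
        pos_logit = dot(queries[i], positives[i]) / temperature
        logits = [pos_logit]
        for j in range(n):
            if j == i:
                continue
            if dot(guide_queries[i], guide_positives[j]) > guide_pos_sim:
                # The guide rates this candidate as closer to the query than
                # the labelled positive, so it is likely a false negative.
                continue
            logits.append(dot(queries[i], positives[j]) / temperature)
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_denominator - pos_logit
    return total / n
```

With a temperature as low as 0.01, similarity differences are scaled by a factor of 100 before the softmax, so a single hard negative that the guide does not filter out can dominate the loss.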

No instruction requirement for embedding generation
Built on the BERT architecture, inherited from bge-base-en-v1.5
Optimized for semantic search and similarity tasks
Trained on MEDI plus triplets selected from MTEB Classification training data

Core Capabilities

Text similarity computation
Semantic search implementation
Document classification
English text matching (the base model, bge-base-en-v1.5, is English-only, so cross-lingual matching should not be assumed)
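The semantic search capability listed above reduces to ranking documents by cosine similarity to a query embedding. The sketch below assumes the embeddings have already been computed and are held as plain Python lists; `top_k` is a hypothetical helper, and the three-dimensional vectors in the usage note are toy values, not real model output.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_index, cosine_score) pairs for the k best matches."""
    q = normalize(query_vec)
    scored = []
    for idx, doc in enumerate(doc_vecs):
        d = normalize(doc)
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, `top_k([1.0, 0.0, 0.0], [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]], k=2)` ranks document 0 first and document 1 second. For large collections, a vector index would normally replace this linear scan.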

Frequently Asked Questions

Q: What makes this model unique?

The model combines two things: it produces embeddings without requiring instructions, and it was trained with guided selection of in-batch negatives. Because no prompt has to be designed or prepended, queries and documents can be encoded with the same call, which simplifies production deployments.

Q: What are the recommended use cases?

The model excels in semantic search, document similarity matching, and classification tasks. It's particularly well-suited for applications requiring efficient text embedding without the overhead of instruction engineering.