rubert-mini-frida
| Property | Value |
|---|---|
| Model Type | Sentence Embedding Model |
| Architecture | Distilled BERT-based |
| Embedding Dimension | 312 |
| Number of Layers | 7 |
| Context Length | 512 tokens (recommended), 2048 max |
| Author | sergeyzh |
| Model URL | HuggingFace |
What is rubert-mini-frida?
rubert-mini-frida is a lightweight, efficient sentence embedding model created through knowledge distillation from the larger FRIDA model. It generates embeddings for both Russian and English text while retaining competitive quality at a fraction of the computational cost: the distillation shrinks FRIDA's 1536-dimensional embeddings and 24 layers down to 312 dimensions and 7 layers, while preserving most of the original model's capabilities.
Implementation Details
The model implements mean pooling instead of the original CLS pooling and supports various prefix modes inherited from FRIDA. It maintains the same context window as its parent model, with a recommended size of 512 tokens.
- Supports multiple task-specific prefixes (e.g., "categorize:", "search_query:", "paraphrase:")
- Achieves strong performance across various NLP tasks including STS (0.803 Pearson correlation)
- Optimized for both Russian and English text processing
- Compatible with both transformers and sentence-transformers libraries
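The mean pooling and task-prefix mechanics described above can be sketched in plain NumPy. This is an illustration of the general technique, not the model's actual code; the `"search_query: "` prefix string follows the prefix names listed above, and the exact whitespace convention is an assumption:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average per-token vectors, ignoring padding positions.

    token_embeddings: (seq_len, dim) array of per-token vectors.
    attention_mask:   (seq_len,) array of 1s (real tokens) and 0s (padding).
    """
    mask = attention_mask.astype(float)[:, None]      # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)    # sum over real tokens only
    count = max(mask.sum(), 1e-9)                     # guard against all-padding input
    return summed / count

def with_prefix(text: str, prefix: str = "search_query: ") -> str:
    # Task prefixes are prepended to the raw text before tokenization
    # (prefix spelling here is an assumption for illustration).
    return prefix + text

# Toy demonstration: two real tokens plus one padding row.
emb = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]])
mask = np.array([1, 1, 0])
pooled = mean_pool(emb, mask)   # → [2.0, 3.0]
```

Because the padding row is masked out, the pooled vector is the mean of only the first two rows.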
Core Capabilities
- Semantic Text Similarity (STS)
- Paraphrase Identification (PI)
- Natural Language Inference (NLI)
- Sentiment Analysis (SA)
- Toxicity Identification (TI)
- Cross-lingual embedding generation
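Similarity-based tasks such as STS and paraphrase identification typically compare sentence embeddings with cosine similarity. A minimal helper (a generic illustration, not part of the model's API):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors; 0.0 for zero vectors.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# Toy vectors standing in for two sentence embeddings.
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([1.0, 0.0, 0.0])
score = cosine_similarity(v1, v2)   # ≈ 0.707
```

In practice the two vectors would be 312-dimensional embeddings produced by the model, and a higher score indicates greater semantic similarity.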
Frequently Asked Questions
Q: What makes this model unique?
A: The model offers a strong balance between performance and efficiency, achieving 91% of FRIDA's performance while being significantly smaller and faster. It is particularly notable for maintaining strong results in both Russian and English.
Q: What are the recommended use cases?
A: The model excels at text similarity, document classification, and information retrieval. It is particularly well-suited to applications that need efficient processing of both Russian and English content, especially in resource-constrained environments.