# ModernBERT-large
| Property | Value |
|---|---|
| Parameter Count | 395 million |
| Architecture | 28-layer Transformer encoder |
| Context Length | 8,192 tokens |
| Training Data | 2 trillion tokens (English + code) |
| License | Apache 2.0 |
| Paper | arXiv:2412.13663 |
## What is ModernBERT-large?
ModernBERT-large is a state-of-the-art bidirectional encoder that modernizes the original BERT architecture. It incorporates Rotary Positional Embeddings (RoPE) and alternating local-global attention, which allow it to process long sequences of up to 8,192 tokens efficiently.
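A minimal usage sketch with the Hugging Face `transformers` library is shown below. It assumes a recent `transformers` release that includes ModernBERT support and uses the `answerdotai/ModernBERT-large` checkpoint ID from the Hub.

```python
# Minimal masked-language-modeling sketch (assumes transformers >= 4.48,
# which added ModernBERT support, and the answerdotai/ModernBERT-large Hub ID).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-large")

# ModernBERT keeps the BERT-style [MASK] token for masked-token prediction.
for candidate in fill_mask("The capital of France is [MASK].", top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```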
## Implementation Details
The model uses a pre-norm Transformer architecture with GeGLU activations and was trained with the StableAdamW optimizer under a trapezoidal learning-rate schedule. At inference time it relies on Flash Attention and unpadding for efficiency; a loading sketch follows the list below.
- 28 Transformer layers incorporating the architectural improvements above
- Native support for sequences of up to 8,192 tokens
- Trained on both natural-language text and code for versatile applications
- Alternates local and global attention layers for efficient long-sequence processing
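The long-context path is exposed through the standard `transformers` API. The sketch below is a hedged example of encoding a long document in a single forward pass; the `flash_attention_2` option assumes the optional `flash-attn` package is installed and can simply be omitted otherwise.

```python
# Hedged sketch: encode a long document in one forward pass.
# Assumes a transformers version with ModernBERT support; the
# attn_implementation argument requires the optional flash-attn package.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

long_document = "..."  # up to 8,192 tokens of text or code
inputs = tokenizer(long_document, return_tensors="pt",
                   truncation=True, max_length=8192)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 1024)
```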
## Core Capabilities
- Achieves 90.4 on the GLUE benchmark, second only to DeBERTa-v3-large
- Superior performance on code retrieval tasks (59.5 on CodeSearchNet)
- Excellent results in long-context retrieval (80.4 on MLDR_OOD)
- Efficient processing of long documents for classification and semantic search (see the embedding sketch below)
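As an illustration of the semantic-search use case listed above, the sketch below mean-pools the encoder's hidden states into document embeddings. This is an assumption-heavy example: in practice the backbone is normally fine-tuned for retrieval (e.g. with Sentence Transformers) before being used this way.

```python
# Hedged sketch: mean-pooled document embeddings for semantic search.
# Uses the raw (not retrieval-fine-tuned) encoder purely for illustration.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=8192, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, seq, 1024)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # ignore padding
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens
    return F.normalize(pooled, dim=-1)

docs = embed(["ModernBERT natively handles 8,192-token documents.",
              "The original BERT is limited to 512 tokens."])
query = embed(["Which encoder supports long context?"])
print(query @ docs.T)  # cosine similarities against each document
```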
## Frequently Asked Questions
Q: What makes this model unique?
ModernBERT-large combines recent architectural innovations with extensive pretraining on diverse data sources, resulting in superior performance across various tasks while maintaining efficient processing of long sequences. Its integration of RoPE and Flash Attention makes it particularly well-suited for modern applications requiring long-context understanding.
Q: What are the recommended use cases?
The model excels in tasks requiring long document processing, including document retrieval, classification, and semantic search. It's particularly effective for hybrid applications involving both code and text, making it ideal for technical documentation search, code retrieval, and general natural language understanding tasks.
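For the document-classification use case, a classification head can be attached through the standard `AutoModelForSequenceClassification` interface. The sketch below is only a skeleton with placeholder labels and input; the randomly initialized head must be fine-tuned (e.g. with the `transformers` Trainer) on labelled data before its predictions are meaningful.

```python
# Hedged sketch: attaching a document-classification head to ModernBERT-large.
# num_labels and the example input are placeholders; the untrained head
# produces meaningless logits until it is fine-tuned on labelled data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "answerdotai/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("def binary_search(arr, target): ...",
                   return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```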