# ModernBERT-large
| Property | Value |
|---|---|
| Parameter Count | 395 million |
| Architecture | 28-layer Transformer encoder |
| Context Length | 8,192 tokens |
| Training Data | 2 trillion tokens (English + code) |
| License | Apache 2.0 |
| Paper | arXiv:2412.13663 |
## What is ModernBERT-large?
ModernBERT-large is a state-of-the-art bidirectional encoder that modernizes the original BERT architecture. It incorporates Rotary Positional Embeddings (RoPE) and alternating local-global attention, which allow it to process long sequences of up to 8,192 tokens efficiently.
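A minimal usage sketch with the Hugging Face `transformers` library is shown below. It assumes a recent `transformers` release that includes ModernBERT support and uses the `answerdotai/ModernBERT-large` checkpoint ID from the Hub.

```python
# Minimal masked-language-modeling sketch (assumes transformers >= 4.48,
# which added ModernBERT support, and the answerdotai/ModernBERT-large Hub ID).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-large")

# ModernBERT keeps the BERT-style [MASK] token for masked-token prediction.
for candidate in fill_mask("The capital of France is [MASK].", top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```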
## Implementation Details
The model uses a pre-norm Transformer architecture with GeGLU activations and was trained with the StableAdamW optimizer under a trapezoidal learning-rate schedule. At inference time it relies on Flash Attention and unpadding for efficiency; a loading sketch follows the list below.
- 28 Transformer layers incorporating the architectural improvements above
- Native support for sequences of up to 8,192 tokens
- Trained on both natural-language text and code for versatile applications
- Alternates local and global attention layers for efficient long-sequence processing
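The long-context path is exposed through the standard `transformers` API. The sketch below is a hedged example of encoding a long document in a single forward pass; the `flash_attention_2` option assumes the optional `flash-attn` package is installed and can simply be omitted otherwise.

```python
# Hedged sketch: encode a long document in one forward pass.
# Assumes a transformers version with ModernBERT support; the
# attn_implementation argument requires the optional flash-attn package.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

long_document = "..."  # up to 8,192 tokens of text or code
inputs = tokenizer(long_document, return_tensors="pt",
                   truncation=True, max_length=8192)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 1024)
```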
## Core Capabilities
- Achieves 90.4 on the GLUE benchmark, second only to DeBERTa-v3-large
- Superior performance on code retrieval tasks (59.5 on CodeSearchNet)
- Excellent results in long-context retrieval (80.4 on MLDR_OOD)
- Efficient processing of long documents for classification and semantic search (see the embedding sketch below)
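As an illustration of the semantic-search use case listed above, the sketch below mean-pools the encoder's hidden states into document embeddings. This is an assumption-heavy example: in practice the backbone is normally fine-tuned for retrieval (e.g. with Sentence Transformers) before being used this way.

```python
# Hedged sketch: mean-pooled document embeddings for semantic search.
# Uses the raw (not retrieval-fine-tuned) encoder purely for illustration.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=8192, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, seq, 1024)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # ignore padding
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens
    return F.normalize(pooled, dim=-1)

docs = embed(["ModernBERT natively handles 8,192-token documents.",
              "The original BERT is limited to 512 tokens."])
query = embed(["Which encoder supports long context?"])
print(query @ docs.T)  # cosine similarities against each document
```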
## Frequently Asked Questions
Q: What makes this model unique?
ModernBERT-large combines recent architectural innovations with extensive pretraining on diverse data sources, resulting in superior performance across various tasks while maintaining efficient processing of long sequences. Its integration of RoPE and Flash Attention makes it particularly well-suited for modern applications requiring long-context understanding.
Q: What are the recommended use cases?
The model excels in tasks requiring long document processing, including document retrieval, classification, and semantic search. It's particularly effective for hybrid applications involving both code and text, making it ideal for technical documentation search, code retrieval, and general natural language understanding tasks.
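For the document-classification use case, a classification head can be attached through the standard `AutoModelForSequenceClassification` interface. The sketch below is only a skeleton with placeholder labels and input; the randomly initialized head must be fine-tuned (e.g. with the `transformers` Trainer) on labelled data before its predictions are meaningful.

```python
# Hedged sketch: attaching a document-classification head to ModernBERT-large.
# num_labels and the example input are placeholders; the untrained head
# produces meaningless logits until it is fine-tuned on labelled data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "answerdotai/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("def binary_search(arr, target): ...",
                   return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```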