Yarn-Mistral-7b-128k
| Property | Value |
|---|---|
| Base Model | Mistral-7B-v0.1 |
| Context Window | 128,000 tokens |
| License | Apache 2.0 |
| Paper | arXiv:2309.00071 |
What is Yarn-Mistral-7b-128k?
Yarn-Mistral-7b-128k is a long-context language model that extends Mistral-7B-v0.1 to handle significantly longer inputs. It was further pretrained on long-context data for 1,500 steps using the YaRN context-window extension method, enabling it to process up to 128,000 tokens while maintaining strong performance.
Implementation Details
The model has a few specific implementation requirements: it should be run with Flash Attention 2 and bfloat16 precision, must be loaded with trust_remote_code=True, and requires the latest version of the transformers library (see the loading sketch after the list below).
- Supports 128k token context window
- Built on Mistral-7B architecture
- Utilizes Flash Attention 2 technology
- Requires latest transformers library
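The snippet below is a minimal loading sketch under those constraints. It assumes the Hugging Face repo id NousResearch/Yarn-Mistral-7b-128k, a GPU that supports bfloat16, and an installed flash-attn package; on older transformers releases, the flag use_flash_attention_2=True takes the place of attn_implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repo id for this model.
model_id = "NousResearch/Yarn-Mistral-7b-128k"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# trust_remote_code=True is needed because the YaRN rotary-embedding code
# ships alongside the checkpoint rather than inside transformers itself.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # recommended precision
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
    trust_remote_code=True,
)
```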
Core Capabilities
- Exceptional long-context performance, with a perplexity (PPL) of 2.19 at 128k context (a rough sketch of this kind of sliding-window measurement follows this list)
- Maintains strong performance on standard benchmarks (ARC-c: 58.87, Hellaswag: 80.58)
- Minimal degradation in short-context tasks compared to base Mistral-7B
- Optimized for both long and short-context applications
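Long-context perplexity figures like the one above are typically computed with a sliding-window evaluation. The function below is a generic sketch of that procedure, following the common transformers recipe; it is not the exact YaRN evaluation harness, and the window and stride sizes are illustrative.

```python
import torch

def sliding_window_ppl(model, tokenizer, text, window=8192, stride=4096):
    """Approximate perplexity of `text` using a sliding window."""
    enc = tokenizer(text, return_tensors="pt")
    seq_len = enc.input_ids.size(1)
    nll_sum, n_tokens, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        trg_len = end - prev_end              # only score tokens not seen before
        input_ids = enc.input_ids[:, begin:end].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100       # mask out the overlapping prefix
        with torch.no_grad():
            loss = model(input_ids, labels=target_ids).loss
        nll_sum += loss.item() * trg_len
        n_tokens += trg_len
        prev_end = end
        if end == seq_len:
            break
    return float(torch.exp(torch.tensor(nll_sum / n_tokens)))
```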
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its ability to handle extremely long contexts (128k tokens) while maintaining performance comparable to the original Mistral-7B model. It shows impressive perplexity scores across various context lengths and minimal degradation in standard benchmark tasks.
Q: What are the recommended use cases?
The model is particularly well-suited for applications requiring long-context understanding, such as document analysis, extended conversations, and complex text processing tasks. It maintains strong performance in both long and short-context scenarios, making it versatile for various applications.
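As a rough usage sketch (reusing the model and tokenizer from the loading example above, and a hypothetical input file), a long document can be passed to the model in a single pass. Note that this is a base model rather than an instruction-tuned one, so the prompt style here is purely illustrative:

```python
# Reuses `model` and `tokenizer` from the loading sketch above.
with open("long_report.txt") as f:            # hypothetical long input document
    document = f.read()

prompt = f"{document}\n\nSummary of the document above:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# With a 128k-token window, even book-length inputs fit in one pass,
# provided the GPU has enough memory for the corresponding KV cache.
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```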