Llama-3-70B-Instruct-Gradient-262k

Maintained By
gradientai


| Property | Value |
|---|---|
| Parameter Count | 70.6B |
| Context Length | 262k tokens |
| License | Llama 3 |
| Training Tokens | 105M tokens (extension training) |

What is Llama-3-70B-Instruct-Gradient-262k?

This model is an enhanced version of Meta's Llama-3 70B Instruct that extends the context window from 8k to over 262k tokens. Developed by Gradient AI on Crusoe Energy's computing infrastructure, it demonstrates that large language models can be adapted to much longer contexts with minimal additional training.

Implementation Details

The model employs NTK-aware interpolation, following published scaling laws to set the RoPE theta parameter for each target context length. Training proceeded in progressive stages of increasing context length, using the EasyContext Blockwise RingAttention library for efficient long-context training.

  • Base model: Meta-Llama-3-70B-Instruct
  • Training data: Augmented SlimPajama and UltraChat datasets
  • Training infrastructure: 512 NVIDIA L40S GPUs
  • Progressive training stages: 65K → 262K context lengths
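The NTK-aware approach adjusts RoPE's base frequency (theta) rather than compressing position indices. A minimal sketch, assuming the commonly cited scaling law theta' = theta · s^(d/(d−2)) where s is the context-extension factor and d the attention head dimension (the exact constants Gradient AI used are not stated here):

```python
def ntk_scaled_theta(base_theta: float, scale: float, head_dim: int) -> float:
    """NTK-aware RoPE theta adjustment: theta' = theta * s**(d / (d - 2)).

    base_theta: the base model's RoPE theta
    scale:      target_context / original_context
    head_dim:   per-head dimension of the rotary embedding
    """
    return base_theta * scale ** (head_dim / (head_dim - 2))

# Illustrative values (assumptions): Llama-3's base theta of 500,000,
# head dimension 128, and an 8k -> 262k extension (32x).
scale = 262_144 / 8_192  # 32.0
print(ntk_scaled_theta(500_000.0, scale, 128))
```

Larger theta slows the rotation of each RoPE frequency, so positions far beyond the original training range still map to angles the model has effectively seen, which is why this works with so little extra training.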

Core Capabilities

  • Extended context processing up to 262K tokens
  • Maintains base model's instruction-following abilities
  • Efficient processing of long-form content
  • Improved memory retention across long contexts

Frequently Asked Questions

Q: What makes this model unique?

This model achieves dramatic context extension (32x) with minimal additional training (< 0.002% of original pretraining data), demonstrating efficient adaptation of large language models to longer contexts.
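The two headline numbers can be checked with quick arithmetic, assuming the base 8k context and Meta's reported ~15T-token pretraining corpus for Llama 3 (the 15T figure is an assumption not stated in this card):

```python
# Context extension factor: 8k -> 262k
base_ctx, extended_ctx = 8_192, 262_144
print(extended_ctx / base_ctx)  # 32.0, i.e. the 32x extension

# Extension training as a fraction of assumed ~15T pretraining tokens
extension_tokens, pretrain_tokens = 105e6, 15e12
print(f"{extension_tokens / pretrain_tokens:.6%}")  # well under 0.002%
```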

Q: What are the recommended use cases?

The model is ideal for tasks requiring long-context understanding such as document analysis, extended conversations, and complex multi-step reasoning across long texts.
