# Phi-3-small-128k-instruct
| Property | Value |
|---|---|
| Parameter Count | 7.39B |
| Context Length | 128,000 tokens |
| License | MIT |
| Author | Microsoft |
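For capacity planning, a back-of-envelope sketch of the memory needed just to hold the 7.39B weights at common precisions. This is my own arithmetic, not a figure from the model card, and it excludes activations and the KV cache (which grows with context length):

```python
# Rough weight-only memory estimate for a 7.39B-parameter model.
# Activations and the KV cache add more on top of this.

PARAMS = 7.39e9  # parameter count from the table above

def weight_gb(bytes_per_param: float) -> float:
    """Gigabytes required to store the weights at a given precision."""
    return PARAMS * bytes_per_param / 1e9

for name, bpp in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name:>9}: ~{weight_gb(bpp):.1f} GB")
```

At fp16 the weights alone come to roughly 15 GB, which is why the A100/A6000/H100 class of GPUs listed below is a natural fit.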
## What is Phi-3-small-128k-instruct?
Phi-3-small-128k-instruct is a lightweight, state-of-the-art language model from Microsoft, designed to deliver strong reasoning performance while remaining efficient to run. The 7.39B-parameter model supports a 128K-token context window and was post-trained with supervised fine-tuning and direct preference optimization to improve instruction following and safety.
## Implementation Details
The model utilizes a dense decoder-only Transformer architecture with alternating dense and blocksparse attention mechanisms. It was trained on 4.8T tokens including high-quality educational data, synthetic textbook-like content, and carefully filtered public documents.
- Supports multiple languages with 10% multilingual training data
- Optimized for memory/compute constrained environments
- Implements Flash Attention 2 and Triton blocksparse attention
- Compatible with NVIDIA A100, A6000, and H100 GPUs
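The alternating dense/blocksparse attention mentioned above can be illustrated with a toy mask builder. The block size, window width, and exact sparsity pattern here are illustrative assumptions for a single blocksparse layer, not the model's actual Triton kernel configuration:

```python
def blocksparse_causal_mask(seq_len: int, block: int, local_blocks: int):
    """Toy causal block-sparse mask: token i may attend to token j only
    if j <= i (causal) and j's block is within `local_blocks` blocks of
    i's block. Returns a seq_len x seq_len list of booleans."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        bi = i // block
        for j in range(i + 1):          # causal constraint: j <= i
            bj = j // block
            if bi - bj < local_blocks:  # local block window
                mask[i][j] = True
    return mask

m = blocksparse_causal_mask(seq_len=8, block=2, local_blocks=2)
```

A dense causal layer is the special case where every earlier block is visible; restricting attention to nearby blocks in alternating layers is what keeps compute and memory tractable at 128K tokens.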
## Core Capabilities
- Strong performance in reasoning tasks, especially code, math, and logic
- Extended context handling up to 128K tokens
- Competitive benchmark accuracy, outperforming some larger models on specific tasks
- Efficient processing in latency-bound scenarios
- Robust safety measures through preference optimization
## Frequently Asked Questions
### Q: What makes this model unique?
The model stands out for its ability to match or exceed the performance of larger models while maintaining a relatively small 7B parameter size. It particularly excels in reasoning tasks and offers an exceptional 128K token context window, making it suitable for processing lengthy documents and complex problems.
### Q: What are the recommended use cases?
The model is ideal for applications requiring strong reasoning capabilities, particularly in code generation, mathematical problem-solving, and logical reasoning. It's especially suitable for deployment in resource-constrained environments or applications requiring low-latency responses.
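When prompting an instruct-tuned model like this one, the chat markup matters. The sketch below assumes the Phi-3-style `<|user|>` / `<|assistant|>` / `<|end|>` tokens; in real use, prefer the tokenizer's `apply_chat_template`, which is authoritative for this model:

```python
def build_phi3_prompt(messages):
    """Format chat messages with Phi-3-style markup (an assumption here;
    the tokenizer's apply_chat_template is the authoritative source)."""
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}<|end|>\n")
    parts.append("<|assistant|>\n")  # cue the model to generate its reply
    return "".join(parts)

prompt = build_phi3_prompt([{"role": "user", "content": "Add 2 and 3."}])
print(prompt)
```

The trailing `<|assistant|>\n` leaves the model positioned to produce the answer rather than continue the user's turn.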