Phi-3-mini-128k-instruct

Maintained By
microsoft


Parameter Count: 3.8B
Context Length: 128,000 tokens
License: MIT
Architecture: Dense decoder-only Transformer
Training Data: 4.9T tokens

What is Phi-3-mini-128k-instruct?

Phi-3-mini-128k-instruct is a compact yet capable language model from Microsoft, designed to deliver strong performance in an efficient package. The 3.8B-parameter model supports a 128,000-token context window, making it suitable for processing lengthy documents while maintaining strong performance on reasoning tasks.

Implementation Details

The model uses a dense decoder-only Transformer architecture and has been fine-tuned with both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). It was trained on a diverse dataset of 4.9T tokens, including high-quality educational content, synthetic data, and carefully filtered public documents.

  • Optimized for memory and compute-constrained environments
  • Supports Flash Attention 2 for improved performance
  • Includes extensive safety measures and preference alignment
  • Compatible with multiple platforms through ONNX runtime
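A minimal loading sketch using the Hugging Face transformers API illustrates the points above. The model id is the one Microsoft publishes; the Flash Attention 2 flag is an assumption that a supported GPU and the flash-attn package are available (without them, an "eager" attention implementation can be used instead):

```python
# Loading sketch for Phi-3-mini-128k-instruct (assumes transformers is installed).
model_id = "microsoft/Phi-3-mini-128k-instruct"

load_kwargs = {
    "torch_dtype": "auto",                        # let transformers pick bf16/fp16
    "device_map": "auto",                         # place layers on available devices
    "attn_implementation": "flash_attention_2",   # swap to "eager" without a supported GPU
    "trust_remote_code": True,
}

# The actual load downloads several GB of weights, so it is left commented here:
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
```

The same checkpoint is also published in ONNX form for deployment outside the PyTorch stack.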

Core Capabilities

  • Strong performance in reasoning tasks, particularly in code, math, and logic
  • Competitive benchmark scores against larger models (69.7 on MMLU, 85.3 on GSM8K)
  • Extended context handling for long document processing
  • Multi-turn conversation support with chat format
  • Cross-platform deployment options
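The chat format mentioned above can be sketched as a small helper. The special markers below (`<|user|>`, `<|assistant|>`, `<|end|>`) follow the format shown in the model card; in practice, prefer the tokenizer's own `apply_chat_template`, which this function only approximates:

```python
def build_phi3_prompt(messages):
    """Flatten a multi-turn chat into Phi-3's prompt format.

    `messages` is a list of {"role": ..., "content": ...} dicts, the same
    shape accepted by tokenizer.apply_chat_template in transformers.
    """
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}<|end|>\n")
    parts.append("<|assistant|>\n")  # cue the model to produce the next turn
    return "".join(parts)

chat = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "And doubled?"},
]
prompt = build_phi3_prompt(chat)
```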

Frequently Asked Questions

Q: What makes this model unique?

Despite its compact size of 3.8B parameters, Phi-3-mini-128k-instruct achieves performance comparable to much larger models, particularly on reasoning tasks. Its 128K-token context window and optimization for efficient deployment make it especially valuable for practical applications.

Q: What are the recommended use cases?

The model excels in scenarios requiring strong reasoning capabilities, including code generation, mathematical problem-solving, and logical analysis. It's particularly well-suited for memory-constrained environments and latency-sensitive applications. The extended context window makes it ideal for long document processing and summarization tasks.
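Before sending a long document, it helps to estimate whether it fits the context window. The sketch below uses a rough heuristic (an assumption of roughly 4 characters per token for English text; the real count comes from the tokenizer):

```python
CONTEXT_TOKENS = 128_000  # Phi-3-mini-128k-instruct context length

def fits_in_context(text, chars_per_token=4, reserve_for_output=1_000):
    """Rough check that a document plus an output budget fits the window.

    chars_per_token is a heuristic assumption; tokenize for an exact count.
    """
    est_tokens = len(text) / chars_per_token
    return est_tokens <= CONTEXT_TOKENS - reserve_for_output
```

For an exact answer, `len(tokenizer(text).input_ids)` with the model's own tokenizer replaces the heuristic.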
