Phi-3-vision-128k-instruct

Maintained By
microsoft

Phi-3-vision-128k-instruct

PropertyValue
Parameter Count4.15B
Context Length128K tokens
LicenseMIT
Training Hardware512 H100-80G GPUs
Training Duration1.5 days

What is Phi-3-vision-128k-instruct?

Phi-3-vision-128k-instruct is Microsoft's state-of-the-art multimodal AI model that combines vision and language capabilities in a lightweight package. Built on high-quality synthetic data and filtered public datasets, it excels at both text and vision tasks while maintaining efficient resource usage. The model features an impressive 128K token context length and has been fine-tuned through supervised learning and direct preference optimization.

Implementation Details

The model architecture integrates an image encoder, connector, projector, and the Phi-3 Mini language model. It processes both text and images using a specialized chat format and requires specific GPU hardware (like NVIDIA A100, A6000, or H100) due to its flash attention implementation. Training involved 500B vision and text tokens, with development occurring between February and April 2024.

  • Flash Attention 2 implementation for efficient processing
  • BF16 tensor type optimization
  • Comprehensive safety measures and instruction adherence
  • Built-in support for OCR and chart understanding

Core Capabilities

  • General image understanding and analysis
  • OCR and text extraction from images
  • Chart and table comprehension
  • Multilingual support (primary focus on English)
  • Extended context handling up to 128K tokens

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its efficient design that delivers high performance with relatively modest parameters (4.15B) while maintaining impressive benchmark scores across various vision-language tasks. It excels particularly in scientific and mathematical visual reasoning tasks.

Q: What are the recommended use cases?

The model is ideal for commercial and research applications requiring memory/compute constrained environments, latency-sensitive scenarios, general image understanding, OCR, and chart/table analysis. It's particularly well-suited for building generative AI features in production environments.

The first platform built for prompt engineering