# Florence-2-large
| Property | Value |
|---|---|
| Parameter Count | 0.77B |
| License | MIT |
| Paper | Link |
| Author | Microsoft |
| Model Type | Vision Foundation Model |
## What is Florence-2-large?
Florence-2-large is an advanced vision foundation model from Microsoft that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Trained on the FLD-5B dataset of 5.4 billion annotations across 126 million images, this 0.77B-parameter model performs strongly in both zero-shot and fine-tuned settings.
## Implementation Details
The model uses a sequence-to-sequence architecture and is implemented with Hugging Face's transformers library. It runs in float16 precision on GPU and supports a variety of vision tasks through simple prompt engineering.
- Supports multiple vision tasks through specialized prompts
- Trained on FLD-5B dataset with comprehensive annotations
- Implements efficient inference with float16 precision
- Achieves state-of-the-art results on multiple benchmarks
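As a concrete illustration of the points above, the following sketch loads the model in float16 and wraps a single prompt-driven inference call. It follows the usage pattern shown on the Hugging Face model card (`trust_remote_code=True` is required because Florence-2 ships custom model and processor code, which exposes `post_process_generation` on the processor); the helper name `run_florence_task` is my own.

```python
# Sketch of prompt-based inference with Florence-2-large, assuming the
# Hugging Face model card's API. run_florence_task is a hypothetical helper.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor


def run_florence_task(image, task_prompt, model, processor, device="cuda"):
    """Run one task prompt on a PIL image and return the parsed result."""
    inputs = processor(text=task_prompt, images=image,
                       return_tensors="pt").to(device, torch.float16)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    # Keep special tokens: the custom post-processor parses them into
    # task-specific output (boxes, labels, text regions, ...).
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        text, task=task_prompt, image_size=(image.width, image.height)
    )


if __name__ == "__main__":
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Florence-2-large",
        torch_dtype=torch.float16,
        trust_remote_code=True,
    ).to("cuda")
    processor = AutoProcessor.from_pretrained(
        "microsoft/Florence-2-large", trust_remote_code=True
    )
    # e.g. result = run_florence_task(pil_image, "<OD>", model, processor)
```

Switching tasks is then just a matter of changing the prompt string passed to `run_florence_task`; no per-task heads or separate checkpoints are involved.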
## Core Capabilities
- Image Captioning (Basic, Detailed, and More Detailed)
- Object Detection with confidence scores
- Dense Region Caption generation
- OCR and OCR with Region detection
- Caption to Phrase Grounding
- Region Proposal generation
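Each capability above is selected by a dedicated task token in the text prompt. The snippet below maps the list to the task tokens documented on the model card; `build_prompt` is a hypothetical helper, and the dictionary keys are labels of my own choosing.

```python
# Illustrative mapping from the capabilities above to Florence-2 task tokens
# (tokens as documented on the model card; key names are arbitrary).
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "region_proposal": "<REGION_PROPOSAL>",
}


def build_prompt(task, text_input=None):
    """Return the text prompt for a task; grounding tasks append extra text."""
    prompt = TASK_PROMPTS[task]
    if text_input is not None:
        prompt += text_input
    return prompt


print(build_prompt("object_detection"))                 # <OD>
print(build_prompt("phrase_grounding", "a green car"))  # <CAPTION_TO_PHRASE_GROUNDING>a green car
```

Note that most tasks take only the bare token, while caption-to-phrase grounding additionally appends the caption text to be grounded.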
## Frequently Asked Questions
**Q: What makes this model unique?**
Florence-2-large stands out for handling multiple vision tasks with a single model via simple prompts, achieving performance competitive with much larger models at only 0.77B parameters. It demonstrates strong zero-shot capabilities and can be fine-tuned for further gains.
**Q: What are the recommended use cases?**
The model excels at a range of vision tasks, including image captioning (135.6 CIDEr on COCO), object detection (37.5 mAP on COCO), and visual grounding (84.4% on Flickr30k). It is particularly well suited to applications that require multi-task vision processing.