# Florence-2-base
| Property | Value |
|---|---|
| Parameter Count | 0.23B |
| License | MIT |
| Paper | arXiv:2311.06242 |
| Architecture | Transformer-based Vision Foundation Model |
## What is Florence-2-base?
Florence-2-base is a compact yet powerful vision foundation model developed by Microsoft that uses a prompt-based approach to handle various vision and vision-language tasks. Trained on the massive FLD-5B dataset containing 5.4 billion annotations across 126 million images, it represents a significant advancement in multi-task visual understanding.
## Implementation Details
The model implements a sequence-to-sequence architecture suited to both zero-shot use and fine-tuning. It is built on PyTorch and the Transformers library, and supports float16 precision for efficient inference.
- Leverages prompt-based task specification
- Processes both image and text inputs
- Supports batch processing and GPU acceleration
- Implements beam search for generation tasks
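The prompt-based interface above can be sketched as follows. The task tokens and the `microsoft/Florence-2-base` checkpoint name come from the model's public documentation; the commented-out loading and generation code is a minimal illustrative configuration (it requires network access and a GPU), not the only way to run the model.

```python
# Sketch of Florence-2's prompt-based task specification.
# Each vision task is selected by a special task token in the text prompt;
# some tasks (e.g. phrase grounding, VQA) also take free-form text input.

TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR_WITH_REGION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
}

def build_prompt(task: str, text_input: str = "") -> str:
    """Map a task name to its Florence-2 task token, appending any
    free-form text (e.g. the phrase to ground)."""
    return TASK_PROMPTS[task] + text_input

# Loading and generation (illustrative; downloads the checkpoint):
#
# import torch
# from transformers import AutoModelForCausalLM, AutoProcessor
#
# model = AutoModelForCausalLM.from_pretrained(
#     "microsoft/Florence-2-base", torch_dtype=torch.float16,
#     trust_remote_code=True).to("cuda")
# processor = AutoProcessor.from_pretrained(
#     "microsoft/Florence-2-base", trust_remote_code=True)
#
# inputs = processor(text=build_prompt("caption"), images=image,
#                    return_tensors="pt").to("cuda", torch.float16)
# ids = model.generate(**inputs, max_new_tokens=1024, num_beams=3)
```

For grounding tasks, the text input follows the task token, e.g. `build_prompt("phrase_grounding", "a red car")`.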
## Core Capabilities
- Image Captioning (with multiple detail levels)
- Object Detection (mAP 34.7 on COCO val2017)
- Dense Region Captioning
- OCR with Region Detection
- Phrase Grounding
- Visual Question Answering
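For the region-based tasks above (detection, dense region captioning, OCR with regions, grounding), Florence-2 emits quantized location tokens of the form `<loc_k>` with `k` in 0–999, which are scaled to the image size. In practice the bundled processor's post-processing handles this; the decoder below is an illustrative re-implementation of that convention, assuming a simple `bin / 1000 * size` scaling.

```python
import re

def decode_loc_boxes(text: str, width: int, height: int) -> list:
    """Convert Florence-2-style <loc_k> tokens (k in 0..999, normalized
    coordinates) into pixel-space [x1, y1, x2, y2] boxes.
    Each group of four consecutive tokens forms one box."""
    bins = [int(m) for m in re.findall(r"<loc_(\d+)>", text)]
    boxes = []
    for i in range(0, len(bins) - 3, 4):
        x1, y1, x2, y2 = bins[i:i + 4]
        boxes.append([
            x1 / 1000 * width, y1 / 1000 * height,
            x2 / 1000 * width, y2 / 1000 * height,
        ])
    return boxes

# Example: one detected region on a 1000x500 image
print(decode_loc_boxes("car<loc_100><loc_200><loc_500><loc_400>", 1000, 500))
```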
## Frequently Asked Questions
**Q: What makes this model unique?**
Florence-2-base stands out for its ability to handle multiple vision tasks through simple prompts without task-specific fine-tuning, achieving strong zero-shot performance despite its relatively small size of 0.23B parameters.
**Q: What are the recommended use cases?**
The model excels in various computer vision tasks including image captioning (133.0 CIDEr on COCO), object detection, OCR, and visual grounding. It's particularly suitable for applications requiring multiple vision capabilities in a single model.