# Florence-2-large
| Property | Value |
|---|---|
| Parameter Count | 0.77B |
| License | MIT |
| Paper | Link |
| Author | Microsoft |
| Model Type | Vision Foundation Model |
## What is Florence-2-large?
Florence-2-large is an advanced vision foundation model from Microsoft that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Trained on the FLD-5B dataset of 5.4 billion annotations across 126 million images, this 0.77B-parameter model performs strongly in both zero-shot and fine-tuned settings.
## Implementation Details
The model uses a sequence-to-sequence architecture and is implemented with Hugging Face's transformers library. It runs in float16 precision on GPU and supports a variety of vision tasks through simple prompt engineering.
- Supports multiple vision tasks through specialized prompts
- Trained on FLD-5B dataset with comprehensive annotations
- Implements efficient inference with float16 precision
- Achieves state-of-the-art results on multiple benchmarks
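As a concrete illustration of the points above, the following sketch loads the model in float16 and wraps a single prompt-driven inference call. It follows the usage pattern shown on the Hugging Face model card (`trust_remote_code=True` is required because Florence-2 ships custom model and processor code, which exposes `post_process_generation` on the processor); the helper name `run_florence_task` is my own.

```python
# Sketch of prompt-based inference with Florence-2-large, assuming the
# Hugging Face model card's API. run_florence_task is a hypothetical helper.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor


def run_florence_task(image, task_prompt, model, processor, device="cuda"):
    """Run one task prompt on a PIL image and return the parsed result."""
    inputs = processor(text=task_prompt, images=image,
                       return_tensors="pt").to(device, torch.float16)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    # Keep special tokens: the custom post-processor parses them into
    # task-specific output (boxes, labels, text regions, ...).
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        text, task=task_prompt, image_size=(image.width, image.height)
    )


if __name__ == "__main__":
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Florence-2-large",
        torch_dtype=torch.float16,
        trust_remote_code=True,
    ).to("cuda")
    processor = AutoProcessor.from_pretrained(
        "microsoft/Florence-2-large", trust_remote_code=True
    )
    # e.g. result = run_florence_task(pil_image, "<OD>", model, processor)
```

Switching tasks is then just a matter of changing the prompt string passed to `run_florence_task`; no per-task heads or separate checkpoints are involved.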
## Core Capabilities
- Image Captioning (Basic, Detailed, and More Detailed)
- Object Detection with confidence scores
- Dense Region Caption generation
- OCR and OCR with Region detection
- Caption to Phrase Grounding
- Region Proposal generation
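Each capability above is selected by a dedicated task token in the text prompt. The snippet below maps the list to the task tokens documented on the model card; `build_prompt` is a hypothetical helper, and the dictionary keys are labels of my own choosing.

```python
# Illustrative mapping from the capabilities above to Florence-2 task tokens
# (tokens as documented on the model card; key names are arbitrary).
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "region_proposal": "<REGION_PROPOSAL>",
}


def build_prompt(task, text_input=None):
    """Return the text prompt for a task; grounding tasks append extra text."""
    prompt = TASK_PROMPTS[task]
    if text_input is not None:
        prompt += text_input
    return prompt


print(build_prompt("object_detection"))                 # <OD>
print(build_prompt("phrase_grounding", "a green car"))  # <CAPTION_TO_PHRASE_GROUNDING>a green car
```

Note that most tasks take only the bare token, while caption-to-phrase grounding additionally appends the caption text to be grounded.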
## Frequently Asked Questions
**Q: What makes this model unique?**
Florence-2-large stands out for handling multiple vision tasks with a single model via simple prompts, achieving performance competitive with much larger models at only 0.77B parameters. It demonstrates strong zero-shot capabilities and can be fine-tuned for further gains.
**Q: What are the recommended use cases?**
The model excels at a range of vision tasks, including image captioning (135.6 CIDEr on COCO), object detection (37.5 mAP on COCO), and visual grounding (84.4% on Flickr30k). It is particularly well suited to applications that require multi-task vision processing.