Florence-2-base

Maintained by: microsoft

  • Parameter Count: 0.23B
  • License: MIT
  • Paper: arXiv:2311.06242
  • Architecture: Transformer-based Vision Foundation Model

What is Florence-2-base?

Florence-2-base is a compact yet powerful vision foundation model developed by Microsoft that uses a prompt-based approach to handle various vision and vision-language tasks. Trained on the massive FLD-5B dataset containing 5.4 billion annotations across 126 million images, it represents a significant advancement in multi-task visual understanding.

Implementation Details

The model implements a sequence-to-sequence architecture optimized for both zero-shot and fine-tuned applications. It utilizes PyTorch and the Transformers library, supporting float16 precision for efficient processing.

  • Leverages prompt-based task specification
  • Processes both image and text inputs
  • Supports batch processing and GPU acceleration
  • Implements beam search for generation tasks
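The points above map to a short usage sketch with the Transformers library. The task tokens come from the official model card; the helper names and defaults (device, beam width) are illustrative assumptions, not an official API. The heavy imports are kept inside the functions so the sketch can be read and imported without a GPU.

```python
# Sketch of prompt-based inference with Florence-2-base via Hugging Face
# Transformers, following the usage pattern on the official model card.
# Helper names (load_florence2, run_task) are illustrative.

# Task tokens Florence-2 recognizes (subset, per the model card).
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
}

def load_florence2(device="cuda"):
    """Load model and processor in float16 (trust_remote_code is required,
    since the architecture ships as custom code with the checkpoint)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Florence-2-base",
        torch_dtype=torch.float16,
        trust_remote_code=True,
    ).to(device)
    processor = AutoProcessor.from_pretrained(
        "microsoft/Florence-2-base", trust_remote_code=True
    )
    return model, processor

def run_task(model, processor, image, task="caption", device="cuda"):
    """Run one prompt-specified task and parse the generated sequence."""
    import torch

    prompt = TASK_PROMPTS[task]
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"].to(device),
        pixel_values=inputs["pixel_values"].to(device, torch.float16),
        max_new_tokens=1024,
        num_beams=3,  # beam search, as in the reference examples
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation turns the raw token string into task-specific
    # output, e.g. boxes and labels for "<OD>".
    return processor.post_process_generation(
        raw, task=prompt, image_size=(image.width, image.height)
    )
```

In practice you would call `load_florence2()` once and reuse the model/processor pair across tasks, switching behavior purely by changing the prompt token.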

Core Capabilities

  • Image Captioning (with multiple detail levels)
  • Object Detection (mAP 34.7 on COCO val2017)
  • Dense Region Captioning
  • OCR with Region Detection
  • Phrase Grounding
  • Visual Question Answering
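For the region-based capabilities above (object detection, dense region captioning, OCR with regions), the processor's `post_process_generation` returns a dict keyed by the task token, e.g. `{'<OD>': {'bboxes': [...], 'labels': [...]}}` per the model card. A tiny helper, hypothetical and not part of the library, that flattens this into `(label, box)` pairs:

```python
def od_to_pairs(parsed, task="<OD>"):
    """Flatten Florence-2 object-detection output into (label, box) pairs.
    Expects the {'<OD>': {'bboxes': [...], 'labels': [...]}} layout that
    processor.post_process_generation returns (key names per the model card)."""
    result = parsed[task]
    return list(zip(result["labels"], result["bboxes"]))

# Example with a mocked parsed result (coordinate values are illustrative):
sample = {"<OD>": {"bboxes": [[34.2, 160.1, 597.4, 371.8]], "labels": ["car"]}}
pairs = od_to_pairs(sample)
# -> [('car', [34.2, 160.1, 597.4, 371.8])]
```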

Frequently Asked Questions

Q: What makes this model unique?

Florence-2-base stands out for its ability to handle multiple vision tasks through simple prompts without task-specific fine-tuning, achieving strong zero-shot performance despite its relatively small size of 0.23B parameters.

Q: What are the recommended use cases?

The model excels in various computer vision tasks including image captioning (133.0 CIDEr on COCO), object detection, OCR, and visual grounding. It's particularly suitable for applications requiring multiple vision capabilities in a single model.
