Florence-2-DocVQA

Maintained by: HuggingFaceM4


  • Parameter Count: 823M
  • Model Type: Image-Text-to-Text
  • License: MIT
  • Base Model: Florence-2-large-ft
  • Training Data: Docmatix (5% subset)

What is Florence-2-DocVQA?

Florence-2-DocVQA is a specialized model fine-tuned from Microsoft's Florence-2-large-ft, designed specifically for document visual question answering (DocVQA). It was trained for one day on 5% of the Docmatix dataset with a learning rate of 1e-6, making it effective at understanding and answering questions about document images.

Implementation Details

The model uses a transformer architecture and is implemented with the Hugging Face transformers library. Its weights are stored in float32 (F32) precision, and at 823M parameters it strikes a balance between computational cost and performance.

  • Built on Florence-2-large-ft architecture
  • Implements image-text-to-text pipeline
  • Uses safetensors for model storage
  • Supports custom code integration
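
Putting the pieces above together, loading the checkpoint and querying a document image can be sketched as follows. This is a minimal sketch, not an official example: the `<DocVQA>` task prefix, the helper names, and the `invoice.png` file name are illustrative assumptions; consult the model card on Hugging Face for the exact prompt format.

```python
# Sketch of querying HuggingFaceM4/Florence-2-DocVQA with a document image.
# Assumptions: the "<DocVQA>" task token and the example file name are
# illustrative; check the model card for the authoritative usage.

MODEL_ID = "HuggingFaceM4/Florence-2-DocVQA"
TASK_PROMPT = "<DocVQA>"  # assumed Florence-2 task token for document VQA


def build_prompt(question: str, task: str = TASK_PROMPT) -> str:
    """Florence-2 models expect a task token prepended to the question."""
    return task + question


def answer_question(image_path: str, question: str, max_new_tokens: int = 128) -> str:
    """Load the model and answer one question about a document image.

    Heavy dependencies are imported here so build_prompt stays importable
    without torch/transformers installed.
    """
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # trust_remote_code is required: Florence-2 ships custom model code,
    # which is why the card notes "supports custom code integration".
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, torch_dtype=torch.float32  # F32 weights
    )

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=build_prompt(question), images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=max_new_tokens,
        num_beams=3,
    )
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

A call such as `answer_question("invoice.png", "What is the total amount due?")` would download the 823M-parameter weights on first use, so expect a sizable one-time fetch.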

Core Capabilities

  • Document visual question answering
  • Image-text understanding
  • Text generation from visual inputs
  • English language processing

Frequently Asked Questions

Q: What makes this model unique?

Its distinguishing feature is targeted fine-tuning on the Docmatix dataset, which adapts the general-purpose Florence-2 base to the specific task of reading and answering questions about document images.

Q: What are the recommended use cases?

The model is best suited for applications requiring document understanding, visual question answering, and text generation from document images. It's particularly valuable for automated document processing and information extraction systems.
