CogFlorence-2.2-Large

Maintained By
thwri


Property         Value
Parameter Count  823M
Model Type       Image-to-Text
License          MIT
Precision        FP16

What is CogFlorence-2.2-Large?

CogFlorence-2.2-Large is an image-to-text model built on Microsoft's Florence-2-large architecture. It was fine-tuned on a curated subset of 40,000 images from the Ejafa/ye-pop dataset, with captions generated by THUDM/cogvlm2-llama3-chat-19B and then refined with google/gemma-2-9b.

Implementation Details

The model was fine-tuned with its vision encoder frozen to keep the pretrained visual representations stable. Training used a batch size of 64 with 16 gradient accumulation steps and the AdamW optimizer with a polynomial learning-rate scheduler, at a learning rate of 5.12e-05 over 8.36 epochs.

  • Frozen vision encoder architecture
  • Optimized training parameters for stability
  • Post-processed captions for enhanced clarity
  • Efficient FP16 precision implementation
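The training setup above can be sketched in PyTorch. This is a minimal illustration, not the actual training script: `TinyCaptioner` is a hypothetical stand-in for the real Florence-2 model, and only the freeze/optimizer/scheduler pattern and the hyperparameters quoted in the card are taken from the source.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import PolynomialLR

# Hypothetical stand-in for the Florence-2 vision encoder + language head;
# the real model is loaded from the Hub, but the freezing/optimizer pattern
# described in the card looks like this.
class TinyCaptioner(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(16, 16)  # frozen during fine-tuning
        self.language_head = nn.Linear(16, 8)    # receives gradient updates
    def forward(self, x):
        return self.language_head(self.vision_encoder(x))

model = TinyCaptioner()

# Freeze the vision encoder so only the language side is trained.
for p in model.vision_encoder.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = AdamW(trainable, lr=5.12e-5)  # learning rate from the card
scheduler = PolynomialLR(optimizer, total_iters=1_000, power=1.0)

# One illustrative optimizer step with gradient accumulation
# (the card reports 16 accumulation steps).
accum_steps = 16
optimizer.zero_grad()
for micro_step in range(accum_steps):
    loss = model(torch.randn(4, 16)).pow(2).mean() / accum_steps
    loss.backward()
optimizer.step()
scheduler.step()
```

Freezing is done by clearing `requires_grad` before constructing the optimizer, so the frozen parameters never accumulate gradients or appear in the optimizer state.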

Core Capabilities

  • Detailed image caption generation
  • High-quality visual understanding
  • Efficient processing with reduced precision
  • Robust handling of diverse image types

Frequently Asked Questions

Q: What makes this model unique?

This model combines the powerful Florence-2-large architecture with carefully curated training data and sophisticated caption generation using CogVLM2, making it particularly effective for detailed image description tasks.

Q: What are the recommended use cases?

The model excels in generating detailed, context-aware image captions, making it ideal for content description, accessibility applications, and automated image cataloging systems.
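For captioning, a minimal inference sketch follows. It assumes the model keeps the base Florence-2-large interface (the `<MORE_DETAILED_CAPTION>` task token, `trust_remote_code=True` loading, and `post_process_generation`); this card does not itself show inference code, so treat the details as assumptions inherited from the base model.

```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "thwri/CogFlorence-2.2-Large"
TASK = "<MORE_DETAILED_CAPTION>"  # assumed Florence-2 task token

def caption(image_path: str) -> str:
    """Load the model (downloads on first call) and caption an image."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=TASK, images=image, return_tensors="pt").to(device, dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Florence-2 post-processing strips task tokens and formats the answer.
    parsed = processor.post_process_generation(
        raw, task=TASK, image_size=(image.width, image.height)
    )
    return parsed[TASK]
```

Running in FP16 on GPU matches the precision listed in the table above; on CPU the sketch falls back to FP32.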
