# CogFlorence-2.2-Large
| Property | Value |
|---|---|
| Parameter Count | 823M |
| Model Type | Image-to-Text |
| License | MIT |
| Precision | FP16 |
## What is CogFlorence-2.2-Large?
CogFlorence-2.2-Large is an image-to-text model that builds on Microsoft's Florence-2-large architecture. It was fine-tuned on a curated 40,000-image subset of the Ejafa/ye-pop dataset, with captions generated by THUDM/cogvlm2-llama3-chat-19B and refined with google/gemma-2-9b.
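A minimal loading sketch with the Hugging Face Transformers API is shown below. The repo id `thwri/CogFlorence-2.2-Large` is an assumption for illustration; Florence-2 checkpoints ship custom modeling code, so `trust_remote_code=True` is required.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed Hub repo id; substitute the actual id if it differs.
model_id = "thwri/CogFlorence-2.2-Large"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Florence-2 checkpoints use custom modeling code, hence trust_remote_code=True.
# FP16 matches the precision listed in the table above.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to(device).eval()

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```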
## Implementation Details
Training used a frozen vision encoder to maintain stability, a batch size of 64 with 16 gradient accumulation steps, and the AdamW optimizer with a polynomial learning-rate scheduler. The learning rate was set to 5.12e-05, and training ran for 8.36 epochs (see the sketch after the list below).
- Frozen vision encoder architecture
- Optimized training parameters for stability
- Post-processed captions for enhanced clarity
- Efficient FP16 precision implementation
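These reported hyperparameters map naturally onto Hugging Face `TrainingArguments`. The sketch below is illustrative only: the original training script is not reproduced here, the split between per-device batch size and accumulation is an assumption, and the `vision_tower` attribute used for freezing the encoder follows Florence-2's custom modeling code.

```python
from transformers import TrainingArguments

# Freeze the vision encoder for stability; `vision_tower` is the attribute
# name in Florence-2's custom modeling code (assumes `model` is already loaded).
for param in model.vision_tower.parameters():
    param.requires_grad = False

# Hyperparameters as reported above; whether 64 is the per-device or the
# effective batch size is not specified, so per-device is assumed here.
training_args = TrainingArguments(
    output_dir="cogflorence-2.2-large",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=16,
    optim="adamw_torch",
    lr_scheduler_type="polynomial",
    learning_rate=5.12e-5,
    num_train_epochs=8.36,
    fp16=True,
)
```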
## Core Capabilities
- Detailed image caption generation
- High-quality visual understanding
- Efficient processing with reduced precision
- Robust handling of diverse image types
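Florence-2-derived models are steered with task prompts; detailed captioning is typically invoked with `<MORE_DETAILED_CAPTION>`. An end-to-end example, reusing the `model` and `processor` loaded earlier and a placeholder image URL, might look like this:

```python
import requests
import torch
from PIL import Image

# Placeholder image; replace with your own input.
url = "https://example.com/sample.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Task prompt for the most detailed caption Florence-2 supports.
prompt = "<MORE_DETAILED_CAPTION>"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
    do_sample=False,
)

raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# post_process_generation strips task tokens and returns a {task: caption} dict.
result = processor.post_process_generation(
    raw, task=prompt, image_size=(image.width, image.height)
)
print(result[prompt])
```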
## Frequently Asked Questions

**Q: What makes this model unique?**
This model combines the powerful Florence-2-large architecture with carefully curated training data and sophisticated caption generation using CogVLM2, making it particularly effective for detailed image description tasks.
**Q: What are the recommended use cases?**
The model excels in generating detailed, context-aware image captions, making it ideal for content description, accessibility applications, and automated image cataloging systems.