Florence-2-base-PromptGen-v1.5

Maintained by: MiaoshouAI

Property          Value
Parameter Count   271M
License           MIT
Tensor Type       F32
Author            MiaoshouAI

What is Florence-2-base-PromptGen-v1.5?

Florence-2-base-PromptGen-v1.5 is an image captioning model built for AI art workflows. Based on Microsoft's Florence-2 architecture, it is an upgrade that adds new caption instructions and improves accuracy. It is tuned to generate detailed image descriptions and tags, which makes it well suited to AI image generation pipelines.

Implementation Details

The model supports five instruction modes: GENERATE_TAGS, CAPTION, DETAILED_CAPTION, MORE_DETAILED_CAPTION, and MIXED_CAPTION. It needs just over 1GB of VRAM while delivering fast, high-quality image captions.

  • Low memory footprint (just over 1GB of VRAM)
  • Compatible with Flux models, with captions usable for both the T5XXL and CLIP_L text encoders
  • Supports structured caption format with position detection
  • Improved watermark recognition capabilities
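
The instruction modes listed above can be turned into prompts roughly as follows. This is a minimal sketch, not the maintainer's reference code. It assumes the usual Florence-2 convention of wrapping a task name in angle brackets (for example <CAPTION>), the Hugging Face repository id MiaoshouAI/Florence-2-base-PromptGen-v1.5, and the standard Florence-2 remote-code interface in the transformers library; none of these are stated on this page. The helper names task_prompt and caption_image are hypothetical.

```python
# Instruction modes named in this document.
INSTRUCTIONS = (
    "GENERATE_TAGS",
    "CAPTION",
    "DETAILED_CAPTION",
    "MORE_DETAILED_CAPTION",
    "MIXED_CAPTION",
)


def task_prompt(mode: str) -> str:
    """Return the prompt token for an instruction mode.

    Assumption: the model follows the Florence-2 convention of
    angle-bracketed task tokens such as "<CAPTION>".
    """
    name = mode.strip().upper()
    if name not in INSTRUCTIONS:
        raise ValueError(f"unknown instruction mode: {mode!r}")
    return f"<{name}>"


def caption_image(image_path: str, mode: str = "MORE_DETAILED_CAPTION") -> str:
    """Caption one image (hypothetical helper; downloads the model on first use)."""
    # Heavy imports stay local so task_prompt() works without them installed.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "MiaoshouAI/Florence-2-base-PromptGen-v1.5"  # assumed repo id
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval()
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    prompt = task_prompt(mode)
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3,
            do_sample=False,
        )

    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Florence-2's processor returns a dict keyed by the task token.
    parsed = processor.post_process_generation(
        raw_text, task=prompt, image_size=(image.width, image.height)
    )
    return parsed[prompt]
```

Calling caption_image("picture.png", "GENERATE_TAGS") would then request tags instead of a prose caption; the file name here is only a placeholder. Keeping the mode check separate from inference means a typo in the mode name fails before any model weights are loaded.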

Core Capabilities

  • Generates Danbooru-style tags with the GENERATE_TAGS instruction
  • Produces structured captions with subject position information
  • Creates detailed descriptions with MORE_DETAILED_CAPTION mode
  • Supports a mixed caption style (MIXED_CAPTION) for Flux model integration
  • Efficient processing with low VRAM requirements
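
As a rough sanity check on the VRAM figure, the size of the weights alone can be estimated from the parameter count (271M) and tensor type (F32, 4 bytes per parameter) listed for this model. The sketch below is back-of-the-envelope arithmetic only; the function name weight_footprint_gb is hypothetical, and real VRAM usage also includes activations and runtime overhead that this estimate ignores.

```python
def weight_footprint_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# 271M parameters stored as F32 (4 bytes each).
print(round(weight_footprint_gb(271e6), 3))  # 1.084
```

The result, about 1.08 GB, is consistent with the "just over 1GB" figure quoted for this model.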

Frequently Asked Questions

Q: What makes this model unique?

The model is trained specifically for AI art workflows, and its captions avoid common problems such as stray LoRA trigger words and inaccurate tags. Unlike general-purpose vision models, it is designed for prompt generation and image tagging.

Q: What are the recommended use cases?

The model is ideal for AI artists and developers working with image generation pipelines, particularly those using ComfyUI. It's especially useful for creating detailed image descriptions, generating tags, and working with Flux models requiring both T5XXL and CLIP_L encoding.
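
One plausible way to feed the model's output into a Flux pipeline with two text encoders is sketched below. It is an illustration under stated assumptions, not a documented workflow: it assumes the GENERATE_TAGS output is a comma-separated string, that the tag list goes to CLIP_L, and that a prose caption goes to T5XXL. The helper names parse_tags and flux_prompts, and the dictionary keys, are hypothetical.

```python
from __future__ import annotations


def parse_tags(raw: str) -> list[str]:
    """Split a comma-separated tag string, trimming blanks and dropping duplicates."""
    seen: set[str] = set()
    tags: list[str] = []
    for piece in raw.split(","):
        tag = piece.strip()
        key = tag.lower()
        if tag and key not in seen:  # skip empty entries and case-insensitive repeats
            seen.add(key)
            tags.append(tag)
    return tags


def flux_prompts(caption: str, raw_tags: str) -> dict[str, str]:
    """Pair a prose caption with a cleaned tag list (assumed encoder split)."""
    return {
        "t5xxl": caption.strip(),                  # natural-language description
        "clip_l": ", ".join(parse_tags(raw_tags)),  # short tag-style prompt
    }


prompts = flux_prompts(
    " A girl stands in a field. ",
    "1girl, solo, 1girl, , outdoors",
)
print(prompts["clip_l"])  # 1girl, solo, outdoors
```

Cleaning the tag string before use keeps duplicate or empty entries out of the shorter prompt, where each token counts for more.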
