stable-diffusion-2

Maintained By
stabilityai

Stable Diffusion v2

  • License: CreativeML Open RAIL++-M
  • Authors: Robin Rombach, Patrick Esser
  • Training Data: LAION-5B filtered subset
  • Paper: Latent Diffusion Models

What is stable-diffusion-2?

Stable Diffusion v2 is an advanced text-to-image generation model that builds upon the success of its predecessor. It's a Latent Diffusion Model that combines an autoencoder with a diffusion model trained in latent space, utilizing OpenCLIP-ViT/H as its text encoder. The model supports high-resolution image generation up to 768x768 pixels and implements the v-objective training approach for improved quality.
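As a quick orientation, the snippet below sketches basic text-to-image generation with the Hugging Face diffusers library. The stabilityai/stable-diffusion-2 checkpoint ID is the public release, but the prompt, step count, and guidance scale are illustrative assumptions rather than recommendations from this card.

```python
# Minimal text-to-image sketch with diffusers (illustrative settings).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2",  # 768x768 base checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a photograph of an astronaut riding a horse",  # example prompt (assumption)
    height=768,
    width=768,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("astronaut.png")
```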

Implementation Details

The model architecture consists of three main components: an autoencoder that converts images into latent representations with a downsampling factor of 8, a text encoder based on OpenCLIP-ViT/H, and a UNet backbone that processes the combined information. Training was conducted on 32 × 8 A100 GPUs with a batch size of 2048, using the AdamW optimizer.
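The factor-8 downsampling can be seen by pushing a dummy image through the published autoencoder. This is a minimal sketch assuming the diffusers AutoencoderKL weights stored under the checkpoint's vae subfolder.

```python
# Sketch: inspect the latent downsampling factor of the SD2 autoencoder.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2", subfolder="vae"
)

# A dummy 768x768 RGB batch standing in for an image scaled to [-1, 1].
image = torch.randn(1, 3, 768, 768)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()

print(latents.shape)  # torch.Size([1, 4, 96, 96]) -> 768 / 8 = 96
```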

  • Supports multiple specialized checkpoints: base model, inpainting, depth-aware generation, and upscaling
  • Implements efficient attention mechanisms through optional xformers integration
  • Provides flexibility in sampling with various schedulers, including DDIM and Euler Discrete (see the sketch after this list)
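A hedged sketch of both options, assuming diffusers with the optional xformers package installed:

```python
# Sketch: swap in the Euler Discrete sampler and enable memory-efficient attention.
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

model_id = "stabilityai/stable-diffusion-2"

# Load the Euler Discrete scheduler from the checkpoint's own scheduler config.
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")

pipe = StableDiffusionPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# Optional: memory-efficient attention (requires xformers to be installed).
pipe.enable_xformers_memory_efficient_attention()

image = pipe("a serene mountain lake at dawn").images[0]  # example prompt (assumption)
```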

Core Capabilities

  • High-quality image generation at 768x768 resolution
  • Improved photorealism compared to previous versions
  • Text-guided image generation and manipulation
  • Support for inpainting and depth-aware generation
  • 4x upscaling capabilities (illustrated in the sketch after this list)
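For instance, the 4x upscaling capability can be exercised through the dedicated upscaler checkpoint. The sketch below assumes the stabilityai/stable-diffusion-2-x4-upscaler weights and a hypothetical local low-resolution input file.

```python
# Sketch: 4x super-resolution with the SD2 upscaler checkpoint.
import torch
from diffusers import StableDiffusionUpscalePipeline
from diffusers.utils import load_image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

# "low_res_cat.png" is a placeholder path; a 128x128 input yields a 512x512 output.
low_res = load_image("low_res_cat.png").resize((128, 128))

upscaled = pipe(prompt="a white cat", image=low_res).images[0]
upscaled.save("upscaled_cat.png")
```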

Frequently Asked Questions

Q: What makes this model unique?

This model introduces significant improvements over its predecessor, including better photorealism, higher resolution support (768x768), and the implementation of the v-objective training approach. It also offers specialized versions for different tasks like inpainting and depth-aware generation.

Q: What are the recommended use cases?

The model is primarily intended for research purposes, including safe deployment studies, artistic applications, educational tools, and research on generative models. It specifically excludes the generation of harmful content, disinformation, or non-consensual imagery.
