Stable Diffusion v1.4

Property	Value
License	CreativeML OpenRAIL-M
Authors	Robin Rombach, Patrick Esser
Training Infrastructure	32 x 8 x A100 GPUs
Base Model	Stable Diffusion v1.2 (the checkpoint v1.4 was resumed from)

What is stable-diffusion-v1-4?

Stable Diffusion v1.4 is a latent text-to-image diffusion model in the Stable Diffusion v1 checkpoint series. It pairs a latent diffusion architecture with a CLIP ViT-L/14 text encoder to generate images from text prompts.

Implementation Details

The model combines an autoencoder with a diffusion model trained in the autoencoder's latent space. The encoder maps each image to a latent representation with a downsampling factor of 8, so a 512x512 image becomes a 64x64 latent. Training used the AdamW optimizer with a learning rate of 0.0001 and a batch size of 2048, distributed across 32 x 8 A100 GPUs.
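The effect of the downsampling factor can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 4-channel latent is an assumption about this autoencoder that the text above does not state.

```python
def latent_shape(height, width, factor=8, channels=4):
    """Return (channels, h, w) of the latent for an image of the given size.

    factor=8 is the downsampling factor described above; channels=4 is an
    assumption about the autoencoder, not something stated in this text.
    """
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the factor")
    return (channels, height // factor, width // factor)


def compression_ratio(height, width, factor=8, channels=4):
    """Ratio of RGB pixel values to latent values for one image."""
    c, h, w = latent_shape(height, width, factor, channels)
    return (3 * height * width) / (c * h * w)


shape = latent_shape(512, 512)
ratio = compression_ratio(512, 512)
print(shape, ratio)
```

Under these assumptions the diffusion model works on far fewer values per image than a pixel-space model would, which is the main reason for training in latent space.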

Utilizes CLIP ViT-L/14 text encoder for processing prompts
Implements cross-attention in the UNet backbone
Supports multiple scheduling algorithms including PLMS and K-LMS
Operates at 512x512 resolution for optimal results
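
As a rough illustration of how the points above fit together in practice, here is a minimal sketch that assumes the Hugging Face diffusers library is installed. The repository id, the mapping from scheduler names to diffusers classes, and the default step count and guidance scale are assumptions, not details taken from this page. The helper names are hypothetical.

```python
# Assumed Hugging Face Hub repository id for this checkpoint.
MODEL_ID = "CompVis/stable-diffusion-v1-4"

# Assumed mapping from the scheduler names above to diffusers class names.
SCHEDULER_CLASSES = {
    "plms": "PNDMScheduler",
    "k-lms": "LMSDiscreteScheduler",
}


def scheduler_class_name(name):
    """Translate a scheduler name such as 'K-LMS' into a diffusers class name."""
    key = name.strip().lower()
    if key not in SCHEDULER_CLASSES:
        raise ValueError(f"unknown scheduler: {name!r}")
    return SCHEDULER_CLASSES[key]


def generate(prompt, scheduler="plms", steps=50, guidance_scale=7.5):
    """Generate one 512x512 image; downloads the model weights on first use."""
    import diffusers  # imported lazily so the helper above works without it

    pipe = diffusers.StableDiffusionPipeline.from_pretrained(MODEL_ID)
    scheduler_cls = getattr(diffusers, scheduler_class_name(scheduler))
    # Swap in the chosen scheduler, reusing the existing scheduler's config.
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    result = pipe(
        prompt,
        height=512,
        width=512,
        num_inference_steps=steps,
        guidance_scale=guidance_scale,
    )
    return result.images[0]
```

Calling generate("a watercolor painting of a lighthouse", scheduler="k-lms") would return a PIL image. Without a GPU the call is slow, so treat this as a starting point, not a tuned setup.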

Core Capabilities

High-quality text-to-image generation
Support for artistic and creative applications
Composition of multi-object scenes from prompts, with weaker results on complex spatial relationships
Classifier-free guidance sampling
Research and educational applications
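
Classifier-free guidance, listed above, is commonly formulated as an extrapolation from the unconditional noise prediction toward the text-conditioned one. The sketch below shows that standard formulation on plain Python lists. The function name is hypothetical, and this is not claimed to be the exact code used by the model's authors.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions: eps = eps_u + s * (eps_c - eps_u).

    eps_uncond: prediction for the empty prompt
    eps_cond:   prediction for the actual text prompt
    guidance_scale (s): 0 ignores the prompt, 1 returns eps_cond unchanged,
                        values above 1 push further toward the prompt.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


guided = classifier_free_guidance([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
print(guided)
```

In a real sampler the two predictions are tensors produced by the UNet at each denoising step, but the combination rule is the same element-wise operation.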

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its improved aesthetic quality and better classifier-free guidance sampling, while keeping the same architecture as the earlier v1 checkpoints. It balances image quality against generation speed.

Q: What are the recommended use cases?

The model is primarily intended for research purposes, including studies of safe deployment, artistic applications, educational tools, and generative model research. Its intended use explicitly excludes generating harmful, offensive, or misleading content.