Stable Diffusion, released in 2022, is a deep learning, text-to-image model developed by start-up Stability AI in collaboration with various academic researchers and non-profit organizations. Designed to generate detailed images based on text descriptions, Stable Diffusion can also be applied to inpainting, outpainting, and image-to-image translations guided by a text prompt. Unlike proprietary text-to-image models such as DALL-E and Midjourney, which are accessible only via cloud services, Stable Diffusion's code and model weights have been released publicly, making it available to users with consumer hardware equipped with a modest GPU of at least 8 GB VRAM.
Stable Diffusion uses a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained to remove successive applications of Gaussian noise from training images. Stable Diffusion comprises three parts: a variational autoencoder (VAE), a U-Net, and an optional text encoder. The VAE encoder compresses an image from pixel space into a lower-dimensional latent space that captures its fundamental semantic content. During forward diffusion, Gaussian noise is iteratively applied to this compressed latent representation; the U-Net then reverses the process, denoising the result step by step to recover a clean latent representation. Finally, the VAE decoder converts that representation back into pixel space to produce the final image.
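The forward diffusion step described above can be illustrated with a minimal sketch. It assumes the linear noise schedule commonly used in the diffusion literature (1,000 steps, variances from 0.0001 to 0.02), which is not necessarily the schedule Stable Diffusion itself uses, and it treats a short list of numbers as a stand-in for a flattened VAE latent. All function names are hypothetical.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, evenly spaced (assumed schedule, for illustration)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) over steps 0..t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(latent, t, alpha_bars, rng):
    """Sample the noised latent at step t directly from the clean latent.

    Returns (noised_latent, noise). The noise is the quantity a denoising
    network would typically be trained to predict.
    """
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noised = [signal_scale * x + noise_scale * e for x, e in zip(latent, noise)]
    return noised, noise

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
clean_latent = [0.5, -1.0, 0.25, 2.0]  # stand-in for a flattened VAE latent
noised_latent, noise = add_noise(clean_latent, 999, alpha_bars, rng)
```

Under this schedule the signal scale shrinks steadily with the step index, so by the final step almost none of the original latent remains and the sample is close to pure Gaussian noise. Reversing that process is the task the U-Net is trained for.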
The denoising step can be flexibly conditioned on text, images, or other modalities, with the encoded conditioning data exposed to the denoising U-Net via a cross-attention mechanism. For conditioning on text, the pretrained CLIP ViT-L/14 text encoder is used to transform text prompts into an embedding space. Because diffusion runs in the compressed latent space rather than on full-resolution pixels, LDMs are more computationally efficient than pixel-space diffusion models for both training and generation.
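The cross-attention mechanism mentioned above can be sketched as follows. This is a single-head, pure-Python illustration rather than the model's actual implementation: queries come from the latent tokens, while keys and values come from the text embeddings. The tiny matrices and all names are made up for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Scaled dot-product attention from latent tokens (queries) to text tokens."""
    queries = matmul(latent_tokens, w_q)  # shape (n_latent, d)
    keys = matmul(text_tokens, w_k)       # shape (n_text, d)
    values = matmul(text_tokens, w_v)     # shape (n_text, d)
    scale = 1.0 / math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)      # shape (n_latent, n_text)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, values), weights

# Toy inputs: 3 latent tokens of width 2, 2 text tokens of width 3.
latent_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attended, weights = cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v)
```

Each row of weights sums to 1 and describes how strongly one latent position attends to each text token, so the output for that position is a prompt-dependent mixture of the text values.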
Stable Diffusion was initially trained on the laion2B-en and laion-high-resolution subsets, with the final rounds of training performed on LAION-Aesthetics v2 5+, a subset of 600 million captioned images. The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, costing $600,000.
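The training figures above imply a few derived quantities, worked out in the short calculation below. The wall-clock estimate assumes all 256 GPUs ran continuously and in parallel, which the figures themselves do not state.

```python
gpu_count = 256
gpu_hours = 150_000
total_cost_usd = 600_000

# Average price paid per GPU-hour.
cost_per_gpu_hour = total_cost_usd / gpu_hours

# Elapsed time if every GPU were busy for the whole run (an assumption).
wall_clock_hours = gpu_hours / gpu_count
wall_clock_days = wall_clock_hours / 24

print(f"${cost_per_gpu_hour:.2f} per GPU-hour")
print(f"{wall_clock_hours:.1f} hours, or about {wall_clock_days:.1f} days")
```

This works out to $4.00 per GPU-hour and roughly 24 days of continuous training under the stated assumption.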
Stable Diffusion allows users to generate new images from scratch using text prompts, modify existing images by incorporating new elements described by a text prompt, and partially alter existing images via inpainting and outpainting. User interfaces built around the model offer additional features, such as letting users modify the weight given to specific parts of the text prompt.
Text-to-image generation is achieved through the txt2img sampling script, which takes a text prompt and outputs an image based on the model's interpretation. Images generated by Stable Diffusion are tagged with an invisible digital watermark to enable identification. Users can adjust the number of inference steps for the sampler and the classifier-free guidance scale value, and can provide negative prompts to steer the generated output away from unwanted content.
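The role of the classifier-free guidance scale can be sketched with the standard guidance formula: at each step the denoiser is evaluated twice, once with the prompt and once without it, and the two noise predictions are extrapolated. The function name is hypothetical. Treating a negative prompt as a replacement for the empty "unconditional" prompt reflects how common implementations handle it, and should be read as an assumption rather than a description of the txt2img script itself.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and past) the prompt-conditioned one.

    noise_uncond is the prediction for an empty prompt, or for the negative
    prompt when one is supplied (assumed convention).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]

noise_uncond = [0.0, 1.0]  # toy prediction without the prompt
noise_cond = [1.0, 1.0]    # toy prediction with the prompt

guided = apply_guidance(noise_uncond, noise_cond, 7.5)
```

A scale of 1.0 reproduces the conditioned prediction unchanged, a scale of 0.0 ignores the prompt entirely, and larger values push the result further in the direction the prompt suggests.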
Image modification is performed using the img2img sampling script, which takes a text prompt, an existing image, and a strength value between 0.0 and 1.0. The output is a new image based on the original image, incorporating elements provided within the text prompt. This capability can be useful for data anonymization, data augmentation, image upscaling, and image compression. The strength value determines how much noise is added to the input image before denoising begins; higher values produce more variation from the original, and the result may be less semantically consistent with the provided prompt.
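One way to see what the strength value does is to map it onto the sampler's step schedule. The sketch below follows a convention found in some open-source img2img pipelines, where strength decides how far into the noise schedule the input image is pushed and therefore how many denoising steps are actually run. The function name is hypothetical and the original script may differ in detail.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (start_index, steps_to_run) for a given strength.

    start_index is the position in the sampler's step list where denoising
    begins; earlier steps are skipped because the input image is only
    partially noised.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    start_index = max(num_inference_steps - steps_to_run, 0)
    return start_index, steps_to_run

print(img2img_schedule(50, 0.75))  # (13, 37)
print(img2img_schedule(50, 1.0))   # (0, 50)
print(img2img_schedule(50, 0.0))   # (50, 0)
```

With a strength of 1.0 the input is fully noised and every step runs, so the original image has little influence. With a strength of 0.0 no steps run and the input is returned essentially unchanged.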
Front-end implementations of Stable Diffusion offer additional use-cases for image modification, such as inpainting and outpainting. Inpainting involves selectively modifying a portion of an existing image delineated by a user-provided layer mask, filling the masked space with newly generated content based on the provided prompt. Outpainting extends an image beyond its original dimensions, filling the previously empty space with content generated based on the provided prompt.
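The masking idea behind both operations can be sketched on a tiny grayscale image. This is a deliberate simplification: it only shows how a mask decides which pixels are kept and which are replaced, and how outpainting reduces to inpainting on a padded canvas. Actual front-ends apply the mask during the denoising process, and all names here are hypothetical.

```python
def blend_with_mask(original, generated, mask):
    """Keep original pixels where mask is 0, take generated pixels where mask is 1."""
    return [
        [m * g + (1 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]

def pad_for_outpainting(image, pad, fill=0.0):
    """Extend a 2D image by `pad` pixels on every side and build the matching mask.

    The mask is 1 over the new border (to be generated) and 0 over the
    original pixels (to be preserved).
    """
    width = len(image[0])
    new_width = width + 2 * pad
    padded = [[fill] * new_width for _ in range(pad)]
    mask = [[1] * new_width for _ in range(pad)]
    for row in image:
        padded.append([fill] * pad + list(row) + [fill] * pad)
        mask.append([1] * pad + [0] * width + [1] * pad)
    padded += [[fill] * new_width for _ in range(pad)]
    mask += [[1] * new_width for _ in range(pad)]
    return padded, mask

image = [[1.0, 2.0], [3.0, 4.0]]
canvas, mask = pad_for_outpainting(image, pad=1)
generated = [[9.0] * 4 for _ in range(4)]  # stand-in for model output
result = blend_with_mask(canvas, generated, mask)
```

In the result, the one-pixel border holds the generated values while the central 2×2 region still contains the original image.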
With the release of Stable Diffusion 2.0, a depth-guided model called "depth2img" was introduced. This model infers the depth of the provided input image and generates a new output image based on both the text prompt and the depth information, allowing the coherence and depth of the original input image to be maintained in the generated output.
ControlNet is a neural network architecture designed to control diffusion models by incorporating additional conditions. It duplicates the weights of neural network blocks into a "locked" copy and a "trainable" copy. The "trainable" copy learns the desired condition, while the "locked" copy preserves the original model. As a result, training on small datasets of image pairs does not degrade a production-ready diffusion model.
The "zero convolution" is a 1×1 convolution with both weight and bias initialized to zero. Before training, all zero convolutions produce zero output, so ControlNet initially adds nothing to the original model's activations and causes no distortion. No layer is trained from scratch; the process is still fine-tuning, which keeps the original model intact. This method enables training on small-scale or even personal devices.
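The effect of zero initialization can be checked with a small sketch of a single ControlNet-style block. A feature map is represented as a list of pixels, each holding a list of channel values, and the "locked" and "trainable" blocks are simple stand-in functions rather than real network layers. All names are hypothetical.

```python
def conv1x1(pixels, weight, bias):
    """1x1 convolution: the same linear map across channels at every pixel."""
    return [
        [sum(w * x for w, x in zip(w_row, pixel)) + b for w_row, b in zip(weight, bias)]
        for pixel in pixels
    ]

def zero_conv_params(channels):
    """Weight matrix and bias vector, both initialized to zero."""
    return [[0.0] * channels for _ in range(channels)], [0.0] * channels

def locked_block(pixels):
    """Stand-in for a frozen block of the pretrained model."""
    return [[2.0 * x + 1.0 for x in pixel] for pixel in pixels]

def trainable_block(pixels):
    """Stand-in for the trainable copy, which starts out identical to the locked block."""
    return locked_block(pixels)

def add(a, b):
    """Element-wise sum of two feature maps."""
    return [[x + y for x, y in zip(pa, pb)] for pa, pb in zip(a, b)]

def controlnet_block(x, condition, zero_in, zero_out):
    """Locked output plus the trainable branch, gated by two zero convolutions."""
    conditioned = add(x, conv1x1(condition, *zero_in))
    control = conv1x1(trainable_block(conditioned), *zero_out)
    return add(locked_block(x), control)

x = [[1.0, 2.0], [3.0, -1.0]]          # two pixels, two channels
condition = [[0.5, 0.5], [10.0, -10.0]]  # e.g. an encoded edge map
zero_in = zero_conv_params(2)
zero_out = zero_conv_params(2)

output = controlnet_block(x, condition, zero_in, zero_out)
```

Before any training step, output equals locked_block(x) regardless of the condition, which is why attaching the untrained ControlNet does not distort the original model. Once the zero convolutions acquire non-zero weights through fine-tuning, the condition begins to influence the result.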
Stable Diffusion is a powerful text-to-image model that offers diverse applications, from generating new images from scratch to modifying existing images through inpainting and outpainting. The model's flexibility, combined with its ability to run on consumer hardware, makes it a valuable tool for developers and researchers alike. With its publicly released code and model weights, Stable Diffusion is poised to inspire further innovation and development in the field of text-to-image generation and deep learning.