
Adding Conditional Control to Text-to-Image Diffusion Models

Published 10 Feb 2023 in cs.CV, cs.AI, cs.GR, cs.HC, and cs.MM | (2302.05543v3)

Abstract: We present ControlNet, a neural network architecture that adds spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models and reuses their deep and robust encoding layers, pretrained with billions of images, as a strong backbone for learning a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero, ensuring that no harmful noise can affect the finetuning. We test various conditioning controls, e.g., edges, depth, segmentation, and human pose, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that ControlNet training is robust with small (<50k) and large (>1M) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.


Summary

  • The paper introduces ControlNet, a novel architecture that enhances text-to-image diffusion models by enabling spatial conditioning controls.
  • It employs zero-initialized convolution layers to integrate locked pretrained blocks with a trainable copy, ensuring stable, noise-free training.
  • Experiments show ControlNet's versatility and efficiency across varied conditions, achieving high-quality, semantically aligned image outputs.

Introduction

The paper "Adding Conditional Control to Text-to-Image Diffusion Models" (2302.05543) introduces ControlNet, a neural architecture designed to enhance large, pretrained text-to-image diffusion models. The primary aim is to enable spatial conditioning controls in Stable Diffusion, giving users finer control over image generation. The architecture locks the parameters of a production-ready diffusion model and creates a trainable copy of its encoder, connecting the two through zero-initialized convolution layers; this keeps fine-tuning stable by preventing harmful noise from perturbing the pretrained features.

Methodology

ControlNet enriches text-to-image diffusion models with spatially localized, task-specific image conditions. The original model parameters are locked, and a trainable copy is attached through zero convolution layers. This setup lets the model leverage its pretrained layers while progressively learning specific conditional controls without introducing noise.

Figure 1: A neural block takes a feature map x as input and outputs another feature map y, as shown in (a). To add a ControlNet to such a block, we lock the original block, create a trainable copy, and connect the two using zero convolution layers.
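The locked-block-plus-trainable-copy wiring can be sketched in plain Python. This is a toy illustration only: the scalar "feature maps" and linear stand-in blocks are assumptions made for brevity, not the paper's actual convolutional implementation.

```python
# Toy sketch of a ControlNet block. Scalars stand in for feature maps
# and linear maps stand in for network blocks; illustrative only.

class ZeroConv:
    """A zero-initialized 'convolution' (here just w*x + b with w = b = 0).
    Its output is exactly 0 before training, so the trainable branch
    contributes nothing at initialization."""
    def __init__(self):
        self.w, self.b = 0.0, 0.0
    def __call__(self, x):
        return self.w * x + self.b

class Block:
    """Stand-in for a pretrained network block F(x; theta)."""
    def __init__(self, w):
        self.w = w
    def __call__(self, x):
        return self.w * x

class ControlNetBlock:
    """y = F(x) + Z_out(F_copy(x + Z_in(c))), with F locked and F_copy trainable."""
    def __init__(self, locked):
        self.locked = locked           # frozen pretrained block
        self.copy = Block(locked.w)    # trainable clone of its weights
        self.zin, self.zout = ZeroConv(), ZeroConv()
    def __call__(self, x, c):
        return self.locked(x) + self.zout(self.copy(x + self.zin(c)))

block = ControlNetBlock(Block(w=2.0))
x, condition = 3.0, 1.0
# At initialization the zero convolutions silence the new branch,
# so the combined output equals the locked block's output alone.
print(block(x, condition))  # -> 6.0
print(block.locked(x))      # -> 6.0
```

The key property this demonstrates is that before any training step the ControlNet branch is an exact identity with respect to the pretrained model, so fine-tuning starts from the backbone's original behavior rather than from noise.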

ControlNet Structure

ControlNet integrates into the architecture of diffusion models such as Stable Diffusion by locking the 12 encoding blocks and the middle block of the U-net and connecting each locked block to its trainable copy through zero convolution layers. The design is computationally efficient: no gradients are computed for the frozen parameters, which saves considerable GPU memory.

Figure 2: Stable Diffusion's U-net architecture connected with a ControlNet on the encoder blocks and middle block. The locked, gray blocks show the structure of Stable Diffusion V1.5.

Training and Inference

The training objective is standard noise prediction, with the zero convolution layers safeguarding against detrimental noise early in training. ControlNet training also exhibits a sudden convergence phenomenon: at a certain step, the model abruptly learns to follow the conditioning image.

Figure 3: The sudden convergence phenomenon. ControlNet always predicts high-quality images during the entire training. At a certain step, the model suddenly learns to follow the input condition.
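Concretely, the objective is the usual diffusion noise-prediction loss, conditioned on both the text prompt c_t and the task-specific condition image c_f (notation follows the paper's description; the exact expectation variables are reconstructed here):

```latex
\mathcal{L} \;=\; \mathbb{E}_{z_0,\, t,\, c_t,\, c_f,\, \epsilon \sim \mathcal{N}(0,1)}
\Big[ \big\lVert \epsilon - \epsilon_\theta(z_t,\, t,\, c_t,\, c_f) \big\rVert_2^2 \Big]
```

Because the zero convolutions output exactly zero at initialization, the first optimization steps reproduce the pretrained model's predictions, which helps explain both the stability of training and the abrupt adoption of the condition later on.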

Classifier-Free Guidance (CFG) is further improved by the proposed CFG Resolution Weighting, which adjusts how strongly the conditioning input influences the model's outputs at each resolution.

Figure 4: Effect of Classifier-Free Guidance (CFG) and the proposed CFG Resolution Weighting (CFG-RW).
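As a sketch, CFG Resolution Weighting multiplies each ControlNet-to-Stable-Diffusion connection by a resolution-dependent weight, reportedly w_i = 64 / h_i, where h_i is the spatial size of the i-th block's feature map. The list of block sizes below is an illustrative assumption, not the exact model layout:

```python
# Sketch of CFG Resolution Weighting (CFG-RW), assuming the rule
# w_i = 64 / h_i from the paper, where h_i is the spatial size of the
# i-th block's feature map in latent space.

def cfg_rw_weights(block_sizes):
    """Coarser (smaller) feature maps receive larger weights, so the
    condition steers global layout more strongly than fine texture."""
    return [64 / h for h in block_sizes]

# Hypothetical latent-space resolutions for a 512x512 image (latent 64x64).
sizes = [8, 16, 32, 64]
print(cfg_rw_weights(sizes))  # -> [8.0, 4.0, 2.0, 1.0]
```

These weights are applied only to the guided branch, so the conditioning image shapes the overall composition without washing out high-frequency detail.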

Experimentation

ControlNet is versatile across conditional inputs, including depth maps, edges, segmentation maps, and human poses, and can compose multiple conditions into complex images even without textual prompts.

Figure 5: Composition of multiple conditions. We show an application that uses depth and pose conditions simultaneously.

Qualitative results demonstrate ControlNet's proficiency in handling diverse conditions and generating visually cohesive outputs aligned with the input semantics.

Figure 6: Controlling Stable Diffusion with various conditions without prompts. The top row shows input conditions; all other rows are outputs.


Ablative Studies and Performance

Comparative analyses with prior methods and user studies establish ControlNet's ability to generate high-quality images with strong conditional fidelity. User rankings and empirical evaluations show it remains competitive with industry-trained models despite a much smaller computational budget.

Figure 7: Ablative study of different architectures on a sketch condition and different prompt settings.

Figure 8: Comparison to previous methods. We present qualitative comparisons to PITI and Sketch-Guided Diffusion.


Dataset and Transferability

Experiments with varying dataset sizes show that ControlNet scales well and trains robustly even with limited data. Moreover, trained ControlNets transfer directly to community models derived from the same base, enhancing practical applicability.

Figure 9: The influence of different training dataset sizes.

Figure 10: Transfer pretrained ControlNets to community models without retraining.

Conclusion

ControlNet significantly broadens the capabilities of text-to-image diffusion models by enabling spatial conditional controls, preserving the integrity of large pretrained networks while efficiently learning diverse conditions. Its robust architecture ensures high-quality outputs and seamless integration into existing models, promising expansive applications in controlled image generation.

This research contributes a versatile tool for fine-tuned image creation, with implications for more precise semantic content generation, potentially fostering further innovations in AI-driven visual synthesis.
