
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Published 20 Dec 2021 in cs.CV, cs.GR, and cs.LG | (2112.10741v3)

Abstract: Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.


Summary

  • The paper demonstrates that classifier-free guidance in diffusion models produces photorealistic images with superior quality, even surpassing DALL-E in human evaluations.
  • The methodology leverages a noise-based diffusion process with fine-tuning for image inpainting, enabling seamless text-driven edits and iterative scene creation.
  • Quantitative metrics and human evaluations confirm its effectiveness, yielding competitive FID scores and improved image-caption similarity.

Text-Guided Photorealistic Image Synthesis with Diffusion Models

The paper "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models" (2112.10741) explores the use of diffusion models for text-conditional image synthesis, comparing CLIP guidance and classifier-free guidance. The authors demonstrate that classifier-free guidance yields higher-quality, more photorealistic images, even surpassing DALL-E in human evaluations. Furthermore, the paper introduces a fine-tuning approach for image inpainting, enabling text-driven image editing capabilities.

Background and Methods

The study builds on recent advances in diffusion models, which have demonstrated state-of-the-art image generation quality. The foundation of diffusion models lies in the progressive addition of Gaussian noise to an image, creating a Markov chain of latent variables $x_1, \ldots, x_T$ (where $x_0$ is the original image):

$q(x_t \mid x_{t-1}) \coloneqq \mathcal{N}(x_t; \sqrt{\alpha_t}\, x_{t-1}, (1-\alpha_t)\mathbf{I})$
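Composing the per-step transitions gives a closed form for sampling $x_t$ directly from $x_0$ via the cumulative product $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$. A minimal NumPy sketch of this forward noising (illustrative, not the authors' code):

```python
import numpy as np

def forward_noise(x0, t, alphas, rng=None):
    """Sample x_t ~ q(x_t | x_0) in closed form via alpha_bar_t."""
    rng = rng or np.random.default_rng()
    alpha_bar = np.prod(alphas[:t])          # cumulative product up to step t
    eps = rng.standard_normal(x0.shape)      # the Gaussian noise added
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps
```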

The model learns to reverse this process, gradually denoising from $x_T \sim \mathcal{N}(0, \mathbf{I})$ back to $x_0$ through $p_{\theta}(x_{t-1} \mid x_t) \coloneqq \mathcal{N}(\mu_{\theta}(x_t), \Sigma_{\theta}(x_t))$. The training process involves predicting the noise added at each step using a model $\epsilon_{\theta}$, optimizing the mean-squared error loss:

$L_{\text{simple}} \coloneqq E_{t \sim [1,T],\, x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0, \mathbf{I})}\left[\|\epsilon - \epsilon_{\theta}(x_t, t)\|^2\right]$
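In code, one training step samples a timestep, noises the image, and regresses the model's prediction onto the true noise. A hedged sketch with a stand-in `eps_model` (any function of $(x_t, t)$; not the paper's architecture):

```python
import numpy as np

def simple_loss(eps_model, x0, alphas, rng):
    """One Monte Carlo sample of L_simple for a single image x0."""
    T = len(alphas)
    t = int(rng.integers(1, T + 1))              # t ~ Uniform[1, T]
    alpha_bar = np.prod(alphas[:t])
    eps = rng.standard_normal(x0.shape)          # true noise
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    pred = eps_model(xt, t)                      # eps_theta(x_t, t)
    return float(np.mean((eps - pred) ** 2))
```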

The paper investigates two guidance strategies to steer the diffusion process toward text prompts: CLIP guidance and classifier-free guidance. CLIP guidance leverages a CLIP model to perturb the reverse-process mean, using the gradient of the dot product between image and caption encodings:

$\hat{\mu}_{\theta}(x_t \mid c) = \mu_{\theta}(x_t \mid c) + s \cdot \Sigma_{\theta}(x_t \mid c)\, \nabla_{x_t} \left(f(x_t) \cdot g(c)\right)$
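The perturbation itself is a single gradient step scaled by the predicted variance. As a toy sketch, assume a linear image encoder $f(x) = Wx$, so that $\nabla_x \left(f(x) \cdot g(c)\right) = W^\top g(c)$; a real CLIP encoder would require autodiff:

```python
import numpy as np

def clip_guided_mean(mu, sigma, W, caption_emb, s=3.0):
    """Shift the reverse-process mean along the CLIP similarity gradient.

    Toy assumption: image encoder f(x) = W @ x, so the gradient of
    f(x) . g(c) with respect to x is W.T @ g(c), independent of x.
    """
    grad = W.T @ caption_emb
    return mu + s * sigma * grad
```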

Classifier-free guidance, on the other hand, avoids a separate classifier by replacing labels with a null label during training. During sampling, it extrapolates the model's output based on the difference between conditional and unconditional predictions:

$\hat{\epsilon}_{\theta}(x_t \mid c) = \epsilon_{\theta}(x_t \mid \emptyset) + s \cdot \left(\epsilon_{\theta}(x_t \mid c) - \epsilon_{\theta}(x_t \mid \emptyset)\right)$
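This extrapolation is a one-liner at sampling time; a minimal sketch (a scale $s > 1$ pushes samples toward the caption at the cost of diversity):

```python
import numpy as np

def cfg(eps_cond, eps_uncond, s):
    """Classifier-free guidance: extrapolate past the unconditional prediction."""
    return eps_uncond + s * (eps_cond - eps_uncond)
```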

Figure 1: Selected samples from GLIDE using classifier-free guidance, illustrating its ability to generate photorealistic images with complex compositions and artistic renderings.

Experimental Setup and Results

The authors trained a 3.5 billion parameter text-conditional diffusion model at $64 \times 64$ resolution, along with a 1.5 billion parameter upsampling model to increase the resolution to $256 \times 256$. They also trained a noised $64 \times 64$ ViT-L CLIP model for CLIP guidance. The models were trained on the same dataset as DALL-E. Fine-tuning was performed to support unconditional image generation and image inpainting, where random regions were erased during training.
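The inpainting fine-tune conditions the model on the partially erased image by widening its input. A hedged sketch of how such conditioning channels might be assembled (channel-first arrays; the exact channel layout is an assumption, not taken from the paper):

```python
import numpy as np

def inpainting_input(x_t, image, mask):
    """Stack noisy image, masked original, and keep-mask along channels.

    x_t, image: (C, H, W); mask: (1, H, W) with 1 = known pixel, 0 = erased.
    Channel ordering here is illustrative, not the paper's exact layout.
    """
    masked = image * mask            # erased regions zeroed out
    return np.concatenate([x_t, masked, mask], axis=0)
```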

Quantitative evaluations, including Precision/Recall, FID, Inception Score, and CLIP score, revealed that classifier-free guidance is nearly Pareto optimal in the FID vs. IS and Precision vs. Recall trade-offs. Human evaluations further confirmed the superiority of classifier-free guidance in both photorealism and caption similarity. GLIDE achieved competitive FID scores on MS-COCO without explicit training on this dataset and outperformed DALL-E in human evaluations, even when DALL-E used CLIP reranking.

Figure 2: Text-conditional image inpainting examples from GLIDE, showcasing its ability to seamlessly integrate new elements into existing images based on text prompts.

Text-Driven Image Editing with GLIDE

A significant contribution of the paper is the fine-tuning of GLIDE for image inpainting, enabling text-driven image editing. The model can realistically modify existing images based on text prompts, inserting new objects and generating convincing shadows and reflections. The paper also explores the integration of GLIDE with SDEdit, allowing users to combine sketches with text captions for controlled image modifications (Figure 3). Iterative inpainting using GLIDE facilitates the creation of complex scenes through a series of sequential edits (Figure 4).

Figure 4: Iterative scene creation using GLIDE, demonstrating its capability to build complex visual narratives through sequential text-guided inpainting operations.

Figure 3: Examples of text-conditional SDEdit, where GLIDE is used to transform sketches into realistic image edits based on textual descriptions.

Safety Considerations

The authors acknowledge the potential for misuse of their model, particularly in generating disinformation and deepfakes. To mitigate these risks, they trained a smaller, filtered model (GLIDE (filtered)) on a dataset with images of people, violent content, and hate symbols removed. While this reduces the model's capabilities in certain areas, it retains inpainting functionality, which still poses potential misuse concerns.

Limitations

The paper identifies limitations in handling unusual objects or scenarios, as well as the relatively slow sampling speed of the unoptimized model (15 seconds per image on a single A100 GPU).

Conclusion

The paper demonstrates the effectiveness of diffusion models for text-conditional image synthesis and editing. The finding that classifier-free guidance outperforms CLIP guidance is significant, as it simplifies the training process and improves image quality. The text-driven image editing capabilities of GLIDE, particularly through fine-tuned inpainting, represent a substantial advancement in AI-assisted content creation. While the authors address safety concerns through data filtering, the potential for misuse remains an important area for future research and mitigation strategies.
