Development and Enhancement of Text-to-Image Diffusion Models

Published 7 Mar 2025 in cs.CV and cs.AI | (2503.05149v1)

Abstract: This research focuses on the development and enhancement of text-to-image denoising diffusion models, addressing key challenges such as limited sample diversity and training instability. By incorporating Classifier-Free Guidance (CFG) and Exponential Moving Average (EMA) techniques, this study significantly improves image quality, diversity, and stability. Utilizing Hugging Face's state-of-the-art text-to-image generation model, the proposed enhancements establish new benchmarks in generative AI. This work explores the underlying principles of diffusion models, implements advanced strategies to overcome existing limitations, and presents a comprehensive evaluation of the improvements achieved. Results demonstrate substantial progress in generating stable, diverse, and high-quality images from textual descriptions, advancing the field of generative artificial intelligence and providing new foundations for future applications.

Keywords: Text-to-image, Diffusion model, Classifier-free guidance, Exponential moving average, Image generation.

Summary

  • The paper explores enhancements to text-to-image diffusion models aimed at improving sample diversity and training stability.
  • Key methodologies include integrating Classifier-Free Guidance (CFG) to enhance image quality via conditional/unconditional text embeddings and employing Exponential Moving Average (EMA) for stable training.
  • Quantitative results show the enhanced model achieved a Fréchet Inception Distance (FID) of 1088.94, a significant improvement over the baseline's 1332.33, indicating better generated image quality.

This paper explores enhancements to text-to-image diffusion models, specifically addressing limitations in sample diversity and training instability.

  • The study integrates Classifier-Free Guidance (CFG) to improve image quality by conditioning the model on both conditional and unconditional text embeddings. A configurable guidance scale $w$ adjusts the noise prediction $\hat{\epsilon}$ via $\hat{\epsilon} = \epsilon_\theta(x_t, t, c) + w\,(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))$, where $\epsilon_\theta$ is the noise prediction model, $x_t$ is the noisy image at time step $t$, $c$ is the conditional text embedding, and $\emptyset$ denotes the unconditional (empty) prompt.
  • Exponential Moving Average (EMA) is employed to stabilize training: after each training step, the EMA parameters $\theta_{EMA}$ are updated as $\theta_{EMA} = \alpha\,\theta_{EMA} + (1 - \alpha)\,\theta$, where $\alpha$ is the decay factor and $\theta$ denotes the current model parameters.
  • Quantitative evaluation with the Fréchet Inception Distance (FID) shows the enhanced model scoring 1088.94 versus the baseline's 1332.33, a marked improvement in generated image quality and realism; the authors attribute the unusually high absolute scores of both models to the mismatch between the real reference images and the creative text prompts.
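The CFG combination described above can be sketched in a few lines. The function and argument names (`cfg_noise`, `eps_cond`, `eps_uncond`) are illustrative, not from the paper; in practice `eps_cond` and `eps_uncond` would come from two forward passes of the same noise prediction network, one with the text embedding and one with the empty prompt:

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w):
    """Guided noise prediction in the paper's form:
    hat_eps = eps(x_t, t, c) + w * (eps(x_t, t, c) - eps(x_t, t, empty))."""
    return eps_cond + w * (eps_cond - eps_uncond)
```

With $w = 0$ this reduces to the plain conditional prediction; larger $w$ pushes the prediction further away from the unconditional one, trading diversity for prompt adherence.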
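The EMA update is likewise a one-liner per parameter; in a real training loop it would be applied to every tensor in the model after each optimizer step. This minimal sketch uses an illustrative function name and a typical decay value, not one stated in the paper:

```python
def ema_update(theta_ema, theta, alpha=0.999):
    """One EMA step: theta_EMA <- alpha * theta_EMA + (1 - alpha) * theta.
    A decay close to 1 makes the averaged weights change slowly,
    smoothing out step-to-step noise in the raw parameters."""
    return alpha * theta_ema + (1.0 - alpha) * theta
```

At inference time, sampling from the EMA weights rather than the raw weights is what yields the stability benefit.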
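For reference, FID compares two Gaussians fitted to Inception features of real and generated images. The sketch below computes the standard closed-form distance from precomputed feature statistics (means and covariances), which the paper does not supply; the interface and the eigendecomposition-based matrix square root are implementation choices made here, not the authors':

```python
import numpy as np

def fid_score(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    # Matrix square root of sigma1 @ sigma2 via eigendecomposition;
    # the real part discards small imaginary numerical noise.
    vals, vecs = np.linalg.eig(sigma1 @ sigma2)
    covmean = (vecs * np.sqrt(vals.astype(complex)) @ np.linalg.inv(vecs)).real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical statistics give a score of 0; lower is better, which is why the drop from 1332.33 to 1088.94 indicates improvement even though both absolute values are high.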


Authors (1)
