Development and Enhancement of Text-to-Image Diffusion Models

Published 7 Mar 2025 in cs.CV and cs.AI | (2503.05149v1)

Abstract: This research focuses on the development and enhancement of text-to-image denoising diffusion models, addressing key challenges such as limited sample diversity and training instability. By incorporating Classifier-Free Guidance (CFG) and Exponential Moving Average (EMA) techniques, this study significantly improves image quality, diversity, and stability. Utilizing Hugging Face's state-of-the-art text-to-image generation model, the proposed enhancements establish new benchmarks in generative AI. This work explores the underlying principles of diffusion models, implements advanced strategies to overcome existing limitations, and presents a comprehensive evaluation of the improvements achieved. Results demonstrate substantial progress in generating stable, diverse, and high-quality images from textual descriptions, advancing the field of generative artificial intelligence and providing new foundations for future applications.

Keywords: Text-to-image, Diffusion model, Classifier-free guidance, Exponential moving average, Image generation.

Summary

  • The paper explores enhancements to text-to-image diffusion models aimed at improving sample diversity and training stability.
  • Key methodologies include integrating Classifier-Free Guidance (CFG) to enhance image quality via conditional/unconditional text embeddings and employing Exponential Moving Average (EMA) for stable training.
  • Quantitative results show the enhanced model achieved a Fréchet Inception Distance (FID) of 1088.94, a significant improvement over the baseline's 1332.33, indicating better generated image quality.

This paper explores enhancements to text-to-image diffusion models, specifically addressing limitations in sample diversity and training instability.

  • The study integrates Classifier-Free Guidance (CFG) to improve image quality by conditioning the model on both conditional and unconditional text embeddings. A configurable guidance scale $w$ adjusts the noise prediction $\hat{\epsilon}$ via $\hat{\epsilon} = \epsilon_\theta(x_t, t, c) + w\,(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))$, where $\epsilon_\theta$ is the noise prediction model, $x_t$ is the noisy image at time step $t$, $c$ is the conditional text embedding, and $\emptyset$ denotes the unconditional (empty) prompt.
  • Exponential Moving Average (EMA) is employed to stabilize training: after each training step, the EMA parameters $\theta_{EMA}$ are updated as $\theta_{EMA} = \alpha\,\theta_{EMA} + (1 - \alpha)\,\theta$, where $\alpha$ is the decay factor and $\theta$ denotes the current model parameters.
  • Quantitative evaluation with the Fréchet Inception Distance (FID) shows the enhanced model scoring 1088.94 versus the baseline's 1332.33, a marked improvement in generated image quality and realism; the authors attribute the unusually high absolute scores of both models to the mismatch between the real reference images and the creative text prompts.
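The CFG combination described above can be sketched in a few lines. The function and argument names (`cfg_noise`, `eps_cond`, `eps_uncond`) are illustrative, not from the paper; in practice `eps_cond` and `eps_uncond` would come from two forward passes of the same noise prediction network, one with the text embedding and one with the empty prompt:

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w):
    """Guided noise prediction in the paper's form:
    hat_eps = eps(x_t, t, c) + w * (eps(x_t, t, c) - eps(x_t, t, empty))."""
    return eps_cond + w * (eps_cond - eps_uncond)
```

With $w = 0$ this reduces to the plain conditional prediction; larger $w$ pushes the prediction further away from the unconditional one, trading diversity for prompt adherence.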
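The EMA update is likewise a one-liner per parameter; in a real training loop it would be applied to every tensor in the model after each optimizer step. This minimal sketch uses an illustrative function name and a typical decay value, not one stated in the paper:

```python
def ema_update(theta_ema, theta, alpha=0.999):
    """One EMA step: theta_EMA <- alpha * theta_EMA + (1 - alpha) * theta.
    A decay close to 1 makes the averaged weights change slowly,
    smoothing out step-to-step noise in the raw parameters."""
    return alpha * theta_ema + (1.0 - alpha) * theta
```

At inference time, sampling from the EMA weights rather than the raw weights is what yields the stability benefit.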
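For reference, FID compares two Gaussians fitted to Inception features of real and generated images. The sketch below computes the standard closed-form distance from precomputed feature statistics (means and covariances), which the paper does not supply; the interface and the eigendecomposition-based matrix square root are implementation choices made here, not the authors':

```python
import numpy as np

def fid_score(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    # Matrix square root of sigma1 @ sigma2 via eigendecomposition;
    # the real part discards small imaginary numerical noise.
    vals, vecs = np.linalg.eig(sigma1 @ sigma2)
    covmean = (vecs * np.sqrt(vals.astype(complex)) @ np.linalg.inv(vecs)).real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical statistics give a score of 0; lower is better, which is why the drop from 1332.33 to 1088.94 indicates improvement even though both absolute values are high.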


Authors (1)
