Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

Published 25 Mar 2024 in cs.CV, cs.AI, cs.GR, and cs.LG | (2403.16990v1)

Abstract: Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model's attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.

Summary

The paper introduces a training-free 'Bounded Attention' method that mitigates semantic leakage in multi-subject text-to-image generation.
The method employs bounded guidance and subject-specific attention masks during the denoising process to preserve the individuality of similar subjects.
Experiments on Stable Diffusion and SDXL demonstrate improved generation accuracy, offering enhanced control for precise image synthesis.

Bounded Attention for Multi-Subject Text-to-Image Generation

The paper proposes a novel approach to address the challenges faced by existing text-to-image diffusion models in generating scenes with multiple, semantically or visually similar subjects. The authors identify a phenomenon termed "semantic leakage," wherein attention layers in the diffusion models inadvertently blend features between distinct subjects during the denoising process. This blending interferes with the model's ability to generate images that faithfully represent given complex prompts.

Methodology

The central contribution of the paper is the introduction of "Bounded Attention," a training-free method aimed at constraining the information flow in these generative models. Bounded Attention operates by modifying the attention computation to mitigate feature leakage, thereby enabling better control over the individuality of each subject in the generated image.
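As a rough illustration of what constraining the information flow can mean at the level of a single attention layer, the sketch below compares ordinary attention with a masked variant on three toy positions: one background position and two look-alike subjects. Everything in it is an assumption made for illustration, not the authors' implementation: the helper names (`attention`, `subject_mask`), the toy features, and the masking rule (a subject position may attend to the background and to itself, but not to another subject; background positions are left unrestricted).

```python
import math


def softmax(logits):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, allowed=None):
    """Scaled dot-product attention over lists of vectors.

    allowed[i][j] is True when query i may attend to key j.
    allowed=None gives ordinary, unmasked attention.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        logits = []
        for j, k in enumerate(keys):
            score = scale * sum(a * b for a, b in zip(q, k))
            if allowed is not None and not allowed[i][j]:
                score = -math.inf  # blocked key contributes nothing
            logits.append(score)
        weights = softmax(logits)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def subject_mask(labels):
    """Hypothetical masking rule built from per-position labels (0 = background)."""
    return [[q == 0 or k == 0 or k == q for k in labels] for q in labels]


# Positions: background, subject 1, subject 2.
labels = [0, 1, 2]
# The two subjects share the same query/key feature, mimicking visual similarity.
feats = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
# Value channel 0 carries subject 1's appearance, channel 1 carries subject 2's.
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

plain = attention(feats, feats, values)
bounded = attention(feats, feats, values, subject_mask(labels))

print("unmasked output for subject 1:", plain[1])
print("masked output for subject 1:  ", bounded[1])
```

In the unmasked case, subject 1's output picks up a sizeable share of subject 2's value channel, which is the toy analogue of leakage. With the mask, that channel is exactly zero for subject 1, while the background output is unchanged.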

The approach operates in two modes, which overlap during the early denoising steps:

Bounded Guidance: During the initial denoising steps, a guidance loss is applied that steers cross- and self-attention maps to align with the intended subject layouts. The loss is strategically designed to guide the latent representation towards an accurate positioning of subjects without aggressive mask constraints.
Bounded Denoising: Throughout the entire denoising process, subject-specific attention masks are applied to both cross- and self-attention layers, preventing unwanted information leakage between subjects while still allowing interaction with the background to maintain image consistency.
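To make the idea of a layout guidance loss concrete, here is a minimal sketch of one plausible form: the loss is small when a subject's attention mass falls inside its bounding box and grows as mass spills outside. The exact formula, the function names (`box_mask`, `layout_loss`), and the weighting parameter `alpha` are assumptions for illustration and are not taken from the paper. In a real pipeline the gradient of such a loss with respect to the latent would typically be used to nudge the latent during the early steps; that part needs an autodiff framework and is not shown here.

```python
def box_mask(height, width, box):
    """Binary mask for box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return [
        [1.0 if top <= r < bottom and left <= c < right else 0.0 for c in range(width)]
        for r in range(height)
    ]


def layout_loss(attn_map, box, alpha=1.0):
    """Illustrative loss: 1 - inside / (inside + alpha * outside).

    attn_map is a 2D list of non-negative attention values for one subject.
    alpha (hypothetical) controls how strongly attention outside the box
    is penalized.
    """
    height, width = len(attn_map), len(attn_map[0])
    mask = box_mask(height, width, box)
    inside = sum(
        attn_map[r][c] * mask[r][c] for r in range(height) for c in range(width)
    )
    outside = sum(
        attn_map[r][c] * (1.0 - mask[r][c]) for r in range(height) for c in range(width)
    )
    denom = inside + alpha * outside
    if denom <= 0.0:
        return 1.0  # no attention anywhere: treat as fully misaligned
    return 1.0 - inside / denom


box = (0, 0, 2, 2)  # left half of a 2x4 map

# All attention inside the box.
focused = [[0.25, 0.25, 0.0, 0.0], [0.25, 0.25, 0.0, 0.0]]
# Attention spread uniformly over the whole map.
spread = [[0.125] * 4, [0.125] * 4]

print("focused:", layout_loss(focused, box))
print("spread: ", layout_loss(spread, box))
```

The focused map yields a loss of zero, while the uniform map is penalized; raising `alpha` increases the penalty for attention that lands outside the box.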

The method is validated on both Stable Diffusion and SDXL diffusion models, showcasing its effectiveness compared to existing layout-guided generation methods.

Results and Implications

Experiments demonstrate that Bounded Attention significantly reduces semantic leakage, allowing for accurate generation of multiple subjects with distinct attributes even in scenarios where subjects share visual similarity. This is achieved without any retraining or fine-tuning, offering an efficient solution applicable to pre-existing models.

Quantitatively, the approach outperforms state-of-the-art layout-guided methods on complex multi-subject prompts, including both methods that require training and training-free ones.

Practically, this technique enhances user control in applications demanding precise image synthesis from textual descriptions. Theoretically, it opens new avenues for research into attention mechanisms and their role in multi-subject generative tasks.

Future Directions

The paper lays the groundwork for further work on automatically generating seeds aligned with complex prompts and on more advanced segmentation techniques during the denoising stages. The method may also be extended to other generative frameworks that rely heavily on attention mechanisms, contributing to a broader understanding of feature alignment in high-fidelity image synthesis.

In summary, Bounded Attention provides a robust framework for improving multi-subject text-to-image generation by addressing intrinsic architectural biases in diffusion models. This work not only advances the practical capabilities of generative models but also deepens the theoretical understanding of their operational dynamics.