
Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models

Published 23 May 2023 in cs.CV (arXiv:2305.13921v2)

Abstract: Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, they fail to semantically align the generated images with the prompts due to their limited compositional capabilities, leading to attribute leakage, entity leakage, and missing entities. In this paper, we propose a novel attention mask control strategy based on predicted object boxes to address these issues. In particular, we first train a BoxNet to predict a box for each entity that possesses the attribute specified in the prompt. Then, depending on the predicted boxes, a unique mask control is applied to the cross- and self-attention maps. Our approach produces a more semantically accurate synthesis by constraining the attention regions of each token in the prompt to the image. In addition, the proposed method is straightforward and effective and can be readily integrated into existing cross-attention-based T2I generators. We compare our approach to competing methods and demonstrate that it can faithfully convey the semantics of the original text to the generated content and achieve high availability as a ready-to-use plugin. Please refer to https://github.com/OPPOMente-Lab/attention-mask-control.

References (34)
  1. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324.
  2. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, 213–229. Springer.
  3. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4): 1–10.
  4. Training-Free Layout Control with Cross-Attention Guidance. arXiv preprint arXiv:2304.03373.
  5. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  6. Doersch, C. 2021. Tutorial on Variational Autoencoders. stat, 1050: 3.
  7. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12873–12883.
  8. Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis. In The Eleventh International Conference on Learning Representations.
  9. Generative adversarial networks. Communications of the ACM, 63(11): 139–144.
  10. SVDiff: Compact Parameter Space for Diffusion Fine-Tuning. arXiv preprint arXiv:2303.11305.
  11. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30.
  12. spaCy: Industrial-strength Natural Language Processing in Python.
  13. Shape-Guided Diffusion with Inside-Outside Attention. arXiv e-prints, arXiv–2212.
  14. Jiménez, Á. B. 2023. Mixture of Diffusers for scene composition and high resolution image generation. arXiv preprint arXiv:2302.02412.
  15. GLIGEN: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22511–22521.
  16. LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models. arXiv preprint arXiv:2305.13655.
  17. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, 740–755. Springer.
  18. Compositional visual generation with composable diffusion models. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVII, 423–439. Springer.
  19. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499.
  20. Cones: Concept Neurons in Diffusion Models for Customized Generation. arXiv preprint arXiv:2303.05125.
  21. Directed Diffusion: Direct Control of Object Placement through Attention Guidance. arXiv preprint arXiv:2302.13153.
  22. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
  23. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.
  24. Zero-shot text-to-image generation. In International Conference on Machine Learning, 8821–8831. PMLR.
  25. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 658–666.
  26. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.
  27. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, 234–241. Springer.
  28. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35: 36479–36494.
  29. End-to-end people detection in crowded scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2325–2333.
  30. Attention is all you need. Advances in neural information processing systems, 30.
  31. Harnessing the spatial-temporal attention of diffusion models for high-fidelity text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7766–7776.
  32. LayoutTransformer: Scene layout generation with conceptual and spatial diversity. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 3732–3741.
  33. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4): 1–39.
  34. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3836–3847.

Summary

  • The paper introduces a two-stage control mechanism combining BoxNet and attention mask control to address compositional fidelity issues in T2I diffusion models.
  • It leverages predicted object boxes and unique binary masks to enforce spatial constraints, effectively mitigating attribute and entity leakage.
  • Experimental results demonstrate significant improvements, including an increase in minimum object score from 0.3973 to 0.6028 on COCO, validating enhanced compositional control.


Introduction and Motivation

Text-to-image (T2I) diffusion models, such as Stable Diffusion (SD), have achieved strong results in generating high-quality images from textual prompts. However, these models exhibit notable deficiencies in compositional fidelity, particularly when prompts describe multiple entities with distinct attributes. Typical failure modes include attribute leakage (attributes of one entity incorrectly appearing on another), entity leakage (overlapping or duplicated entities), and missing entities (failure to generate all described objects). The root cause is traced to inaccurate and overlapping attention regions in the cross- and self-attention maps of the diffusion model's U-Net, which lack explicit constraints to enforce semantic alignment between text tokens and image regions.

Methodology

The proposed approach introduces a two-stage control mechanism for compositional T2I synthesis: (1) BoxNet, an object box prediction module, and (2) attention mask control over the diffusion model's attention maps.

BoxNet: Step-wise Object Box Prediction

BoxNet is trained to predict bounding boxes for each entity-attribute pair parsed from the input prompt at every denoising step of the diffusion process. The architecture comprises a frozen SD U-Net and text encoder, followed by an encoder-decoder transformer. The U-Net extracts multi-scale features from the noisy latent, which are concatenated and linearly projected. The transformer decoder receives entity queries (embeddings of parsed entity-attribute phrases) and outputs bounding box predictions. Training is performed on the COCO dataset, using a bipartite matching loss with a strong penalty for class mismatches and a combination of L1 and generalized IoU losses for box regression.
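
The training objective above can be made concrete with a small sketch. The GIoU term and the bipartite matching step are shown in plain Python; the function names and the brute-force matcher are illustrative (DETR-style training uses the Hungarian algorithm), not the paper's code.

```python
from itertools import permutations

def generalized_iou(box_a, box_b):
    """GIoU for axis-aligned boxes in (x1, y1, x2, y2) format.

    GIoU = IoU - |C minus (A union B)| / |C|, where C is the smallest
    enclosing box. It lies in (-1, 1] and, unlike plain IoU, still
    varies (and so gives a training signal) even for disjoint boxes.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection (zero when the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Smallest box C enclosing both A and B.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)

    return inter / union - (cw * ch - union) / (cw * ch)

def match_predictions(cost):
    """Bipartite matching of predicted boxes to ground-truth entities.

    cost[i][j] is the cost of matching prediction i to entity j
    (e.g. class-mismatch penalty + L1 + (1 - GIoU)). Brute force over
    permutations stands in for the Hungarian algorithm and is only
    viable for a handful of entities.
    """
    n = len(cost)
    return list(min(permutations(range(n)),
                    key=lambda p: sum(cost[i][p[i]] for i in range(n))))
```

For each matched pair, the regression loss is then a weighted sum of the L1 distance between box coordinates and 1 − GIoU; the weights are hyperparameters.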

Attention Mask Control

Predicted boxes are converted into unique binary masks using a 2D Gaussian-based assignment, ensuring non-overlapping regions of interest for each entity. These masks are then used to constrain both cross-attention and self-attention maps in the U-Net at every denoising step:

  • Cross-attention control: For each entity's token indices, the corresponding columns in the cross-attention map are masked to only attend to the assigned spatial region.
  • Self-attention control: For each entity, the self-attention map is masked such that only pixels within the entity's region attend to each other, promoting spatial coherence.

This explicit masking enforces a one-to-one mapping between text entities and image regions, directly addressing attribute and entity leakage.
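
The two masking steps above can be sketched end to end. This is an illustrative reconstruction (the names and the exact Gaussian parameterization are ours, not the released code): each box is scored with a 2D Gaussian centered on it, every covered pixel is assigned to its highest-scoring entity so the masks are unique and disjoint, and cross-attention logits for an entity's tokens are suppressed outside its mask before the softmax.

```python
import numpy as np

def unique_masks_from_boxes(boxes, h, w):
    """Assign each pixel to at most one entity via per-box 2D Gaussians.

    boxes: list of (x1, y1, x2, y2) in pixel coordinates. Returns a
    (num_boxes, h, w) array of disjoint binary masks.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    scores = np.zeros((len(boxes), h, w))
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        sx, sy = max(x2 - x1, 1e-6) / 2, max(y2 - y1, 1e-6) / 2
        g = np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2)
        inside = (xs >= x1) & (xs < x2) & (ys >= y1) & (ys < y2)
        scores[i] = g * inside  # Gaussian score, zero outside the box
    winner = scores.argmax(axis=0)    # best-scoring entity per pixel
    covered = scores.max(axis=0) > 0  # pixel lies inside some box
    masks = np.stack([(winner == i) & covered for i in range(len(boxes))])
    return masks.astype(np.float64)

def masked_cross_attention(logits, token_masks):
    """Mask cross-attention logits, then softmax over the token axis.

    logits: (num_pixels, num_tokens); token_masks: (num_tokens,
    num_pixels) binary, with all-ones rows for non-entity tokens.
    A pixel outside an entity's mask gets -inf for that entity's token.
    """
    masked = np.where(token_masks.T > 0, logits, -np.inf)
    masked -= masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)
```

Because each pixel wins at most one Gaussian, no two entity masks overlap, which is what enforces the one-to-one entity-to-region mapping.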

(Figure 1)

Figure 1: Overview of the BoxNet-based T2I generation pipeline, showing the integration of box prediction and attention mask control into the diffusion process.

Implementation and Integration

The method is implemented as a plugin that can be incorporated into any cross-attention-based T2I diffusion model without fine-tuning the base model. The BoxNet module is trained once and then used to guide inference by dynamically predicting boxes and applying mask control at each denoising step. The approach is compatible with SD and its variants, such as Attend-and-Excite (AAE) and GLIGEN, and can be combined with other control strategies (e.g., AAE's gradient-based subject activation).
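
The plugin pattern can be illustrated with a toy attention layer whose processor is swappable: installing the mask control means replacing the processor, while the layer's weights stay untouched, which is why no base-model fine-tuning is required. All class and function names here are ours (real pipelines such as diffusers expose a similar attention-processor hook).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CrossAttention:
    """Toy stand-in for a U-Net cross-attention block with a
    swappable attention processor."""
    def __init__(self):
        # Default behaviour: plain softmax over the token axis.
        self.processor = lambda logits: softmax(logits, axis=1)

    def __call__(self, logits):
        # logits: (num_pixels, num_tokens) attention scores.
        return self.processor(logits)

def make_masked_processor(token_masks):
    """Build a processor that blocks entity tokens outside their masks.

    token_masks: (num_tokens, num_pixels) binary array; all-ones rows
    for non-entity tokens leave those tokens unconstrained.
    """
    def processor(logits):
        blocked = np.where(token_masks.T > 0, logits, -np.inf)
        return softmax(blocked, axis=1)
    return processor

# "Installing the plugin" is just swapping the processor:
attn = CrossAttention()
entity_masks = np.ones((3, 4))
entity_masks[1] = [1, 1, 0, 0]  # token 1 may only be attended from pixels 0-1
attn.processor = make_masked_processor(entity_masks)
```

At inference, the controller would rebuild `token_masks` from BoxNet's predicted boxes at every denoising step before the U-Net forward pass.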

Experimental Results

Qualitative Analysis

The method demonstrates substantial improvements in compositional fidelity across both COCO and open-domain (NON-COCO) datasets. Generated images exhibit correct entity-attribute bindings, absence of spurious or missing entities, and improved spatial separation between objects. In complex prompts with multiple entities and attributes, the approach consistently outperforms SD, StructureDiffusion, and AAE, as evidenced by visual comparisons.

Quantitative Evaluation

Multiple metrics are used to assess performance:

  • Minimum Object Score (Grounding DINO): Measures the lowest object detection score for entities in the prompt, directly reflecting entity presence and localization. The proposed method achieves a significant increase (e.g., from 0.3973 to 0.6028 on COCO) over baselines.
  • Subjective Fidelity Score (User Study): Human annotators confirm higher rates of correct entity-attribute generation.
  • FID: Image quality is maintained or slightly improved compared to SD, indicating that compositional control does not degrade visual fidelity.
  • CLIP-based metrics: While included for completeness, these are shown to be less sensitive to attribute binding errors.
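
The minimum object score reduces to a few lines once a detector has scored each entity phrase: the image is scored by its worst-detected entity, so a single missing entity caps the whole score. The detector interface below is a stand-in for Grounding DINO, and the phrases are hypothetical.

```python
def minimum_object_score(detections):
    """Minimum object score over the entities of one prompt.

    detections maps each entity phrase to the list of confidences
    the detector produced for boxes matched to that phrase. An entity
    with no detections contributes 0, so one missing entity drags the
    image's score to 0 regardless of the others.
    """
    return min((max(scores) if scores else 0.0)
               for scores in detections.values())

# Hypothetical detector output for "a red apple next to a green cup":
score = minimum_object_score({
    "a red apple": [0.91, 0.40],
    "a green cup": [0.73],
})  # -> 0.73, the weaker of the two best detections
```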

Ablation studies confirm that both unique mask assignment and self-attention control are critical for optimal performance. The method generalizes well to open-domain entities not seen during BoxNet training.

Practical Implications

The approach provides a practical, modular solution for improving compositional accuracy in T2I diffusion models. It does not require user-supplied layouts or bounding boxes, instead inferring spatial constraints from text alone. The plugin design allows for straightforward integration into existing pipelines and can be combined with other control methods. The method is particularly valuable for applications requiring precise multi-object generation, such as data augmentation, content creation, and visual grounding tasks.

(Figure 2)

Figure 2: The user interface of the image annotation tool used for user studies, enabling efficient evaluation of entity-attribute fidelity.

Limitations and Future Directions

While the method substantially improves compositional fidelity, some limitations remain. In rare cases, image quality may degrade due to aggressive spatial masking, especially in highly cluttered scenes. The reliance on accurate text parsing and box prediction introduces potential failure points. Future work could explore joint training of BoxNet and the diffusion model, adaptive mask softening, and integration with more advanced language understanding modules to handle complex relational prompts.

Conclusion

This work presents a principled approach to compositional T2I synthesis by introducing explicit spatial constraints via attention mask control, guided by dynamically predicted object boxes. The method effectively mitigates attribute leakage, entity leakage, and missing entities, and is readily deployable as a plugin for existing diffusion-based T2I generators. The results establish a new standard for compositional fidelity in text-to-image synthesis and open avenues for further research in controllable generative modeling.
