
Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models

Published 23 May 2023 in cs.CV (arXiv:2305.13921v2)

Abstract: Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, they fail to semantically align the generated images with the prompts due to their limited compositional capabilities, leading to attribute leakage, entity leakage, and missing entities. In this paper, we propose a novel attention mask control strategy based on predicted object boxes to address these issues. In particular, we first train a BoxNet to predict a box for each entity that possesses the attribute specified in the prompt. Then, depending on the predicted boxes, a unique mask control is applied to the cross- and self-attention maps. Our approach produces a more semantically accurate synthesis by constraining the attention regions of each token in the prompt to the image. In addition, the proposed method is straightforward and effective and can be readily integrated into existing cross-attention-based T2I generators. We compare our approach to competing methods and demonstrate that it can faithfully convey the semantics of the original text to the generated content and achieve high availability as a ready-to-use plugin. Please refer to https://github.com/OPPOMente-Lab/attention-mask-control.

References (34)
  1. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324.
  2. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, 213–229. Springer.
  3. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4): 1–10.
  4. Training-Free Layout Control with Cross-Attention Guidance. arXiv preprint arXiv:2304.03373.
  5. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  6. Doersch, C. 2021. Tutorial on Variational Autoencoders. stat, 1050: 3.
  7. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12873–12883.
  8. Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis. In The Eleventh International Conference on Learning Representations.
  9. Generative adversarial networks. Communications of the ACM, 63(11): 139–144.
  10. SVDiff: Compact Parameter Space for Diffusion Fine-Tuning. arXiv preprint arXiv:2303.11305.
  11. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30.
  12. spaCy: Industrial-strength Natural Language Processing in Python.
  13. Shape-Guided Diffusion with Inside-Outside Attention. arXiv e-prints, arXiv–2212.
  14. Jiménez, Á. B. 2023. Mixture of Diffusers for scene composition and high resolution image generation. arXiv preprint arXiv:2302.02412.
  15. GLIGEN: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22511–22521.
  16. LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models. arXiv preprint arXiv:2305.13655.
  17. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, 740–755. Springer.
  18. Compositional visual generation with composable diffusion models. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVII, 423–439. Springer.
  19. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499.
  20. Cones: Concept Neurons in Diffusion Models for Customized Generation. arXiv preprint arXiv:2303.05125.
  21. Directed Diffusion: Direct Control of Object Placement through Attention Guidance. arXiv preprint arXiv:2302.13153.
  22. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
  23. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.
  24. Zero-shot text-to-image generation. In International Conference on Machine Learning, 8821–8831. PMLR.
  25. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 658–666.
  26. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.
  27. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, 234–241. Springer.
  28. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35: 36479–36494.
  29. End-to-end people detection in crowded scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2325–2333.
  30. Attention is all you need. Advances in neural information processing systems, 30.
  31. Harnessing the spatial-temporal attention of diffusion models for high-fidelity text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7766–7776.
  32. LayoutTransformer: Scene layout generation with conceptual and spatial diversity. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 3732–3741.
  33. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4): 1–39.
  34. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3836–3847.

Summary

  • The paper introduces a two-stage control mechanism combining BoxNet and attention mask control to address compositional fidelity issues in T2I diffusion models.
  • It leverages predicted object boxes and unique binary masks to enforce spatial constraints, effectively mitigating attribute and entity leakage.
  • Experimental results demonstrate significant improvements, including an increase in minimum object score from 0.3973 to 0.6028 on COCO, validating enhanced compositional control.


Introduction and Motivation

Text-to-image (T2I) diffusion models, such as Stable Diffusion (SD), have achieved strong results in generating high-quality images from textual prompts. However, these models exhibit notable deficiencies in compositional fidelity, particularly when prompts describe multiple entities with distinct attributes. Typical failure modes include attribute leakage (attributes of one entity incorrectly appearing on another), entity leakage (overlapping or duplicated entities), and missing entities (failure to generate all described objects). The root cause is traced to inaccurate and overlapping attention regions in the cross- and self-attention maps of the diffusion model's U-Net, which lack explicit constraints to enforce semantic alignment between text tokens and image regions.

Methodology

The proposed approach introduces a two-stage control mechanism for compositional T2I synthesis: (1) BoxNet, an object box prediction module, and (2) attention mask control over the diffusion model's attention maps.

BoxNet: Step-wise Object Box Prediction

BoxNet is trained to predict bounding boxes for each entity-attribute pair parsed from the input prompt at every denoising step of the diffusion process. The architecture comprises a frozen SD U-Net and text encoder, followed by an encoder-decoder transformer. The U-Net extracts multi-scale features from the noisy latent, which are concatenated and linearly projected. The transformer decoder receives entity queries (embeddings of parsed entity-attribute phrases) and outputs bounding box predictions. Training is performed on the COCO dataset, using a bipartite matching loss with a strong penalty for class mismatches and a combination of L1 and generalized IoU losses for box regression.
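
The training objective above can be made concrete with a small sketch. The GIoU term and the bipartite matching step are shown in plain Python; the function names and the brute-force matcher are illustrative (DETR-style training uses the Hungarian algorithm), not the paper's code.

```python
from itertools import permutations

def generalized_iou(box_a, box_b):
    """GIoU for axis-aligned boxes in (x1, y1, x2, y2) format.

    GIoU = IoU - |C minus (A union B)| / |C|, where C is the smallest
    enclosing box. It lies in (-1, 1] and, unlike plain IoU, still
    varies (and so gives a training signal) even for disjoint boxes.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection (zero when the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Smallest box C enclosing both A and B.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)

    return inter / union - (cw * ch - union) / (cw * ch)

def match_predictions(cost):
    """Bipartite matching of predicted boxes to ground-truth entities.

    cost[i][j] is the cost of matching prediction i to entity j
    (e.g. class-mismatch penalty + L1 + (1 - GIoU)). Brute force over
    permutations stands in for the Hungarian algorithm and is only
    viable for a handful of entities.
    """
    n = len(cost)
    return list(min(permutations(range(n)),
                    key=lambda p: sum(cost[i][p[i]] for i in range(n))))
```

For each matched pair, the regression loss is then a weighted sum of the L1 distance between box coordinates and 1 − GIoU; the weights are hyperparameters.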

Attention Mask Control

Predicted boxes are converted into unique binary masks using a 2D Gaussian-based assignment, ensuring non-overlapping regions of interest for each entity. These masks are then used to constrain both cross-attention and self-attention maps in the U-Net at every denoising step:

  • Cross-attention control: For each entity's token indices, the corresponding columns in the cross-attention map are masked to only attend to the assigned spatial region.
  • Self-attention control: For each entity, the self-attention map is masked such that only pixels within the entity's region attend to each other, promoting spatial coherence.

This explicit masking enforces a one-to-one mapping between text entities and image regions, directly addressing attribute and entity leakage.
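
The two masking steps above can be sketched end to end. This is an illustrative reconstruction (the names and the exact Gaussian parameterization are ours, not the released code): each box is scored with a 2D Gaussian centered on it, every covered pixel is assigned to its highest-scoring entity so the masks are unique and disjoint, and cross-attention logits for an entity's tokens are suppressed outside its mask before the softmax.

```python
import numpy as np

def unique_masks_from_boxes(boxes, h, w):
    """Assign each pixel to at most one entity via per-box 2D Gaussians.

    boxes: list of (x1, y1, x2, y2) in pixel coordinates. Returns a
    (num_boxes, h, w) array of disjoint binary masks.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    scores = np.zeros((len(boxes), h, w))
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        sx, sy = max(x2 - x1, 1e-6) / 2, max(y2 - y1, 1e-6) / 2
        g = np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2)
        inside = (xs >= x1) & (xs < x2) & (ys >= y1) & (ys < y2)
        scores[i] = g * inside  # Gaussian score, zero outside the box
    winner = scores.argmax(axis=0)    # best-scoring entity per pixel
    covered = scores.max(axis=0) > 0  # pixel lies inside some box
    masks = np.stack([(winner == i) & covered for i in range(len(boxes))])
    return masks.astype(np.float64)

def masked_cross_attention(logits, token_masks):
    """Mask cross-attention logits, then softmax over the token axis.

    logits: (num_pixels, num_tokens); token_masks: (num_tokens,
    num_pixels) binary, with all-ones rows for non-entity tokens.
    A pixel outside an entity's mask gets -inf for that entity's token.
    """
    masked = np.where(token_masks.T > 0, logits, -np.inf)
    masked -= masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)
```

Because each pixel wins at most one Gaussian, no two entity masks overlap, which is what enforces the one-to-one entity-to-region mapping.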

(Figure 1)

Figure 1: Overview of the BoxNet-based T2I generation pipeline, showing the integration of box prediction and attention mask control into the diffusion process.

Implementation and Integration

The method is implemented as a plugin that can be incorporated into any cross-attention-based T2I diffusion model without fine-tuning the base model. The BoxNet module is trained once and then used to guide inference by dynamically predicting boxes and applying mask control at each denoising step. The approach is compatible with SD and its variants, such as Attend-and-Excite (AAE) and GLIGEN, and can be combined with other control strategies (e.g., AAE's gradient-based subject activation).
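
The plugin pattern can be illustrated with a toy attention layer whose processor is swappable: installing the mask control means replacing the processor, while the layer's weights stay untouched, which is why no base-model fine-tuning is required. All class and function names here are ours (real pipelines such as diffusers expose a similar attention-processor hook).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CrossAttention:
    """Toy stand-in for a U-Net cross-attention block with a
    swappable attention processor."""
    def __init__(self):
        # Default behaviour: plain softmax over the token axis.
        self.processor = lambda logits: softmax(logits, axis=1)

    def __call__(self, logits):
        # logits: (num_pixels, num_tokens) attention scores.
        return self.processor(logits)

def make_masked_processor(token_masks):
    """Build a processor that blocks entity tokens outside their masks.

    token_masks: (num_tokens, num_pixels) binary array; all-ones rows
    for non-entity tokens leave those tokens unconstrained.
    """
    def processor(logits):
        blocked = np.where(token_masks.T > 0, logits, -np.inf)
        return softmax(blocked, axis=1)
    return processor

# "Installing the plugin" is just swapping the processor:
attn = CrossAttention()
entity_masks = np.ones((3, 4))
entity_masks[1] = [1, 1, 0, 0]  # token 1 may only be attended from pixels 0-1
attn.processor = make_masked_processor(entity_masks)
```

At inference, the controller would rebuild `token_masks` from BoxNet's predicted boxes at every denoising step before the U-Net forward pass.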

Experimental Results

Qualitative Analysis

The method demonstrates substantial improvements in compositional fidelity across both COCO and open-domain (NON-COCO) datasets. Generated images exhibit correct entity-attribute bindings, absence of spurious or missing entities, and improved spatial separation between objects. In complex prompts with multiple entities and attributes, the approach consistently outperforms SD, StructureDiffusion, and AAE, as evidenced by visual comparisons.

Quantitative Evaluation

Multiple metrics are used to assess performance:

  • Minimum Object Score (Grounding DINO): Measures the lowest object detection score for entities in the prompt, directly reflecting entity presence and localization. The proposed method achieves a significant increase (e.g., from 0.3973 to 0.6028 on COCO) over baselines.
  • Subjective Fidelity Score (User Study): Human annotators confirm higher rates of correct entity-attribute generation.
  • FID: Image quality is maintained or slightly improved compared to SD, indicating that compositional control does not degrade visual fidelity.
  • CLIP-based metrics: While included for completeness, these are shown to be less sensitive to attribute binding errors.
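
The minimum object score reduces to a few lines once a detector has scored each entity phrase: the image is scored by its worst-detected entity, so a single missing entity caps the whole score. The detector interface below is a stand-in for Grounding DINO, and the phrases are hypothetical.

```python
def minimum_object_score(detections):
    """Minimum object score over the entities of one prompt.

    detections maps each entity phrase to the list of confidences
    the detector produced for boxes matched to that phrase. An entity
    with no detections contributes 0, so one missing entity drags the
    image's score to 0 regardless of the others.
    """
    return min((max(scores) if scores else 0.0)
               for scores in detections.values())

# Hypothetical detector output for "a red apple next to a green cup":
score = minimum_object_score({
    "a red apple": [0.91, 0.40],
    "a green cup": [0.73],
})  # -> 0.73, the weaker of the two best detections
```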

Ablation studies confirm that both unique mask assignment and self-attention control are critical for optimal performance. The method generalizes well to open-domain entities not seen during BoxNet training.

Practical Implications

The approach provides a practical, modular solution for improving compositional accuracy in T2I diffusion models. It does not require user-supplied layouts or bounding boxes, instead inferring spatial constraints from text alone. The plugin design allows for straightforward integration into existing pipelines and can be combined with other control methods. The method is particularly valuable for applications requiring precise multi-object generation, such as data augmentation, content creation, and visual grounding tasks.

(Figure 2)

Figure 2: The user interface of the image annotation tool used for user studies, enabling efficient evaluation of entity-attribute fidelity.

Limitations and Future Directions

While the method substantially improves compositional fidelity, some limitations remain. In rare cases, image quality may degrade due to aggressive spatial masking, especially in highly cluttered scenes. The reliance on accurate text parsing and box prediction introduces potential failure points. Future work could explore joint training of BoxNet and the diffusion model, adaptive mask softening, and integration with more advanced language understanding modules to handle complex relational prompts.

Conclusion

This work presents a principled approach to compositional T2I synthesis by introducing explicit spatial constraints via attention mask control, guided by dynamically predicted object boxes. The method effectively mitigates attribute leakage, entity leakage, and missing entities, and is readily deployable as a plugin for existing diffusion-based T2I generators. The results establish a new standard for compositional fidelity in text-to-image synthesis and open avenues for further research in controllable generative modeling.
