FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

Published 17 May 2023 in cs.CV (arXiv:2305.10431v2)

Abstract: Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend features among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing the attention of reference subjects to be localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300$\times$-2500$\times$ speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer.

Citations (180)

Summary

  • The paper introduces a tuning-free model using a vision encoder to extract subject embeddings, eliminating the need for fine-tuning in personalized image generation.
  • It employs localized cross-attention supervision to mitigate identity blending, ensuring distinct and accurate representations for each subject.
  • The method achieves a 300×–2500× speedup over fine-tuning approaches, offering efficient and scalable image synthesis for diverse applications.

FastComposer: Tuning-Free Multi-Subject Image Generation

The paper "FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention" addresses significant challenges in the domain of text-to-image generation, particularly those associated with personalized, subject-driven outputs. The authors propose a novel method, FastComposer, that eliminates the need for subject-specific fine-tuning, offering a solution for efficient multi-subject image generation without compromising identity preservation or image quality.

Overview of FastComposer

FastComposer introduces a tuning-free approach to subject-driven image generation by employing a vision encoder to extract subject embeddings from reference images. These embeddings augment the text conditioning in diffusion models, facilitating efficient image generation through simple forward passes. This method bypasses the computational intensity of fine-tuning, making it suitable for deployment across a range of platforms, including edge devices.
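The augmentation step can be sketched as follows. This is a toy illustration, not the paper's implementation: the embedding dimension, the `augment_prompt` helper, and the random MLP weights are all placeholders, and real models would use CLIP features and learned parameters at much larger sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dim; real encoders use e.g. 768 or 1024

def mlp(x, W1, b1, W2, b2):
    # Two-layer MLP that fuses a word embedding with an image embedding.
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU
    return h @ W2 + b2

# Random stand-in weights for the fusion MLP (input: word_emb ++ image_emb).
W1 = rng.standard_normal((2 * D, D)) * 0.1
b1 = np.zeros(D)
W2 = rng.standard_normal((D, D)) * 0.1
b2 = np.zeros(D)

def augment_prompt(token_embs, subject_token_ids, image_embs):
    """Replace the embeddings of subject tokens (e.g. 'man', 'woman') with a
    fused word+image embedding; all other prompt tokens are left untouched."""
    out = token_embs.copy()
    for tok_idx, img_emb in zip(subject_token_ids, image_embs):
        fused_in = np.concatenate([token_embs[tok_idx], img_emb])
        out[tok_idx] = mlp(fused_in, W1, b1, W2, b2)
    return out

# A 6-token prompt with subject tokens at positions 1 and 4.
tokens = rng.standard_normal((6, D))
subject_imgs = rng.standard_normal((2, D))  # stand-in for image-encoder features
augmented = augment_prompt(tokens, [1, 4], subject_imgs)

assert augmented.shape == tokens.shape
assert not np.allclose(augmented[1], tokens[1])  # subject token was replaced
assert np.allclose(augmented[0], tokens[0])      # non-subject token unchanged
```

Because only the subject tokens are modified, the rest of the prompt retains its generic text conditioning, which is what lets the same frozen diffusion model handle arbitrary new subjects without per-subject tuning.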

The key innovation within FastComposer is its mechanism to address the persistent issue of identity blending in multi-subject generation. By leveraging cross-attention localization supervision during training, the model ensures that each subject's identity remains distinct even in compositions involving multiple references. Furthermore, delayed subject conditioning in the denoising process maintains a delicate balance between subject identity and image editability.
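The delayed-conditioning schedule described above can be sketched as a simple switch over denoising steps. The function name and the concrete ratio (`alpha = 0.3`) are illustrative assumptions, not values taken from the paper:

```python
def conditioning_for_step(step, total_steps, alpha, text_cond, augmented_cond):
    """Delayed subject conditioning: run the first `alpha` fraction of
    denoising steps with plain text conditioning (to lay out the scene and
    preserve editability), then switch to the subject-augmented conditioning
    (to inject identity)."""
    if step < alpha * total_steps:
        return text_cond
    return augmented_cond

# With 50 denoising steps and alpha = 0.3, the first 15 steps use text-only
# conditioning and the remaining 35 use the subject-augmented conditioning.
schedule = [conditioning_for_step(t, 50, 0.3, "text", "augmented")
            for t in range(50)]
assert schedule[:15] == ["text"] * 15
assert schedule[15:] == ["augmented"] * 35
```

The intuition is that early denoising steps determine global layout while later steps refine appearance, so withholding the subject embedding early keeps the prompt's actions and styles editable without sacrificing identity.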

Methodology and Results

The methodology section highlights the integration of a pre-trained CLIP image encoder with an MLP to augment the text conditioning with extracted visual features. The training process involves a subject-augmented image-text paired dataset, in which noun phrases are aligned with image segments via segmentation and dependency parsing models.

To mitigate identity blending, the authors introduce a cross-attention regularization technique, guiding attention maps to distinct subject regions through segmentation masks. This localization is crucial for preserving identity in multi-subject scenarios, as evidenced by visual and quantitative assessments presented in the paper.
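A minimal sketch of such a localization objective is given below. This is an illustrative balanced-L1-style variant, not necessarily the exact loss used in the paper; the function name, data layout, and toy attention map are all assumptions:

```python
import numpy as np

def localization_loss(attn, masks, subject_tokens):
    """Penalize subject-token attention that falls outside that subject's
    segmentation mask.
    attn: (num_pixels, num_tokens) softmax cross-attention map
    masks: dict mapping token index -> boolean (num_pixels,) mask
    Returns mean over subjects of (attention outside mask - attention inside).
    """
    losses = []
    for tok in subject_tokens:
        m = masks[tok]
        a = attn[:, tok]
        # Attention inside the mask lowers the loss; outside raises it.
        losses.append(a[~m].mean() - a[m].mean())
    return float(np.mean(losses))

# Toy example: 4 pixels, 3 tokens; token 1 is a subject noun.
attn = np.array([[0.2, 0.7, 0.1],
                 [0.3, 0.6, 0.1],
                 [0.8, 0.1, 0.1],
                 [0.7, 0.2, 0.1]])
mask = {1: np.array([True, True, False, False])}  # subject occupies pixels 0-1

loss = localization_loss(attn, mask, subject_tokens=[1])
assert loss < 0  # attention already concentrates in the correct region
```

Minimizing this quantity during training pushes each subject token's attention into its own segmented region, which is the mechanism credited with preventing one subject's features from leaking into another's.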

FastComposer demonstrates a remarkable speedup of 300×–2500× over fine-tuning-based methods and requires no additional storage for new subjects. Its performance in terms of identity preservation and prompt consistency outstrips existing methods, as evidenced by evaluations involving various subject and prompt combinations.

Implications and Future Directions

The implications of FastComposer extend both practically and theoretically. Practically, the method's efficiency and scalability enable widespread adoption in applications requiring personalized content creation, such as digital art or personalized marketing. Theoretically, it offers a new perspective on tuning-free methodologies in AI, which can inspire further research in efficient model deployment strategies.

Looking forward, expanding the method's applicability to non-human subjects and integrating it with more diverse datasets could enhance its versatility. Advances in content moderation and ethical guidelines will be crucial to counter potential misuse, such as fabricating deepfake content, ensuring the responsible use of such generative technologies.

In conclusion, FastComposer introduces a significant advancement in multi-subject image generation by circumventing the limitations of existing fine-tuning-dependent methods, demonstrating both the potential and challenges of AI-driven personalization in creative processes.
