
BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

Published 24 May 2023 in cs.CV and cs.AI | arXiv:2305.14720v2

Abstract: Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Code and models will be released at https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion. Project page at https://dxli94.github.io/BLIP-Diffusion-website/.

Summary

  • The paper presents BLIP-Diffusion, which leverages a pre-trained multimodal encoder for efficient, subject-specific text-to-image generation and editing.
  • It employs a two-stage pre-training strategy that aligns image features with textual prompts, providing up to a 20x fine-tuning speedup over earlier methods.
  • The method integrates with frameworks like ControlNet to improve subject fidelity and controllability in generated images for advanced editing applications.

BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

BLIP-Diffusion introduces a new approach to text-to-image generation by providing subject-driven generation capabilities through a pre-trained multimodal encoder. Building on the architecture of BLIP-2 and a latent diffusion model, it supports zero-shot generation and efficient few-step fine-tuning. This overview examines the methodology and implications of BLIP-Diffusion, focusing on its application to controllable generation and editing.

Introduction to BLIP-Diffusion

The primary distinction of BLIP-Diffusion lies in its subject-driven approach, enabled by a robust multimodal encoder. Traditional methods such as DreamBooth and Textual Inversion depend on lengthy per-subject fine-tuning, which limits scalability and efficiency. BLIP-Diffusion addresses these constraints with a pre-trained subject representation that aligns visual and textual inputs (Figure 1).

Figure 1: Leveraging the pre-trained subject representation, BLIP-Diffusion enables subject-driven generation under efficient fine-tuning or zero-shot setups.

Unlike standard text-to-image models, it accepts both a subject image and a text prompt as control inputs, which helps preserve subject fidelity when generating novel renditions. This is achieved through a two-stage pre-training strategy that first aligns image features with text and then teaches the diffusion model to generate new subject-specific images.

Methodology

Pre-Training Strategy

BLIP-Diffusion's pre-training involves two critical stages: multimodal representation learning and subject representation learning.

  1. Multimodal Representation Learning: This stage aligns image features with textual prompts using BLIP-2's vision-language encoder. The resulting text-aligned image features allow subject details to be embedded in the diffusion framework.
  2. Subject Representation Learning: Here, the diffusion model learns to reconstruct an original target image from an input in which the subject has been composed onto a random background, conditioned on the subject representation and text. This setup encourages the model to separate subject from context, yielding a flexible subject representation for generation (Figure 2; a minimal sketch of how such training pairs could be built follows the figure caption).

    Figure 2: Illustration of the two-stage pre-training for BLIP-Diffusion.
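The snippet below is a minimal sketch of how an (input, target) pair for subject representation learning could be assembled. It assumes a precomputed soft foreground matte `alpha` for the subject (the paper obtains this with a text-prompted segmentation and matting step); the compositing itself is just an alpha blend onto a random background image.

```python
# Minimal sketch: build a training pair for subject representation learning.
# `alpha` is an assumed, precomputed soft foreground matte in [0, 1].
import numpy as np
from PIL import Image

def make_training_pair(original: Image.Image,
                       alpha: np.ndarray,
                       background: Image.Image):
    """Compose the subject onto a random background.

    Returns (input_image, target_image): the diffusion model is trained to
    reconstruct `target_image` (the original) given the composed
    `input_image` plus the subject text as conditioning.
    """
    bg = background.resize(original.size)
    fg = np.asarray(original).astype(np.float32)
    bg = np.asarray(bg).astype(np.float32)
    a = alpha[..., None]                      # broadcast matte over RGB
    composed = a * fg + (1.0 - a) * bg        # alpha-blend subject onto bg
    input_image = Image.fromarray(composed.astype(np.uint8))
    return input_image, original
```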

Multimodal Encoder Integration

BLIP-Diffusion employs BLIP-2's encoder to derive visual features aligned with text prompts, and trains them further through the subject representation learning task on synthesized subject inputs. The resulting subject embeddings are projected and appended to the text prompt embeddings that condition the diffusion model, capturing subject-specific appearance while preserving the diffusion model's generative capabilities. A conceptual sketch of this conditioning follows.
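The PyTorch sketch below illustrates one way this injection could look. The class and names (`SubjectConditioner`, `proj`) are stand-ins rather than the released implementation: a small projection maps the multimodal encoder's subject query embeddings into the text embedding space and appends them to the prompt tokens consumed by the denoiser's cross-attention.

```python
# Conceptual sketch (not the released API): project subject query embeddings
# and append them to the text prompt embeddings used as diffusion conditioning.
import torch
import torch.nn as nn

class SubjectConditioner(nn.Module):
    def __init__(self, subject_dim: int, text_dim: int):
        super().__init__()
        # small MLP projecting subject queries into the text embedding space
        self.proj = nn.Sequential(
            nn.Linear(subject_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, text_embeds: torch.Tensor,
                subject_queries: torch.Tensor) -> torch.Tensor:
        # text_embeds:     (batch, n_text_tokens, text_dim)
        # subject_queries: (batch, n_queries, subject_dim)
        subject_tokens = self.proj(subject_queries)
        # append subject tokens after the prompt tokens
        return torch.cat([text_embeds, subject_tokens], dim=1)
```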

Fine-Tuning and Inference

The model supports both zero-shot generation and efficient fine-tuning, significantly reducing the computational overhead of earlier methods. Fine-tuning caches the subject embedding and updates only the diffusion backbone, focusing compute on refining outputs for the new subject rather than retraining the model from scratch (Figure 3; see the sketch after the caption).

Figure 3: Left: example training image pairs; target images (top) and input images (bottom) with random backgrounds.
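Below is a minimal sketch of this fine-tuning loop under stated assumptions: `multimodal_encoder`, `unet`, and `diffusion_loss` are placeholders for the pre-trained components. The key point is that the subject embedding is computed once and reused, so only the denoising network receives gradients.

```python
# Minimal sketch of subject-specific fine-tuning with a cached embedding.
# `multimodal_encoder`, `unet`, and `diffusion_loss` are placeholders.
import torch

@torch.no_grad()
def cache_subject_embedding(multimodal_encoder, subject_images, subject_word):
    # average the subject queries over a handful of reference images
    embeds = [multimodal_encoder(img, subject_word) for img in subject_images]
    return torch.stack(embeds).mean(dim=0)            # (n_queries, dim)

def finetune_step(unet, optimizer, diffusion_loss, batch,
                  text_embeds, cached_subject_embed):
    # expand the cached subject tokens to the batch and append to the text
    subject_tokens = cached_subject_embed.unsqueeze(0).expand(
        text_embeds.size(0), -1, -1)
    cond = torch.cat([text_embeds, subject_tokens], dim=1)
    loss = diffusion_loss(unet, batch, cond)          # denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```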

Applications

BLIP-Diffusion extends beyond basic image synthesis to applications such as structure-controlled generation and subject-driven image editing. It composes with established techniques such as ControlNet and prompt-to-prompt, adding structure control and subject-specific editing without retraining (Figure 4; a conceptual sketch of the ControlNet combination follows the caption).

Figure 4: Left: BLIP-Diffusion combined with ControlNet for structure- and subject-controllable generation.
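The sketch below shows, at a conceptual level, how a ControlNet-style structure branch could be attached to the BLIP-Diffusion denoiser. `unet` and `controlnet` are placeholders (the keyword names loosely mirror common latent-diffusion conventions, not a guaranteed API): the structure branch sees the noisy latents, the timestep, the same text-plus-subject conditioning, and a structure map such as an edge image, and its per-block residuals are added to the U-Net's features, so neither component needs retraining.

```python
# Conceptual sketch: combine a subject-conditioned denoiser with a
# ControlNet-style structure branch. `unet` and `controlnet` are placeholders.
def denoise_with_structure(unet, controlnet, latents, timestep,
                           cond_embeds, structure_map):
    # residuals aligned with the U-Net's down/mid blocks
    down_res, mid_res = controlnet(latents, timestep,
                                   encoder_hidden_states=cond_embeds,
                                   controlnet_cond=structure_map)
    # the U-Net consumes the residuals alongside its own activations
    return unet(latents, timestep,
                encoder_hidden_states=cond_embeds,
                down_block_additional_residuals=down_res,
                mid_block_additional_residual=mid_res)
```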

Experimental Results

In comparative evaluations, BLIP-Diffusion demonstrates strong subject fidelity and prompt adherence with up to a 20x fine-tuning speedup over benchmarks such as DreamBooth. The architecture handles diverse subjects efficiently and proves particularly effective on general subject categories (Figure 5).

Figure 5: Qualitative results categorized by generative capabilities.

Limitations and Future Directions

Despite its strengths, BLIP-Diffusion inherits some drawbacks from its underlying diffusion mechanisms, including occasional misinterpretation of text prompts and failures on compositional nuances. Advances in diffusion model architectures could address these issues.

Conclusion

BLIP-Diffusion emerges as a versatile model in the landscape of text-to-image generation, combining high-fidelity subject representation with controlled generative capabilities. As foundational diffusion models evolve, BLIP-Diffusion offers a scalable path forward for precision-driven image generation tasks across varied domains.
