
One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models

Published 28 Oct 2024 in cs.LG, cs.AI, and cs.CV (arXiv:2410.22366v4)

Abstract: For LLMs, sparse autoencoders (SAEs) have been shown to decompose intermediate representations, which are often not directly interpretable, into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn interpretable features for SDXL Turbo, a few-step text-to-image diffusion model. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net in its 1-step setting. Interestingly, we find that they generalize to 4-step SDXL Turbo and even to the multi-step SDXL base model (i.e., a different model) without additional training. In addition, we show that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. We do so by creating RIEBench, a representation-based image editing benchmark, for editing images while they are generated by turning on and off individual SAE features. This allows us to track which transformer blocks' features are the most impactful depending on the edit category. Our work is the first investigation of SAEs for interpretability in text-to-image diffusion models and our results establish SAEs as a promising approach for understanding and manipulating the internal mechanisms of text-to-image models.


Summary

  • The paper employs Sparse Autoencoders to analyze SDXL Turbo, unveiling how distinct transformer blocks specialize in image composition, detail enhancement, and style infusion.
  • The study rigorously combines qualitative and quantitative methods with over 1.5 million LAION-COCO prompts to validate the interpretability of the model's internal features.
  • The paper’s methodology offers a practical framework for refining text-to-image models and advancing mechanistic interpretability in diffusion-based generative systems.

Interpreting SDXL Turbo Using Sparse Autoencoders: Insights on Text-to-Image Models

The paper presents an innovative study on understanding the intermediate representations of modern text-to-image generative models, specifically focusing on SDXL Turbo, a recent open-source few-step text-to-image diffusion model. This research employs Sparse Autoencoders (SAEs) to gain insight into the operations of SDXL Turbo's denoising U-net, with a particular emphasis on interpreting feature learning within the model's transformer blocks.

Methodology and Analysis

To explore whether SAEs can elucidate the computation performed during SDXL Turbo's generation process, the study trains SAEs on transformer block updates within the model. Using the SDLens library, the authors cache and manipulate SDXL Turbo’s intermediate results, creating a dataset with over 1.5 million prompts from the LAION-COCO dataset. Each transformer block's dense feature maps are collected and used to train multiple SAEs. The paper reports a detailed analysis of the learned features, employing both qualitative and quantitative methods.
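The paper's exact SAE architecture and hyperparameters are not reproduced here, but the objective described above can be illustrated with a minimal NumPy sketch of a standard ReLU sparse autoencoder: block updates are reconstructed from non-negative feature activations, with an L1 penalty encouraging sparsity. All dimensions, names, and (random) weights below are hypothetical placeholders, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: d_model is the transformer block's channel width,
# d_sae is the size of the overcomplete SAE feature dictionary.
d_model, d_sae = 64, 256

# Randomly initialised SAE parameters (in practice these are trained).
W_enc = rng.normal(scale=0.02, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.02, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Decompose block updates x into sparse feature activations and a reconstruction."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps activations non-negative
    x_hat = f @ W_dec + b_dec               # reconstruction as a sum of dictionary features
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Training objective: reconstruction error plus an L1 sparsity penalty."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.mean(np.abs(f))
    return recon + sparsity

# A fake batch of "transformer block updates" (spatial positions flattened).
x = rng.normal(size=(8, d_model))
print(sae_loss(x))
```

Training then minimises this loss over the cached feature maps, so that each update is explained by a small number of active dictionary features.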

The empirical analysis demonstrates that SAEs can learn interpretable features within diffusion-based text-to-image models. Visualization techniques are developed to showcase the interpretability and causal effects of the SAE-learned features across various transformer blocks. Notably, different blocks in the SDXL Turbo pipeline specialize in distinct aspects of image generation: image composition, local detail addition, and style, among others.
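The causal interventions described above (and used by RIEBench) amount to editing one feature of the sparse decomposition during generation: encode a block's update, rescale a single feature activation, and decode the edited reconstruction in place of the original update. The sketch below is illustrative only; names, dimensions, and the random weights are assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 256

# Illustrative, randomly initialised SAE weights (in practice, trained weights).
W_enc = rng.normal(scale=0.02, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.02, size=(d_sae, d_model))

def edit_update(x, feature_idx, scale):
    """Intervene on one SAE feature of a transformer block's update.

    scale = 0 turns the feature off; scale > 1 amplifies it. The edited
    reconstruction would be substituted for the original update while the
    image is being generated.
    """
    f = np.maximum(x @ W_enc, 0.0)  # sparse feature activations
    f[:, feature_idx] *= scale      # turn the chosen feature off or up
    return f @ W_dec                # decode back to the update space

x = rng.normal(size=(16, d_model))  # fake block update (positions x channels)
off = edit_update(x, feature_idx=3, scale=0.0)
boosted = edit_update(x, feature_idx=3, scale=4.0)
print(off.shape, boosted.shape)
```

Running such edits per transformer block, and per edit category, is what lets the benchmark measure which blocks' features matter most for a given kind of edit.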

Quantitative Validation

Quantitative experiments confirm the qualitative findings on a larger dataset, demonstrating the robustness of the hypotheses. An automatic feature-annotation pipeline was developed for the transformer block deemed responsible for image composition. This approach highlights the efficacy of SAEs as a tool for understanding the computational intricacies of SDXL Turbo's forward pass.

Theoretical and Practical Implications

From a theoretical standpoint, this work advances the field of mechanistic interpretability by exploring the less-explored domain of diffusion models. The successful application of SAEs, originally a tool developed for LLMs to decompose internal representations into interpretable features, to image generation models marks an essential step forward. Practically, the insights gained from this research can aid in refining text-to-image pipelines for various applications, potentially enhancing the precision and control over the generated images.

Future Directions

The open-sourcing of both the SAEs and the SDLens library provides a solid foundation for further research. Future studies might benefit from exploring deeper interactions between features within and across blocks, as well as leveraging advanced interpretability techniques, such as circuit discovery, to unravel the higher-order relations within the computational process.

Complex visual features that only take effect in particular contexts remain difficult for contemporary vision-language models to annotate adequately. Research aimed at improving annotation techniques, possibly through broader exploration of the prompt and feature spaces, could therefore further improve our ability to understand and control text-to-image generative models.

In conclusion, the paper makes significant strides toward demystifying the operation of text-to-image models, employing SAEs to extract interpretable and causally relevant features, thereby offering the research community a pathway to deeper understanding and innovation in AI-driven generative technologies.
