PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity

Published 10 Mar 2025 in cs.LG and cs.AI | (2503.07677v2)

Abstract: Diffusion models have shown impressive results in generating high-quality conditional samples using guidance techniques such as Classifier-Free Guidance (CFG). However, existing methods often require additional training or neural function evaluations (NFEs), making them incompatible with guidance-distilled models. Also, they rely on heuristic approaches that need identifying target layers. In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. Specifically, we extrapolate query-key correlations using softmax and its sparse counterpart in the cross-attention layer during inference, without requiring extra training or NFEs. By leveraging the noise robustness of sparse attention, our PLADIS unleashes the latent potential of text-to-image diffusion models, enabling them to excel in areas where they once struggled with newfound effectiveness. It integrates seamlessly with guidance techniques, including guidance-distilled models. Extensive experiments show notable improvements in text alignment and human preference, offering a highly efficient and universally applicable solution. See our project page: https://cubeyoung.github.io/pladis-proejct/


Authors (2)

Summary

The paper introduces PLADIS, an inference-time method that adds sparse attention to the cross-attention layers of pre-trained diffusion models, motivated by the noise robustness of sparse Hopfield Networks.
PLADIS combines dense and sparse attention within a single forward pass, improving FID, CLIPScore, and human preference scores on models such as SDXL without additional training or extra neural function evaluations (NFEs).
The method is compatible with existing guidance strategies and with guidance-distilled models, and it improves both text alignment and image quality, which makes sparse attention a practical option for diffusion model inference.

Overview

PLADIS is an inference-time method designed to augment pre-trained diffusion models (notably U-Net and Transformer backbones) by integrating sparse attention mechanisms into the cross-attention layers. The method capitalizes on the noise robustness of sparse Hopfield Networks (SHN), using the $\alpha$-Entmax function, which recovers softmax as $\alpha \to 1$ and sparsemax at $\alpha = 2$, to generate sparse query-key correlations. PLADIS is designed to be fully compatible with existing guidance strategies such as Classifier-Free Guidance (CFG), Perturbed Attention Guidance (PAG), and Smooth Energy Guidance (SEG), as well as with guidance-distilled models, without incurring additional training or requiring extra neural function evaluations (NFEs).
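To make the sparse mapping concrete, the following is a minimal sketch of $\alpha$-Entmax, which has the form $p_i = [(\alpha - 1) z_i - \tau]_+^{1/(\alpha - 1)}$ with the threshold $\tau$ chosen so that the outputs sum to one. The function name `entmax_bisect`, the bisection solver, and the iteration count are illustrative assumptions, not the paper's implementation; the sketch uses plain Python lists for readability.

```python
def entmax_bisect(scores, alpha=1.5, n_iter=60):
    """alpha-entmax of a list of scores, solved by bisection on the threshold tau.

    Illustrative sketch only: requires alpha > 1 (alpha = 2 gives sparsemax).
    """
    if alpha <= 1.0:
        raise ValueError("this sketch requires alpha > 1")
    scaled = [(alpha - 1.0) * s for s in scores]
    power = 1.0 / (alpha - 1.0)

    def probs(tau):
        # Entries at or below the threshold are clipped to exactly zero.
        return [max(z - tau, 0.0) ** power for z in scaled]

    top = max(scaled)
    lo = top - 1.0                          # here the largest entry alone is 1, so the sum is >= 1
    hi = top - len(scores) ** (1.0 - alpha)  # here every entry is <= 1/d, so the sum is <= 1
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = probs(lo)
    total = sum(p)  # remove the residual bisection error
    return [x / total for x in p]


weights = entmax_bisect([2.0, 1.0, -1.0], alpha=1.5)
print(weights)
```

Unlike softmax, which assigns every key a strictly positive weight, the lowest-scoring entry in this example receives a weight of exactly zero.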

Methodology

PLADIS introduces sparse attention into the cross-attention modules of diffusion models by computing both dense and sparse components during inference. Key technical components include:

Sparse Attention Extrapolation: The method uses a parameterized combination of softmax and its sparse counterpart for cross-attention. A scaling factor, $\lambda$, weights the difference between the sparse and dense computations.
No Additional Training or NFEs: Because PLADIS only modifies the computation inside the standard cross-attention modules, it requires no retraining and no extra NFEs, which makes it applicable to a wide range of diffusion architectures.
Theoretical Underpinnings: Error bounds are provided to justify the noise-resistant characteristics of sparse attention relative to dense attention, resulting in lower retrieval errors, which is particularly significant given the inherent noisy nature of diffusion processes.

The mechanism is seamlessly integrated into the cross-attention layers, ensuring that both the dense and sparse attention computations are incorporated within a single forward pass, thereby preserving inference efficiency.
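The combination described above can be sketched as follows. This is one plausible reading of the description, assuming the combined attention is `dense + lam * (sparse - dense)`; the function names, the bisection-based `entmax`, and the toy inputs are all hypothetical and use plain Python lists rather than tensors.

```python
import math


def softmax(scores):
    """Dense attention weights (numerically stable softmax)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entmax(scores, alpha=1.5, n_iter=60):
    """Sparse attention weights: alpha-entmax (alpha > 1) via bisection."""
    scaled = [(alpha - 1.0) * s for s in scores]
    power = 1.0 / (alpha - 1.0)
    top = max(scaled)
    lo, hi = top - 1.0, top - len(scores) ** (1.0 - alpha)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sum(max(z - mid, 0.0) ** power for z in scaled) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = [max(z - lo, 0.0) ** power for z in scaled]
    total = sum(p)
    return [x / total for x in p]


def pladis_weights(scores, lam=2.0, alpha=1.5):
    """Assumed form: dense + lam * (sparse - dense).

    lam = 0 gives standard attention, lam = 1 gives purely sparse attention.
    """
    dense = softmax(scores)
    sparse = entmax(scores, alpha)
    return [d + lam * (s - d) for d, s in zip(dense, sparse)]


def cross_attention(queries, keys, values, lam=2.0, alpha=1.5):
    """Single-head cross-attention using the combined weights, in one pass."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = pladis_weights(scores, lam, alpha)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs


# Toy example: one image-side query attending over three text-token keys.
queries = [[1.0, 0.0]]
keys = [[2.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(cross_attention(queries, keys, values))
```

Because attention is linear in its weights, combining the weight vectors before multiplying by the values is equivalent to combining the dense and sparse attention outputs. Under this assumed form, the combined weights still sum to one, but for $\lambda > 1$ a key that the sparse branch zeroes out ends up with a negative weight.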

Experimental Setup

The evaluation of PLADIS is conducted using several prominent diffusion backbones and datasets:

Backbones: Primary evaluations are carried out with Stable Diffusion XL (SDXL). Additional experiments involve Stable Diffusion 1.5 (SD1.5) and the SANA model.
Datasets: The method is benchmarked on the MS-COCO validation set, Drawbench, HPD, and Pick-a-Pic.
Metrics: Visual fidelity is quantified using the Frechet Inception Distance (FID); text-image alignment is assessed via CLIPScore; and additional user-centric evaluations include ImageReward, PickScore, and Human Preference Score (HPS v2.1).
Implementation Details: Experiments use a single NVIDIA H100 GPU, with the baseline hyper-parameters set to $\alpha = 1.5$ (the Entmax sparsity parameter) and $\lambda = 2.0$ (the extrapolation scale).
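For reference, the two headline metrics have standard definitions, reproduced here as a reminder. The notation is introduced for this note and is not taken from the paper: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception features for real and generated images, and $E_T$, $E_I$ are the CLIP text and image encoders. The rescaling constant $w$ varies between implementations ($w = 2.5$ in the original CLIPScore definition), and which variant the paper uses is not stated in this summary.

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

$$\mathrm{CLIPScore}(c, v) = w \cdot \max\left(\cos\left(E_T(c), E_I(v)\right),\, 0\right)$$

Lower FID indicates generated images that are statistically closer to real ones, while higher CLIPScore indicates closer agreement between the caption $c$ and the image $v$.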

Results

PLADIS demonstrates consistent improvements across multiple metrics:

FID and CLIPScore: Quantitative evaluations indicate notable improvements in FID and CLIPScore for a variety of guidance techniques. This is particularly evident when PLADIS is applied in conjunction with CFG, PAG, and SEG.
Guidance-Distilled Models: Performance enhancements are observed for guidance-distilled variants such as SDXL-Turbo, SDXL-Lightning, DMD2, and Hyper-SDXL, even when reducing the number of sampling steps to as few as four.
User Studies: Empirical studies report improved text alignment and enhanced human preference scores, affirming PLADIS's efficacy in generating visually coherent and text-faithful images.

Taken together, these results indicate that sparse attention improves both objective metrics and subjective evaluations across the tested backbones, guidance methods, and guidance-distilled models.

Conclusions

PLADIS extends the capabilities of text-to-image diffusion models by integrating sparse attention into their cross-attention modules. Without additional training or extra NFEs at inference, the method improves text alignment and image quality. Its compatibility with existing guidance methods and guidance-distilled models, together with consistent gains in metrics such as FID, CLIPScore, and HPS, highlights the practical benefits of adopting sparse attention in diffusion-based image generation.

In summary, PLADIS leverages the theoretical advantages of sparse attention along with a pragmatic implementation approach to enhance diffusion models at inference time, yielding measurable improvements in both quantitative and qualitative benchmarks.
