Fast-Forward Caching (FORA) for Diffusion Models
- Fast-Forward Caching (FORA) is a training-free, plug-and-play technique for diffusion transformers that caches intermediate outputs to eliminate redundant computations.
- It exploits high feature similarity across reverse diffusion steps to achieve up to 80% FLOP reduction while managing quality trade-offs.
- The method integrates seamlessly with models like DiT and PixArt-α, enabling adjustable speed–quality balances through a configurable caching interval.
Fast-Forward Caching (FORA) is a training-free, plug-and-play acceleration technique specifically designed for transformer-based diffusion models (notably, Diffusion Transformers or DiTs). FORA’s core innovation is the exploitation of high feature redundancy between successive time steps in the reverse diffusion process, achieved by caching and reusing the outputs of computationally expensive self-attention and MLP layers across denoising steps. This mechanism yields substantial reductions in inference time and computational load, with negligible impact on sample quality under appropriately chosen caching intervals. FORA is compatible with widely used architectures such as DiT and PixArt-α and integrates seamlessly with typical diffusion samplers, requiring no retraining or model modification (Selvaraju et al., 2024).
1. Caching Mechanism and Theoretical Basis
FORA leverages the empirical observation that intermediate features within attention and MLP layers, specifically the tensors $F_{\text{attn}}^{(k)}(t)$ and $F_{\text{mlp}}^{(k)}(t)$ at time step $t$ and layer $k$, exhibit high similarity across adjacent reverse diffusion steps. Formally, let $T$ denote the total number of denoising steps, $L$ the number of transformer blocks, $n$ the sequence length, and $d$ the embedding dimension.
For each layer $k = 1, \dots, L$, at each time step $t$, the computations are:
- $F_{\text{attn}}^{(k)}(t) = \mathrm{Attn}_k\left(h^{(k-1)}(t)\right)$,
- $F_{\text{mlp}}^{(k)}(t) = \mathrm{MLP}_k\left(F_{\text{attn}}^{(k)}(t)\right)$,

where $h^{(k-1)}(t)$ denotes the input to block $k$.
FORA introduces layer-wise caches:
- cache[Attn][$k$] $\leftarrow F_{\text{attn}}^{(k)}(t)$
- cache[MLP][$k$] $\leftarrow F_{\text{mlp}}^{(k)}(t)$
With caching interval $N$, the forward pass is computed in full at steps where $t \bmod N = 0$ (and, in any case, at the initial step $t = T$, so that the caches are populated). During the intervening $N - 1$ steps, the attention and MLP modules are skipped and their cached outputs are reused. This eliminates a substantial fraction of redundant operations, particularly the FLOPs spent in attention/MLP layers.
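This static schedule can be written down directly. The sketch below (with hypothetical helper names, not the authors' code) enumerates which steps run a full forward pass and confirms that roughly a $1/N$ fraction of steps do full compute:

```python
def is_full_step(t: int, T: int, N: int) -> bool:
    """True if step t (counting down from T) recomputes attention/MLP.

    A full pass runs at the first step t = T (to populate the caches)
    and at every step where t % N == 0; all other steps reuse caches.
    """
    return t == T or t % N == 0


def full_pass_fraction(T: int, N: int) -> float:
    """Fraction of the T steps that run a full forward pass (about 1/N)."""
    return sum(is_full_step(t, T, N) for t in range(T, 0, -1)) / T
```

For example, `full_pass_fraction(250, 5)` evaluates to 0.2: only 50 of the 250 steps execute the attention/MLP modules.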
2. Algorithmic Implementation
The integration of FORA into the standard diffusion inference loop operates as follows:
```
Input:  pretrained DiT (L blocks), scheduler, total steps T, interval N
Initialize: per-layer caches cache_attn[1..L], cache_mlp[1..L]
x_T ← sample noise
for t = T, T-1, ..., 1:
    if t == T or t % N == 0:        # full forward pass
        h = x_t
        for k = 1 to L:
            cache_attn[k] = Attn_k(h)
            cache_mlp[k]  = MLP_k(cache_attn[k])
            h = cache_mlp[k]
    else:                           # skip step: reuse cached features
        h = cache_mlp[L]            # final-layer cached output
    x_{t-1} = scheduler.step(h, x_t, t)
return x_0
```
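To make the control flow concrete, here is a runnable toy version of the loop, with stand-in blocks that count their invocations; all class and function names here are illustrative, not taken from the FORA implementation:

```python
class ToyBlock:
    """Stand-in for one transformer sub-module (an Attn_k or MLP_k)."""

    def __init__(self, scale):
        self.scale, self.calls = scale, 0

    def __call__(self, h):
        self.calls += 1
        return [self.scale * x for x in h]


def fora_sample(T, N, L, dim=4):
    """Toy FORA loop: returns the final sample and total sub-module calls."""
    attn = [ToyBlock(1.01) for _ in range(L)]
    mlp = [ToyBlock(0.99) for _ in range(L)]
    cache_attn, cache_mlp = [None] * L, [None] * L
    x = [1.0] * dim  # stand-in for the initial noise x_T
    for t in range(T, 0, -1):
        if t == T or t % N == 0:  # full pass: refresh every layer's cache
            h = x
            for k in range(L):
                cache_attn[k] = attn[k](h)
                cache_mlp[k] = mlp[k](cache_attn[k])
                h = cache_mlp[k]
        else:  # skipped step: reuse the final layer's cached output
            h = cache_mlp[L - 1]
        x = [0.5 * (a + b) for a, b in zip(x, h)]  # stand-in scheduler.step
    total_calls = sum(b.calls for b in attn + mlp)
    return x, total_calls
```

With `T = 20`, `N = 4`, `L = 3`, only 5 of the 20 steps execute the blocks, so 30 sub-module calls replace the baseline's 120.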
Computationally, the total cost of the $L$ transformer blocks over a full trajectory without FORA is

$$C_{\text{baseline}} = T \cdot L \cdot \left( C_{\text{attn}} + C_{\text{mlp}} \right), \qquad C_{\text{attn}} + C_{\text{mlp}} = O(n^2 d + n d^2).$$

With FORA (static caching every $N$ steps), these modules execute on only about $T/N$ of the steps:

$$C_{\text{FORA}} \approx \frac{T}{N} \cdot L \cdot \left( C_{\text{attn}} + C_{\text{mlp}} \right).$$

The resulting reduction in FLOPs is approximately $1 - 1/N$. For $N = 5$, this yields up to 80% computational savings per full sampling trajectory.
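The savings estimate can be checked numerically. This small sketch (with a hypothetical function name) counts the full-compute steps and recovers the 80% figure for $N = 5$:

```python
def fora_flop_fraction(T: int, N: int) -> float:
    """Fraction of baseline attention/MLP FLOPs FORA still spends,
    assuming a full pass at t = T and at every step with t % N == 0."""
    full_steps = sum(1 for t in range(T, 0, -1) if t == T or t % N == 0)
    return full_steps / T


# For 250 steps and N = 5, 50 of 250 steps do full compute,
# so the savings are 1 - 0.2 = 0.80, i.e. 80%.
savings = 1 - fora_flop_fraction(250, 5)
```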
3. Integration with Transformer-Based Diffusion Models
FORA operates entirely at inference-time, requiring no modification of model weights or structure. As a plug-in wrapper, it is compatible with standard transformer-based diffusion models including DiT and PixArt-α, and functions with prevalent samplers such as DDIM, DPM-Solver, and IDDPM. The method enables acceleration across both class-conditional and text-conditional generative tasks.
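The plug-in character of the method can be sketched as a thin wrapper around existing sub-modules, leaving their weights and code untouched. `CachedModule` and `set_step` below are hypothetical names illustrating the idea, not an existing library API:

```python
class CachedModule:
    """Wraps a callable sub-module; replays its last output on skip steps."""

    def __init__(self, module):
        self.module = module
        self.cached = None
        self.compute = True  # toggled by the sampling loop

    def __call__(self, *args):
        # Recompute (and refresh the cache) on full steps, or whenever
        # the cache is still empty; otherwise replay the cached output.
        if self.compute or self.cached is None:
            self.cached = self.module(*args)
        return self.cached


def set_step(wrapped_modules, t, T, N):
    """Tell every wrapper whether step t is a full-compute step."""
    full = (t == T) or (t % N == 0)
    for m in wrapped_modules:
        m.compute = full
```

A sampler would wrap each block's attention and MLP modules once, then call `set_step(...)` at the top of every denoising iteration; the model itself never changes.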
The memory overhead introduced by FORA is determined by two cached tensors of shape $n \times d$ per transformer layer, resulting in

$$M_{\text{cache}} = 2 \cdot L \cdot n \cdot d \cdot b \ \text{bytes},$$

where $b$ is the number of bytes per element. In a DiT-XL/2 configuration ($L = 28$ blocks, hidden dimension $d = 1152$), this amounts to approximately 56 MB, or about 10% of a 600 MB model. Using half precision reduces this further.
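The overhead formula is easy to evaluate. The values below (28 blocks, hidden size 1152, 256 tokens, fp32) are illustrative assumptions; the exact figure depends on resolution and precision, so treat the result as an estimate rather than the reported number:

```python
def fora_cache_bytes(L: int, n: int, d: int, bytes_per_elem: int = 4) -> int:
    """Two cached (n x d) tensors (attention + MLP) per transformer block."""
    return 2 * L * n * d * bytes_per_elem


# Illustrative DiT-XL/2-like configuration: 28 blocks, d = 1152,
# n = 256 tokens (256x256 resolution, patch size 2), single precision.
cache_mib = fora_cache_bytes(28, 256, 1152) / 2**20  # cache size in MiB
```

Passing `bytes_per_elem=2` models half precision and halves the footprint.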
4. Empirical Performance and Trade-Offs
FORA’s efficacy is characterized by the speed–quality trade-off governed by the interval $N$. On ImageNet with DiT-XL/2-G and 250 DDIM steps, results are as follows:
| Setting | Speedup | FID | IS |
|---|---|---|---|
| Baseline | 1× | 2.27 | 278.24 |
| FORA | 2.08× | 2.40 | 269.74 |
| FORA | 2.80× | 2.82 | 253.96 |
| FORA | 4.57× | 4.97 | 222.97 |
| FORA | 5.73× | 9.80 | 180.54 |
For text-conditional MS-COCO (PixArt-α, guidance 4.5):
| Sampler | Steps | Baseline FID | FORA Speedup | FORA FID |
|---|---|---|---|---|
| DPM-Solver | 20 | 28.15 | 1.54× | 38.20 |
| IDDPM | 100 | 43.54 | 1.86× | 55.30 |
Larger $N$ increases acceleration but results in progressive degradation of FID and visual detail. Empirically, for ImageNet, a moderate interval achieves the best balance (2.80× speed-up, FID = 2.82, IS = 253.96).
5. Computational and Memory Analysis
The computational complexity of the attention and MLP modules per block per step is $O(n^2 d + n d^2)$. Static caching every $N$ steps reduces redundant FLOPs by approximately $1 - 1/N$, while the associated memory overhead remains a relatively minor fraction of typical modern model sizes. For example, the caching overhead for DiT-XL/2 is approximately 56 MB in single precision. When deployed in half precision, the auxiliary memory footprint is halved, making the technique suitable for both large-scale servers and on-device applications.
6. Practical Considerations and Application Settings
FORA is adaptable to user requirements via selection of the interval $N$. This permits explicit control over the speed–quality trade-off, enabling real-time or on-device diffusion inference under tight resource constraints. The approach applies broadly to both class-conditional and text-conditional generation tasks, and is compatible with typical sampling strategies in diffusion probabilistic models. FORA’s method, reusing cached outputs without model retraining, broadens its applicability to pre-trained diffusion transformers across various domains (Selvaraju et al., 2024).
7. Summary of Contributions
FORA’s main technical contributions are as follows: (1) introduction of static layer-wise caching of attention and MLP outputs within the inference trajectory of transformer-based diffusion models; (2) compatibility as a training-free, easily integrated plug-in for mainstream DiT and PixArt-α architectures; (3) quantification of computational savings, up to 5.73× speed-up with only modest impacts on FID and IS metrics; and (4) maintenance of manageable memory overhead that is small compared to overall model size. Empirical findings suggest that moderate cache intervals (up to $N = 4$) suffice for substantial acceleration with high-fidelity output, making FORA a practical solution for the efficient deployment of high-performance diffusion transformers.