
Fast-Forward Caching (FORA) for Diffusion Models

Updated 27 November 2025
  • Fast-Forward Caching (FORA) is a training-free, plug-and-play technique for diffusion transformers that caches intermediate outputs to eliminate redundant computations.
  • It exploits high feature similarity across reverse diffusion steps to achieve up to 80% FLOP reduction while managing quality trade-offs.
  • The method integrates seamlessly with models like DiT and PixArt-α, enabling adjustable speed–quality balances through a configurable caching interval.

Fast-Forward Caching (FORA) is a training-free, plug-and-play acceleration technique specifically designed for transformer-based diffusion models (notably, Diffusion Transformers or DiTs). FORA’s core innovation is the exploitation of high feature redundancy between successive time steps in the reverse diffusion process, achieved by caching and reusing the outputs of computationally expensive self-attention and MLP layers across denoising steps. This mechanism yields substantial reductions in inference time and computational load, with negligible impact on sample quality under appropriately chosen caching intervals. FORA is compatible with widely used architectures such as DiT and PixArt-α and integrates seamlessly with typical diffusion samplers, requiring no retraining or model modification (Selvaraju et al., 2024).

1. Caching Mechanism and Theoretical Basis

FORA leverages the empirical observation that intermediate features within attention and MLP layers—specifically, the tensors $\mathbf{h}^{(k)}_{t,\mathrm{attn}}$ and $\mathbf{h}^{(k)}_{t,\mathrm{mlp}}$ at time step $t$ and layer $k$—exhibit high similarity across adjacent reverse diffusion steps. Formally, let $T$ denote the total number of denoising steps, $L$ the number of transformer blocks, $S$ the sequence length, and $d$ the embedding dimension.

For each layer $k$, at each time step $t$, the computations are:

  • $\mathbf{h}^{(k)}_{t,\mathrm{attn}} = \mathrm{Attn}_k(\mathbf{h}^{(k-1)}_t)$,
  • $\mathbf{h}^{(k)}_{t,\mathrm{mlp}} = \mathrm{MLP}_k(\mathbf{h}^{(k)}_{t,\mathrm{attn}})$.

FORA introduces layer-wise caches:

  • cache[Attn][$k$] $\leftarrow \mathbf{h}^{(k)}_{t,\mathrm{attn}}$
  • cache[MLP][$k$] $\leftarrow \mathbf{h}^{(k)}_{t,\mathrm{mlp}}$

With caching interval $N$, the forward pass is computed in full at steps where $t \bmod N = 0$ (including $t = T$). During the $N-1$ subsequent steps, the attention and MLP modules are skipped and their cached outputs are reused. This eliminates a substantial share of redundant operations, in particular the $\mathcal{O}(S d^2)$ FLOPs of the attention and MLP layers.
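The interval rule above can be made concrete with a small helper that labels each reverse step as a full pass or a cache reuse; a minimal sketch (the `is_full_step` and `fora_schedule` names are illustrative, not from the paper):

```python
def is_full_step(t: int, T: int, N: int) -> bool:
    """Full forward pass at t = T and whenever t is a multiple of N;
    every other step reuses the cached attention/MLP outputs."""
    return t == T or t % N == 0

def fora_schedule(T: int, N: int) -> list:
    # Reverse diffusion runs t = T, T-1, ..., 1.
    return ["full" if is_full_step(t, T, N) else "cached"
            for t in range(T, 0, -1)]

print(fora_schedule(10, 5))
# -> ['full', 'cached', 'cached', 'cached', 'cached',
#     'full', 'cached', 'cached', 'cached', 'cached']
```

For $T$ divisible by $N$, exactly $T/N$ steps are computed in full, matching the cost analysis in Section 2.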

2. Algorithmic Implementation

The integration of FORA into the standard diffusion inference loop operates as follows:

Input: pretrained DiT (L blocks), scheduler, total steps T, interval N
Initialize: empty caches cache_attn[1..L], cache_mlp[1..L]

x_T <- sample noise
for t = T, T-1, ..., 1:
    if t == T or t % N == 0:        # full forward pass; refresh caches
        h = x_t
        for k = 1 to L:
            h_attn = Attn_k(h)
            cache_attn[k] = h_attn
            h_mlp  = MLP_k(h_attn)
            cache_mlp[k] = h_mlp
            h = h_mlp
    else:                           # skip Attn/MLP; reuse cached outputs
        h = cache_mlp[L]
    x_{t-1} = scheduler.step(h, x_t, t)
return x_0
(Selvaraju et al., 2024)
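The loop above can be exercised end-to-end with stand-in layers. In this toy NumPy sketch, fixed random linear maps play the role of $\mathrm{Attn}_k$ and $\mathrm{MLP}_k$, and `scheduler_step` is a placeholder update; none of these stand-ins reflect a real DiT or sampler, only the caching control flow:

```python
import numpy as np

rng = np.random.default_rng(0)
L, S, d, T, N = 4, 8, 16, 10, 5   # toy sizes, not real DiT dimensions

# Stand-in "layers": random linear maps in place of attention and MLP.
attn_w = [rng.standard_normal((d, d)) * 0.1 for _ in range(L)]
mlp_w  = [rng.standard_normal((d, d)) * 0.1 for _ in range(L)]

def scheduler_step(model_out, x_t, t):
    # Placeholder for a real sampler update (e.g. DDIM); a toy blend here.
    return x_t - model_out / T

cache_attn = [None] * L
cache_mlp  = [None] * L
full_passes = 0

x = rng.standard_normal((S, d))          # x_T: starting noise
for t in range(T, 0, -1):
    if t == T or t % N == 0:             # full forward pass; refresh caches
        full_passes += 1
        h = x
        for k in range(L):
            h_attn = h @ attn_w[k]
            cache_attn[k] = h_attn
            h_mlp = h_attn @ mlp_w[k]
            cache_mlp[k] = h_mlp
            h = h_mlp
    else:                                # skip Attn/MLP; reuse cached output
        h = cache_mlp[L - 1]
    x = scheduler_step(h, x, t)

print(full_passes)  # -> 2 full passes out of T = 10 steps
```

With $T = 10$ and $N = 5$, only steps $t = 10$ and $t = 5$ run the transformer blocks; the other eight steps replay the cached final-layer output.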

Computationally, the total attention/MLP cost over a full sampling trajectory without FORA is approximately:

$C_{\mathrm{orig}} = T \times L \times 2 S d^2$

With FORA (static caching every $N$ steps):

$C_{\mathrm{FORA}} = \frac{T}{N} \times L \times 2 S d^2 + \left(T - \frac{T}{N}\right) \times L \times \mathcal{O}(S d)$

The resulting reduction in FLOPs is approximately $1 - \frac{1}{N}$. For $N = 5$, this yields up to 80% computational savings per full sampling trajectory.
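Plugging DiT-XL/2-scale numbers into these cost formulas (and treating the $\mathcal{O}(Sd)$ cache-reuse term as negligible) reproduces the $1 - 1/N$ savings; a quick check:

```python
def fora_flops(T: int, L: int, S: int, d: int, N: int):
    """Approximate attention/MLP FLOPs per sampling trajectory, without
    and with FORA static caching (cache-reuse cost ignored)."""
    full = T * L * 2 * S * d ** 2            # every step computed in full
    cached = (T // N) * L * 2 * S * d ** 2   # only one step in N computed
    return full, cached

full, cached = fora_flops(T=250, L=28, S=256, d=1024, N=5)
savings = 1 - cached / full
print(f"FLOP reduction: {savings:.0%}")  # -> FLOP reduction: 80%
```

The savings depend only on $N$ (when $N$ divides $T$), which is why the trade-off in Section 4 is controlled by this single knob.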

3. Integration with Transformer-Based Diffusion Models

FORA operates entirely at inference time, requiring no modification of model weights or structure. As a plug-in wrapper, it is compatible with standard transformer-based diffusion models including DiT and PixArt-α, and functions with prevalent samplers such as DDIM, DPM-Solver, and IDDPM. The method enables acceleration across both class-conditional and text-conditional generative tasks.
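The plug-in wrapper pattern can be sketched generically; the `CachedBlock` class below is hypothetical and not tied to any real DiT or diffusers API—`block` is any callable mapping features to features:

```python
class CachedBlock:
    """Hypothetical inference-time wrapper: computes and caches a block's
    output on full steps, and replays the cache on skipped steps."""
    def __init__(self, block):
        self.block = block
        self.cached = None

    def __call__(self, h, full_step: bool):
        if full_step or self.cached is None:
            self.cached = self.block(h)   # compute and refresh the cache
        return self.cached                # reused verbatim on skipped steps

double = CachedBlock(lambda h: h * 2)
print(double(3, full_step=True))     # -> 6 (computed)
print(double(100, full_step=False))  # -> 6 (cache replayed, input ignored)
```

Because the wrapper only intercepts forward calls, the underlying weights and sampler remain untouched, which is what makes the method training-free.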

The memory overhead introduced by FORA is determined by two cached tensors of shape $[S \times d]$ per transformer layer, resulting in

$M_{\mathrm{cache}} = 2 L S d \times \texttt{sizeof(float)}$

In a DiT-XL/2 configuration ($L = 28$, $S = 256$, $d = 1024$), this amounts to approximately 56 MB, or roughly 10% of a 600 MB model. Half-precision caching halves this overhead.
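The memory formula can be checked directly against the configuration quoted above; a small sketch:

```python
def fora_cache_bytes(L: int, S: int, d: int, bytes_per_elem: int = 4) -> int:
    # Two cached tensors (attention and MLP output) of shape [S, d]
    # per transformer layer: M_cache = 2 * L * S * d * sizeof(float).
    return 2 * L * S * d * bytes_per_elem

mib = fora_cache_bytes(L=28, S=256, d=1024) / 2 ** 20
print(f"{mib:.0f} MiB")  # -> 56 MiB in float32; half that in float16
```

At these sizes the cache is a constant, layer-count-linear overhead, independent of the number of sampling steps $T$.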

4. Empirical Performance and Trade-Offs

FORA’s efficacy is characterized by the speed–quality trade-off governed by the interval $N$. On ImageNet $256^2$ with DiT-XL/2-G and 250 DDIM steps, results are as follows:

Setting       Speedup   FID     IS
Baseline      1.00×     2.27    278.24
FORA $N=2$    2.08×     2.40    269.74
FORA $N=3$    2.80×     2.82    253.96
FORA $N=5$    4.57×     4.97    222.97
FORA $N=7$    5.73×     9.80    180.54

For text-conditional generation on MS-COCO $256^2$ (PixArt-α, guidance scale 4.5):

Sampler      Steps   Baseline FID   FORA $N=2$ Speedup   FORA $N=2$ FID
DPM-Solver   20      28.15          1.54×                38.20
IDDPM        100     43.54          1.86×                55.30

Larger $N$ increases acceleration but progressively degrades FID and visual detail. Empirically, on ImageNet, $N=3$ achieves the best balance ($2.80\times$ speedup, FID = 2.82, IS = 253.96).

5. Computational and Memory Analysis

The computational complexity of the attention and MLP modules per block per step is $\mathcal{O}(S d^2)$. Static caching every $N$ steps reduces redundant FLOPs by $(1 - 1/N)$, while the associated memory overhead remains a minor fraction of typical model sizes. For example, the caching overhead for DiT-XL/2 is approximately 56 MB in single precision; deploying in half precision halves this auxiliary footprint, making the technique suitable for both large-scale servers and on-device applications.

6. Practical Considerations and Application Settings

FORA is adaptable to user requirements via selection of the interval NN. This permits explicit control over the speed–quality trade-off, enabling real-time or on-device diffusion inference under tight resource constraints. The approach applies broadly to both class-conditional and text-conditional generation tasks, and is compatible with typical sampling strategies in diffusion probabilistic models. FORA’s method—reusing cached outputs without model retraining—broadens its applicability to pre-trained diffusion transformers across various domains (Selvaraju et al., 2024).

7. Summary of Contributions

FORA’s main technical contributions are as follows: (1) static layer-wise caching of attention and MLP outputs within the inference trajectory of transformer-based diffusion models; (2) a training-free, easily integrated plug-in compatible with mainstream DiT and PixArt-α architectures; (3) quantified computational savings of up to $5\times$ speedup with only modest impact on FID and IS metrics; and (4) a memory overhead that remains small relative to overall model size. Empirical findings suggest that moderate cache intervals ($N = 3$ or $4$) suffice for substantial acceleration with high-fidelity output, making FORA a practical solution for the efficient deployment of high-performance diffusion transformers.
