Fast-Forward Caching (FORA) for Diffusion Models
- Fast-Forward Caching (FORA) is a training-free, plug-and-play technique for diffusion transformers that caches intermediate outputs to eliminate redundant computations.
- It exploits high feature similarity across reverse diffusion steps to achieve up to 80% FLOP reduction while managing quality trade-offs.
- The method integrates seamlessly with models like DiT and PixArt-α, enabling adjustable speed–quality balances through a configurable caching interval.
Fast-Forward Caching (FORA) is a training-free, plug-and-play acceleration technique specifically designed for transformer-based diffusion models (notably, Diffusion Transformers or DiTs). FORA’s core innovation is the exploitation of high feature redundancy between successive time steps in the reverse diffusion process, achieved by caching and reusing the outputs of computationally expensive self-attention and MLP layers across denoising steps. This mechanism yields substantial reductions in inference time and computational load, with negligible impact on sample quality under appropriately chosen caching intervals. FORA is compatible with widely used architectures such as DiT and PixArt-α and integrates seamlessly with typical diffusion samplers, requiring no retraining or model modification (Selvaraju et al., 2024).
1. Caching Mechanism and Theoretical Basis
FORA leverages the empirical observation that intermediate features within attention and MLP layers, specifically the tensors $F_{\text{attn}}^{(k)}(t)$ and $F_{\text{mlp}}^{(k)}(t)$ at time step $t$ and layer $k$, exhibit high similarity across adjacent reverse diffusion steps. Formally, let $T$ denote the total number of denoising steps, $L$ the number of transformer blocks, $n$ the sequence length, and $d$ the embedding dimension.
For each layer $k = 1, \dots, L$, at each time step $t$, the computations are:
- $F_{\text{attn}}^{(k)}(t) = \mathrm{Attn}_k\left(h^{(k-1)}(t)\right)$,
- $F_{\text{mlp}}^{(k)}(t) = \mathrm{MLP}_k\left(F_{\text{attn}}^{(k)}(t)\right)$,

where $h^{(k-1)}(t)$ denotes the input to block $k$.
FORA introduces layer-wise caches:
- cache[Attn][$k$] $\leftarrow F_{\text{attn}}^{(k)}(t)$
- cache[MLP][$k$] $\leftarrow F_{\text{mlp}}^{(k)}(t)$
With caching interval $N$, the forward pass is computed in full at steps where $t \bmod N = 0$ (and, in any case, at the initial step $t = T$, so that the caches are populated). During the intervening $N - 1$ steps, the attention and MLP modules are skipped and their cached outputs are reused. This eliminates a substantial fraction of redundant operations, particularly the FLOPs spent in attention/MLP layers.
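This static schedule can be written down directly. The sketch below (with hypothetical helper names, not the authors' code) enumerates which steps run a full forward pass and confirms that roughly a $1/N$ fraction of steps do full compute:

```python
def is_full_step(t: int, T: int, N: int) -> bool:
    """True if step t (counting down from T) recomputes attention/MLP.

    A full pass runs at the first step t = T (to populate the caches)
    and at every step where t % N == 0; all other steps reuse caches.
    """
    return t == T or t % N == 0


def full_pass_fraction(T: int, N: int) -> float:
    """Fraction of the T steps that run a full forward pass (about 1/N)."""
    return sum(is_full_step(t, T, N) for t in range(T, 0, -1)) / T
```

For example, `full_pass_fraction(250, 5)` evaluates to 0.2: only 50 of the 250 steps execute the attention/MLP modules.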
2. Algorithmic Implementation
The integration of FORA into the standard diffusion inference loop operates as follows:
```
Input:  pretrained DiT (L blocks), scheduler, total steps T, interval N
Initialize: per-layer caches cache_attn[1..L], cache_mlp[1..L]
x_T ← sample noise
for t = T, T-1, ..., 1:
    if t == T or t % N == 0:        # full forward pass
        h = x_t
        for k = 1 to L:
            cache_attn[k] = Attn_k(h)
            cache_mlp[k]  = MLP_k(cache_attn[k])
            h = cache_mlp[k]
    else:                           # skip step: reuse cached features
        h = cache_mlp[L]            # final-layer cached output
    x_{t-1} = scheduler.step(h, x_t, t)
return x_0
```
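To make the control flow concrete, here is a runnable toy version of the loop, with stand-in blocks that count their invocations; all class and function names here are illustrative, not taken from the FORA implementation:

```python
class ToyBlock:
    """Stand-in for one transformer sub-module (an Attn_k or MLP_k)."""

    def __init__(self, scale):
        self.scale, self.calls = scale, 0

    def __call__(self, h):
        self.calls += 1
        return [self.scale * x for x in h]


def fora_sample(T, N, L, dim=4):
    """Toy FORA loop: returns the final sample and total sub-module calls."""
    attn = [ToyBlock(1.01) for _ in range(L)]
    mlp = [ToyBlock(0.99) for _ in range(L)]
    cache_attn, cache_mlp = [None] * L, [None] * L
    x = [1.0] * dim  # stand-in for the initial noise x_T
    for t in range(T, 0, -1):
        if t == T or t % N == 0:  # full pass: refresh every layer's cache
            h = x
            for k in range(L):
                cache_attn[k] = attn[k](h)
                cache_mlp[k] = mlp[k](cache_attn[k])
                h = cache_mlp[k]
        else:  # skipped step: reuse the final layer's cached output
            h = cache_mlp[L - 1]
        x = [0.5 * (a + b) for a, b in zip(x, h)]  # stand-in scheduler.step
    total_calls = sum(b.calls for b in attn + mlp)
    return x, total_calls
```

With `T = 20`, `N = 4`, `L = 3`, only 5 of the 20 steps execute the blocks, so 30 sub-module calls replace the baseline's 120.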
Computationally, the total cost of the $L$ transformer blocks over a full trajectory without FORA is

$$C_{\text{baseline}} = T \cdot L \cdot \left( C_{\text{attn}} + C_{\text{mlp}} \right), \qquad C_{\text{attn}} + C_{\text{mlp}} = O(n^2 d + n d^2).$$

With FORA (static caching every $N$ steps), these modules execute on only about $T/N$ of the steps:

$$C_{\text{FORA}} \approx \frac{T}{N} \cdot L \cdot \left( C_{\text{attn}} + C_{\text{mlp}} \right).$$

The resulting reduction in FLOPs is approximately $1 - 1/N$. For $N = 5$, this yields up to 80% computational savings per full sampling trajectory.
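The savings estimate can be checked numerically. This small sketch (with a hypothetical function name) counts the full-compute steps and recovers the 80% figure for $N = 5$:

```python
def fora_flop_fraction(T: int, N: int) -> float:
    """Fraction of baseline attention/MLP FLOPs FORA still spends,
    assuming a full pass at t = T and at every step with t % N == 0."""
    full_steps = sum(1 for t in range(T, 0, -1) if t == T or t % N == 0)
    return full_steps / T


# For 250 steps and N = 5, 50 of 250 steps do full compute,
# so the savings are 1 - 0.2 = 0.80, i.e. 80%.
savings = 1 - fora_flop_fraction(250, 5)
```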
3. Integration with Transformer-Based Diffusion Models
FORA operates entirely at inference-time, requiring no modification of model weights or structure. As a plug-in wrapper, it is compatible with standard transformer-based diffusion models including DiT and PixArt-α, and functions with prevalent samplers such as DDIM, DPM-Solver, and IDDPM. The method enables acceleration across both class-conditional and text-conditional generative tasks.
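The plug-in character of the method can be sketched as a thin wrapper around existing sub-modules, leaving their weights and code untouched. `CachedModule` and `set_step` below are hypothetical names illustrating the idea, not an existing library API:

```python
class CachedModule:
    """Wraps a callable sub-module; replays its last output on skip steps."""

    def __init__(self, module):
        self.module = module
        self.cached = None
        self.compute = True  # toggled by the sampling loop

    def __call__(self, *args):
        # Recompute (and refresh the cache) on full steps, or whenever
        # the cache is still empty; otherwise replay the cached output.
        if self.compute or self.cached is None:
            self.cached = self.module(*args)
        return self.cached


def set_step(wrapped_modules, t, T, N):
    """Tell every wrapper whether step t is a full-compute step."""
    full = (t == T) or (t % N == 0)
    for m in wrapped_modules:
        m.compute = full
```

A sampler would wrap each block's attention and MLP modules once, then call `set_step(...)` at the top of every denoising iteration; the model itself never changes.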
The memory overhead introduced by FORA is determined by two cached tensors of shape $n \times d$ per transformer layer, resulting in

$$M_{\text{cache}} = 2 \cdot L \cdot n \cdot d \cdot b \ \text{bytes},$$

where $b$ is the number of bytes per element. In a DiT-XL/2 configuration ($L = 28$ blocks, hidden dimension $d = 1152$), this amounts to approximately 56 MB, or about 10% of a 600 MB model. Using half precision reduces this further.
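The overhead formula is easy to evaluate. The values below (28 blocks, hidden size 1152, 256 tokens, fp32) are illustrative assumptions; the exact figure depends on resolution and precision, so treat the result as an estimate rather than the reported number:

```python
def fora_cache_bytes(L: int, n: int, d: int, bytes_per_elem: int = 4) -> int:
    """Two cached (n x d) tensors (attention + MLP) per transformer block."""
    return 2 * L * n * d * bytes_per_elem


# Illustrative DiT-XL/2-like configuration: 28 blocks, d = 1152,
# n = 256 tokens (256x256 resolution, patch size 2), single precision.
cache_mib = fora_cache_bytes(28, 256, 1152) / 2**20  # cache size in MiB
```

Passing `bytes_per_elem=2` models half precision and halves the footprint.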
4. Empirical Performance and Trade-Offs
FORA’s efficacy is characterized by the speed–quality trade-off governed by the interval $N$. On ImageNet with DiT-XL/2-G and 250 DDIM steps, results are as follows:
| Setting | Speedup | FID | IS |
|---|---|---|---|
| Baseline | 1× | 2.27 | 278.24 |
| FORA | 2.08× | 2.40 | 269.74 |
| FORA | 2.80× | 2.82 | 253.96 |
| FORA | 4.57× | 4.97 | 222.97 |
| FORA | 5.73× | 9.80 | 180.54 |
For text-conditional MS-COCO (PixArt-α, guidance 4.5):
| Sampler | Steps | Baseline FID | FORA Speedup | FORA FID |
|---|---|---|---|---|
| DPM-Solver | 20 | 28.15 | 1.54× | 38.20 |
| IDDPM | 100 | 43.54 | 1.86× | 55.30 |
Larger $N$ increases acceleration but results in progressive degradation of FID and visual detail. Empirically, for ImageNet, a moderate interval achieves the best balance (2.80× speed-up, FID = 2.82, IS = 253.96).
5. Computational and Memory Analysis
The computational complexity of the attention and MLP modules per block per step is $O(n^2 d + n d^2)$. Static caching every $N$ steps reduces redundant FLOPs by approximately $1 - 1/N$, while the associated memory overhead remains a relatively minor fraction of typical modern model sizes. For example, the caching overhead for DiT-XL/2 is approximately 56 MB in single precision. When deployed in half precision, the auxiliary memory footprint is halved, making the technique suitable for both large-scale servers and on-device applications.
6. Practical Considerations and Application Settings
FORA is adaptable to user requirements via selection of the interval $N$. This permits explicit control over the speed–quality trade-off, enabling real-time or on-device diffusion inference under tight resource constraints. The approach applies broadly to both class-conditional and text-conditional generation tasks, and is compatible with typical sampling strategies in diffusion probabilistic models. FORA’s method, reusing cached outputs without model retraining, broadens its applicability to pre-trained diffusion transformers across various domains (Selvaraju et al., 2024).
7. Summary of Contributions
FORA’s main technical contributions are as follows: (1) introduction of static layer-wise caching of attention and MLP outputs within the inference trajectory of transformer-based diffusion models; (2) compatibility as a training-free, easily integrated plug-in for mainstream DiT and PixArt-α architectures; (3) quantification of computational savings, up to 5.73× speed-up with only modest impacts on FID and IS metrics; and (4) maintenance of manageable memory overhead that is small compared to overall model size. Empirical findings suggest that moderate cache intervals (up to $N = 4$) suffice for substantial acceleration with high-fidelity output, making FORA a practical solution for the efficient deployment of high-performance diffusion transformers.