
Discrete Diffusion LLMs

Updated 1 February 2026
  • Discrete diffusion LLMs are non-autoregressive models that iteratively restore text from a fully masked state using structured noise schedules and bidirectional transformers.
  • They utilize adaptive caching strategies, including prompt and response caches, to minimize redundant computations and lower inference latency.
  • Experimental results indicate that these models achieve significant throughput gains, with speedups up to 9.1× and minimal quality degradation.

Discrete Diffusion LLMs (dLLMs) are a class of non-autoregressive sequence models that generate text by iterative, parallel denoising over discrete tokens, leveraging structured Markov corruption and bidirectional transformer architectures. Unlike conventional autoregressive models (ARMs), which factorize the joint probability of a sequence left-to-right and generate one token at a time, dLLMs represent generation as a multi-step reverse process from a highly corrupted (typically all-masked) state, iteratively restoring masked tokens in parallel via a learned denoising model. This paradigm enables high-throughput parallel decoding, global context modeling, and flexible quality-speed tradeoffs, but introduces unique algorithmic challenges and hardware bottlenecks, addressed by specialized inference and optimization techniques.

1. Mathematical Foundation and Generation Mechanism

Discrete diffusion LLMs construct a generative model over vocabulary sequences using a forward (corruption) chain and a learned reverse (denoising) chain. The forward process iteratively corrupts an input sequence $x_0 = (x_0^1, \ldots, x_0^L)$ by replacing each token independently with a special [MASK] symbol according to a noise schedule $\beta_t$:

$$q(x_t^i = v \mid x_{t-1}^i = u) = \begin{cases} 1-\beta_t & \text{if } v = u, \\ \beta_t & \text{if } v = \mathrm{[MASK]}, \\ 0 & \text{otherwise}. \end{cases}$$

The closed-form marginal after $t$ steps is

$$q(x_t^i = v \mid x_0^i = u) = \begin{cases} \alpha_t & v = u, \\ 1-\alpha_t & v = \mathrm{[MASK]}, \\ 0 & \text{otherwise}, \end{cases}$$

where $\alpha_t = \prod_{s=1}^t (1-\beta_s)$.
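
The forward corruption above can be sketched numerically; the token ids, constant schedule, and `MASK` sentinel below are illustrative assumptions, not from any specific implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1  # illustrative sentinel id for the [MASK] token

def alpha(t, betas):
    """Survival probability alpha_t = prod_{s=1}^{t} (1 - beta_s)."""
    return float(np.prod(1.0 - betas[:t]))

def corrupt(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0): each token independently survives with
    probability alpha_t and is otherwise replaced by [MASK]."""
    keep = rng.random(len(x0)) < alpha(t, betas)
    return np.where(keep, x0, MASK)

betas = np.full(10, 0.2)            # toy constant schedule, beta_t = 0.2
x0 = np.array([5, 17, 3, 42, 8])    # toy token ids
xt = corrupt(x0, t=5, betas=betas)  # alpha_5 = 0.8^5, so ~33% of tokens survive
```

Sampling directly from the marginal $q(x_t \mid x_0)$, as here, avoids simulating the chain step by step.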

The reverse process is parameterized by a Transformer, $p_\theta(x_{t-1} \mid x_t)$, which predicts the less-noised sequence given $x_t$. At each step, for each masked token, the model outputs a categorical distribution over the vocabulary.

The training objective is a weighted cross-entropy over masked positions, aligning the model's predictions with the original tokens:

$$\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, x_t} \left[ -w(t) \sum_{i=1}^L \mathbb{I}\{x_t^i = \mathrm{[MASK]}\} \log p_\theta(x_0^i \mid x_t) \right],$$

where $w(t)$ normalizes for mask rate and sampling distribution (Liu et al., 17 May 2025, Song et al., 4 Aug 2025, Yu et al., 16 Jun 2025).
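
As a sketch, the objective reduces to a masked, weighted cross-entropy; the array shapes and function name here are assumptions for illustration only:

```python
import numpy as np

def masked_denoising_loss(logits, x0, is_masked, w_t):
    """Weighted cross-entropy over masked positions only.
    logits: (L, V) per-position scores for p_theta(x_0^i | x_t);
    x0: (L,) original token ids;
    is_masked: (L,) bool, True where x_t^i = [MASK];
    w_t: scalar weight w(t) for this noise level."""
    z = logits - logits.max(axis=-1, keepdims=True)           # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(x0)), x0]                       # per-position NLL
    return w_t * (nll * is_masked).sum()                      # unmasked terms drop out
```

Unmasked positions contribute nothing, matching the indicator $\mathbb{I}\{x_t^i = \mathrm{[MASK]}\}$ in the loss.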

Inference begins with a fully masked response, $y^{(K)} = (\mathrm{[MASK]}, \ldots, \mathrm{[MASK]})$, and runs $K$ denoising steps. At each step, the Transformer predicts token distributions for masked positions using full (bidirectional) attention, then commits a subset of predictions based on rules such as confidence-based remasking or fixed block updates.
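
A minimal sketch of this decoding loop, with a stand-in `model` callable in place of the Transformer; the commit schedule (unmask roughly $1/k$ of remaining tokens per step) is an illustrative assumption, not the exact rule of any published model:

```python
import numpy as np

MASK = -1  # illustrative [MASK] token id

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def denoise(y, model, steps):
    """Iterative parallel decoding: at each round, run one full bidirectional
    pass, commit the most confident masked predictions, and leave the rest
    masked (confidence-based remasking)."""
    for k in range(steps, 0, -1):
        masked = np.flatnonzero(y == MASK)
        if masked.size == 0:
            break
        probs = softmax(model(y)[masked])      # distributions at masked slots
        pred, conf = probs.argmax(-1), probs.max(-1)
        n_commit = max(1, masked.size // k)    # unmask ~1/k of what remains
        order = np.argsort(-conf)[:n_commit]   # most confident first
        y[masked[order]] = pred[order]
    return y
```

With `steps` equal to the response length this degenerates to one token per step; smaller step budgets commit more tokens in parallel, trading quality for speed.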

2. Architectural Distinctions and Inference Bottlenecks

dLLMs employ standard multi-layer Transformer backbones but always utilize bidirectional self-attention masks, enabling each token to attend to all others at every denoising step. This contrasts with ARMs which are limited to causal (unidirectional) attention and can thus cache key and value embeddings for previously computed tokens (KV-caching). The architectural structure of dLLMs prohibits straightforward adaptation of KV-caching due to cross-attention between prompt and response at every step.

Consequently, dLLMs face pronounced computational and memory overhead:

  • Each denoising iteration requires a full forward pass over all $(M+L)$ tokens (prompt plus response), incurring $O(L \cdot (M+L)^2)$ complexity per step.
  • For $K$ steps, this leads to total compute of $O(K \cdot L \cdot (M+L)^2)$, resulting in high inference latency (Liu et al., 17 May 2025).
  • The static prompt is recomputed at each step, despite exhibiting near-identical internal representations (cosine similarity $>0.99$ across adjacent steps).
  • The majority of response tokens exhibit high stability between adjacent steps; empirical analysis shows over $90\%$ correlation in feature vectors for tokens that do not change (Liu et al., 17 May 2025, Song et al., 4 Aug 2025, Yu et al., 16 Jun 2025).

These inefficiencies limit scalability—especially for long-context inference—and motivate specialized optimization strategies.

3. Adaptive Caching and Efficient Inference Algorithms

To eliminate redundant computation and close the latency gap with autoregressive models, dLLMs employ multi-tiered adaptive caching:

Prompt Cache: Intermediate key, value, attention, and feedforward features for prompt tokens are cached and only recomputed at infrequent intervals (e.g., every $K_p$ steps). Between refreshes, cached features are directly reused.

Response Cache and Partial Updates: For response tokens, dLLM-Cache (Liu et al., 17 May 2025) uses feature-similarity-guided adaptive updates. At each step, new value projections are computed; the $\rho \cdot L$ tokens with the lowest cosine similarity to their previous states are refreshed, while the remainder reuse cached activations. This “V-verify” approach cuts per-step computation to $O(\rho L (M+L)^2)$.

Pseudocode Structure: The resulting multi-level cache refresh policy (see Algorithm 1 in (Liu et al., 17 May 2025)) alternates full and partial refreshes based on modular counters, with detailed layer-level subroutines to split and reconstruct prompt/response representations. For benchmark models (LLaDA 8B, Dream 7B), dLLM-Cache delivers up to $9.1\times$ speedup at $<2\%$ quality loss, coming within $1.3{-}3.4\times$ of ARM throughput in various settings.
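
The two refresh rules can be sketched as follows; the feature shapes, the `compute_prompt` callable, and the counter logic are simplifications under assumption, not a faithful reimplementation of Algorithm 1:

```python
import numpy as np

def v_verify(v_new, v_cached, rho):
    """Select the bottom rho*L response tokens by cosine similarity between
    fresh and cached value projections; only these get recomputed, the rest
    reuse cached activations."""
    cos = (v_new * v_cached).sum(-1) / (
        np.linalg.norm(v_new, axis=-1) * np.linalg.norm(v_cached, axis=-1) + 1e-8)
    n = max(1, int(rho * len(v_new)))
    return np.argsort(cos)[:n]                 # least similar = most drifted

def maybe_refresh_prompt(step, K_p, cached, compute_prompt):
    """Prompt cache: fully recompute prompt features only every K_p steps,
    reusing the cached features in between (modular-counter policy)."""
    if cached is None or step % K_p == 0:
        cached = compute_prompt()
    return cached
```

Smaller $\rho$ means fewer recomputed tokens per step and higher throughput, at the cost of staler cached features.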

Experimental results:

Model / Task            Vanilla TPS   dLLM-Cache TPS   Speedup   Accuracy Drop
LLaDA 8B / GSM8K        7.32          31.43            4.29×     1.7%
Dream 7B / GSM8K        5.72          39.48            6.90×     none
LLaDA 8B / LongBench    —             —                9.1×      —

Prompt cache interval $K_p$ has negligible impact on quality, while the adaptive response update parameter $\rho$ provides a flexible trade-off between latency and minor accuracy loss (Liu et al., 17 May 2025).

4. Theoretical and Practical Implications

dLLMs fundamentally differ from ARMs both in modeling assumptions and practical inference:

  • ARMs factorize probability strictly left-to-right and generate with guaranteed prefix stability, enabling trivial cache reuse and fast $O(L)$-step inference.
  • dLLMs reconstruct the entire sequence in parallel, start from non-informative priors (all-mask), and must recompute context globally at every step due to the interdependence of all tokens.
  • While ARMs accelerate by caching previously computed tokens, direct adaptation of KV caching is impossible in dLLMs; computational reuse requires exploiting the quasi-static nature of prompts and token-wise response stability.

A critical implication is that, with prompt-aware and token-similarity-based adaptive caching, dLLMs can approach ARM-level efficiency despite architectural constraints. For open-source models up to 8B parameters, empirical throughput now approaches that of LLaMA3 ARMs with only a proportional increase in hardware overhead. The main hardware constraint at very low $\rho$ remains kernel launch and data movement costs, suggesting that further hardware-aware scheduling and kernel fusion could unlock additional gains.

Scalability beyond 8B remains to be demonstrated in open-source models; current state-of-the-art scaling and hardware utilization have only been validated up to this regime (Liu et al., 17 May 2025).

5. Challenges, Limitations, and Research Directions

Despite advances in adaptive caching and throughput, certain limitations persist:

  • Prompt caching assumes a quasi-static prompt; tasks with dynamic or evolving prompts still require uniform recomputation of prompt representations across all denoising steps.
  • Fixed cache refresh parameters ($K_p$, $K_r$, $\rho$) introduce suboptimality; a dynamic, feature- or prompt-dependent policy could further reduce overhead.
  • Kernel launch and data transfer bottlenecks at small $\rho$ (very few token updates per step) prevent full realization of the theoretical speedup.
  • The efficacy of current schemes at model scales exceeding 33B parameters is unproven in open academic models.
  • Non-parallelizable components and score propagation limits (e.g., in the context of complex structured outputs) still constrain certain achievable speed/quality frontiers (Liu et al., 17 May 2025).

Emergent research aims to further:

  • Incorporate online feature similarity profiling and prompt length–aware adaptation of caching intervals in deployment.
  • Fuse kernel launches for composite operations to minimize memory and transfer overhead.
  • Integrate dLLM-Cache with other system-level innovations (e.g., global memory planners, step-skipping kernels, sparse attention) for holistic scaling to 100B+ models.
  • Systematize evaluation of quality-latency trade-offs across a broader task spectrum and larger diffusion step budgets.

dLLM-Cache establishes an essential bridge between the parallel architecture of discrete diffusion LLMs and the efficiency demanded by real-world language modeling applications.

6. Conclusion

Discrete diffusion LLMs represent a significant shift in generative modeling by enabling iterative, parallel refinement of sequences through structured denoising. Their bidirectional transformer architecture and global context modeling fundamentally preclude naïve adoption of autoregressive acceleration methods. The introduction of adaptive, training-free caching frameworks—specifically, prompt- and response-level partial update schemes guided by feature stability—enables dLLMs to approach the inference efficiency of state-of-the-art ARMs without sacrificing output quality. Scaling dLLMs further, tuning dynamic adaptation strategies, and eliminating remaining system bottlenecks remain active challenges toward full practical deployment at frontier model scales (Liu et al., 17 May 2025).
