Block Discrete Denoising Diffusion LMs
- BD3-LMs are discrete generative models that merge autoregressive and denoising diffusion methods by partitioning text into blocks for parallel sampling.
- They employ blockwise supervised fine-tuning and gradient-variance control to stabilize and speed up training while aligning the training objective with block-level generation at inference.
- A draft-then-refine strategy enhances global coherence and controllability, substantially reducing perplexity relative to one-stage blockwise diffusion baselines.
Block Discrete Denoising Diffusion Language Models (BD3-LMs) are a class of discrete generative language models that combine properties of autoregressive (AR) and denoising diffusion models to enable efficient, scalable, and parallelizable text generation. BD3-LMs partition sequences into blocks, performing parallel denoising within each block while maintaining semi-autoregressive dependencies across blocks. This approach advances controllable, flexible-length, high-quality generation, narrowing the longstanding gap between diffusion-based and autoregressive language modeling (Arriola et al., 12 Mar 2025, Ma et al., 20 Jan 2026).
1. Mathematical Formulation of Block Discrete Denoising Diffusion
BD3-LMs operate by dividing a length-$L$ token sequence $x$ into $B$ contiguous, non-overlapping blocks of size $L'$, denoted $x^{(1)}, \dots, x^{(B)}$ with $L = B \cdot L'$. The generative process in each block proceeds in two stages:
- Forward Noising: For continuous time $t \in [0, 1]$, within each block every token is kept with probability $\alpha_t$ or replaced by the mask token $m$ with probability $1 - \alpha_t$:
$$q\!\left(x_t^{(b)} \mid x^{(b)}\right) = \prod_{\ell=1}^{L'} \operatorname{Cat}\!\left(x_{t,\ell}^{(b)};\ \alpha_t\, x_\ell^{(b)} + (1 - \alpha_t)\, m\right).$$
The masking schedule $\alpha_t$ decays monotonically from $\alpha_0 \approx 1$ to $\alpha_1 \approx 0$, ensuring a controlled progression from noise-free to fully masked sequences.
- Reverse Denoising: The model, parameterized by a neural network $p_\theta$, denoises blocks in an autoregressive blockwise order:
$$p_\theta(x) = \prod_{b=1}^{B} p_\theta\!\left(x^{(b)} \mid x^{(<b)}\right),$$
where $x^{(<b)}$ denotes the previously denoised blocks, and each conditional is modeled as a block-wise categorical (masked diffusion) distribution. Training minimizes a reweighted cross-entropy loss at each noise level using a continuous ELBO analogue:
$$\mathcal{L}(x; \theta) = \sum_{b=1}^{B} \mathbb{E}_{t \sim \mathcal{U}[0,1]}\, \mathbb{E}_{q}\!\left[\frac{\alpha_t'}{1 - \alpha_t}\, \log p_\theta\!\left(x^{(b)} \mid x_t^{(b)},\, x^{(<b)}\right)\right].$$
This blockwise decomposition supports scalable training, efficient inference via block-level KV caching, and enables parallel within-block sampling (Arriola et al., 12 Mar 2025, Ma et al., 20 Jan 2026).
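As a concrete illustration of the objective above, the following sketch implements the forward masking kernel and a per-block reweighted masked cross-entropy under a linear schedule $\alpha_t = 1 - t$. The `denoiser` interface (noisy block plus clean prefix in, per-token logits out), `MASK_ID`, and the block size are illustrative assumptions rather than the exact configuration of the cited models.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0      # illustrative mask-token id
BLOCK = 4        # illustrative block size L'

def forward_noise(block_tokens, alpha_t):
    """Forward kernel: keep each token with prob alpha_t, otherwise replace it with MASK."""
    keep = torch.rand(block_tokens.shape) < alpha_t
    return torch.where(keep, block_tokens, torch.full_like(block_tokens, MASK_ID))

def block_diffusion_loss(denoiser, x):
    """Continuous-time ELBO analogue: reweighted masked cross-entropy, summed over blocks."""
    bsz, seq_len = x.shape
    t = torch.rand(bsz, 1).clamp(min=1e-3)         # t ~ U(0,1]; clamp avoids division by zero
    alpha_t = 1.0 - t                              # linear masking schedule (illustrative)
    loss = torch.zeros(())
    for start in range(0, seq_len, BLOCK):
        clean_prefix = x[:, :start]                # x^{(<b)}, kept noise-free
        block = x[:, start:start + BLOCK]          # x^{(b)}
        noisy = forward_noise(block, alpha_t)      # x_t^{(b)}
        logits = denoiser(noisy, clean_prefix)     # (bsz, BLOCK, vocab)
        ce = F.cross_entropy(logits.transpose(1, 2), block, reduction="none")
        masked = noisy.eq(MASK_ID).float()         # loss only on masked positions
        # 1/t equals the |alpha_t'|/(1 - alpha_t) ELBO coefficient under the linear schedule
        loss = loss + ((ce * masked) / t).sum() / masked.sum().clamp(min=1.0)
    return loss
```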
2. Semi-Autoregressive Blockwise Generation and Decoding Paradigm
BD3-LMs sample sequences block by block: each block is generated by running a discrete denoising diffusion process over its positions in parallel, conditioned on all previously generated blocks. The model factorizes the total sequence probability as
$$p_\theta(x) = \prod_{b=1}^{B} p_\theta\!\left(x^{(b)} \mid x^{(<b)}\right),$$
where the conditional for each block is parameterized by the reverse diffusion model. This procedure interpolates between two extremes:
- Setting $L' = 1$ recovers a fully autoregressive model (one token per block).
- Setting $L' = L$ (a single block spanning the whole sequence) yields a standard masked denoising diffusion language model with bidirectional context but no autoregressive dependency (Arriola et al., 12 Mar 2025).
The blockwise approach allows BD3-LMs to benefit from parallel within-block sampling and AR-contextual information between blocks, enabling flexible-length, scalable, and memory-efficient generation with support for KV caching.
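A hedged sketch of this decoding loop follows: each block starts fully masked and is denoised in parallel over a small number of steps, conditioned on the clean tokens generated so far (the positions that would be served from the KV cache). The `denoiser` interface, `mask_id`, and the confidence-based unmasking rule are illustrative stand-ins, not the exact samplers of the cited works.

```python
import torch

def sample_blockwise(denoiser, num_blocks, block_size, steps=8, mask_id=0):
    """Generate num_blocks blocks left to right; tokens within a block are denoised in parallel."""
    prefix = torch.empty(1, 0, dtype=torch.long)              # clean tokens generated so far
    for _ in range(num_blocks):
        block = torch.full((1, block_size), mask_id)          # start fully masked
        for step in range(steps):
            still_masked = block.eq(mask_id)
            if not still_masked.any():
                break
            logits = denoiser(block, prefix)                  # (1, block_size, vocab)
            conf, pred = logits.softmax(-1).max(-1)           # per-token confidence and argmax
            conf = conf.masked_fill(~still_masked, -1.0)      # never reconsider committed tokens
            k = max(1, int(still_masked.sum()) // (steps - step))
            idx = conf.topk(k, dim=-1).indices                # most confident masked positions
            block.scatter_(1, idx, pred.gather(1, idx))       # commit them; they stay unmasked
        prefix = torch.cat([prefix, block], dim=1)            # append the clean block to the context
    return prefix
```

Setting `block_size=1` degenerates to token-by-token AR decoding, while a single block spanning the whole sequence recovers one bidirectional diffusion pass, mirroring the two limits listed above.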
3. Training and Optimization Strategies
Several key innovations in training regimes underpin the effectiveness of BD3-LMs:
- Blockwise Supervised Fine-Tuning (Blockwise SFT): Standard SFT, which randomly masks tokens across the whole sequence, induces a mismatch with blockwise decoding at inference (noisy prefixes and leaky suffixes). Blockwise SFT addresses this by sampling a single active block for masking at each training step, freezing all preceding tokens and fully masking all subsequent ones. Loss is computed only on the active block, aligning the granularity of training with that of inference (Sun et al., 27 Aug 2025); a minimal masking sketch follows this list.
- Gradient Variance Control: Blockwise token masking leads to higher stochastic gradient variance, particularly at small block sizes. Clipped, data-driven mask-rate schedules are introduced, sampling the mask rate from a clipped interval $1 - \alpha_t \sim \mathcal{U}[\beta, \omega]$ with $0 \le \beta < \omega \le 1$, and the clipping parameters are optimized via empirical variance minimization to stabilize training and improve likelihood (Arriola et al., 12 Mar 2025).
- Mix-Scale Training: BD3-LMs trained for draft-then-refine protocols (see below) use bimodal block-size distributions during training: most updates use small drafting blocks (e.g., $L' = 4$), and a minority use large global blocks (e.g., $L' = 1024$). This enhances both high-quality local drafting and global revision capability, improves robustness, and prevents overfitting to a single block size (Ma et al., 20 Jan 2026).
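To make the training/inference alignment of Blockwise SFT concrete, the sketch below builds one training example: the prompt and all response blocks before a randomly chosen active block stay clean, the active block receives diffusion-style masking at a sampled rate, and every later block is fully masked so no future tokens leak in. `MASK_ID` and the helper name are illustrative.

```python
import random
import torch

MASK_ID = 0   # illustrative mask-token id

def blockwise_sft_example(prompt_ids, response_ids, block_size):
    """Return (model input, loss mask) for one Blockwise SFT step."""
    n_blocks = (len(response_ids) + block_size - 1) // block_size
    b = random.randrange(n_blocks)                               # sample the active block
    start = b * block_size
    end = min(start + block_size, len(response_ids))

    noisy = torch.tensor(response_ids)
    rate = random.random()                                       # noise level for the active block
    for i in range(start, end):
        if random.random() < rate:
            noisy[i] = MASK_ID                                   # diffusion-style masking
    noisy[end:] = MASK_ID                                        # suffix fully masked: no leakage

    inputs = torch.cat([torch.tensor(prompt_ids), noisy])        # prompt + clean prefix stay frozen
    loss_mask = torch.zeros(len(inputs), dtype=torch.bool)
    loss_mask[len(prompt_ids) + start : len(prompt_ids) + end] = True   # loss only on active block
    return inputs, loss_mask
```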
4. Extensions: Draft-then-Refine and Controllable Generation
Draft-then-Refine ("Diffusion-in-Diffusion")
Traditional BD3-LMs suffer from irreversibility and limited global planning due to their strict unidirectional block dependencies. The Diffusion-in-Diffusion framework overcomes these issues by a two-stage pipeline:
- Draft Stage: Rapidly generate a sequence using blockwise diffusion with small blocks. Maintain per-token "snapshot confidence" scores, capturing the denoiser's predicted certainty at the sampling step where each token is unmasked.
- Refine Stage: Identify a fraction of lowest-confidence tokens and remask them. Employ a single, large block (or a few large blocks), enabling full-sequence bidirectional denoising to revise only these positions. This process corrects long-range inconsistencies and injects global coherence without redoing the entire sample (Ma et al., 20 Jan 2026).
The protocol is as follows:
| Stage | Block Size | Masking Strategy | Contextual Range |
|---|---|---|---|
| Draft | Small (e.g., 4) | Regular blockwise diffusion | Leftward AR |
| Refine | Large (e.g., 1024) | Remask lowest-confidence tokens | Bidirectional/global |
Empirical results show that with just 26% of the fine-tuning budget, the full draft-refine method lowers OpenWebText generative PPL from 25.7 (the one-stage BD3-LM baseline and the draft-only result) to 21.9, sharply narrowing the gap to AR models (PPL = 14.1) and improving on one-stage BD3-LMs by roughly 15% relative PPL (Ma et al., 20 Jan 2026).
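The two-stage pipeline can be summarized in code. The sketch below reuses the `sample_blockwise` helper from Section 2 for the draft, then remasks the lowest-confidence fraction `rho` of tokens and fills them with one full-sequence (single large block) denoising pass. The confidence shown is a rescoring proxy; the cited work instead records each token's snapshot confidence at the step it was unmasked and may refine iteratively rather than in one shot.

```python
import torch

def draft_then_refine(denoiser, length, draft_block=4, rho=0.25, mask_id=0):
    """Stage 1: blockwise draft with small blocks. Stage 2: global low-confidence refinement."""
    # --- draft: semi-autoregressive generation with small blocks ---
    seq = sample_blockwise(denoiser, length // draft_block, draft_block)

    # proxy confidence: probability the model assigns to each drafted token (see note above)
    logits = denoiser(seq, seq[:, :0])                       # empty prefix: whole sequence is one block
    conf = logits.softmax(-1).gather(-1, seq.unsqueeze(-1)).squeeze(-1)

    # --- refine: remask the rho fraction of lowest-confidence tokens ---
    k = int(rho * seq.size(1))
    low = conf.topk(k, largest=False).indices
    refined = seq.scatter(1, low, mask_id)

    # one bidirectional full-sequence denoising pass fills only the remasked positions
    pred = denoiser(refined, refined[:, :0]).argmax(-1)
    return torch.where(refined.eq(mask_id), pred, refined)
```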
Controllable and Dynamic Block Generation
CtrlDiff extends BD3-LM by supporting dynamic, context-sensitive block sizing (using reinforcement learning) and classifier-guided conditional generation:
- Block-Size Adaptation: The next block size is chosen adaptively from features of the preceding blocks (e.g., hidden states, token entropy), maximizing efficiency while preserving output quality; the policy is optimized with reinforcement learning (PPO) (Huang et al., 20 May 2025). A schematic policy is sketched after this list.
- Conditional Generation: Guidance from an external classifier is injected into the reverse diffusion process, aligning sampled outputs with target attributes (e.g., sentiment) via efficient Taylor approximations (Huang et al., 20 May 2025). This process enables post-hoc control without retraining.
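As an illustration of the block-size adaptation idea referenced above, a small policy head can map summary features of the generated prefix to a distribution over candidate block sizes. The feature set, candidate sizes, and two-layer network below are illustrative assumptions; the cited work trains such a policy with PPO rather than using the untrained stub shown here.

```python
import torch
import torch.nn as nn

CANDIDATE_SIZES = [4, 8, 16, 32]     # illustrative block-size choices

class BlockSizePolicy(nn.Module):
    """Maps prefix features to a categorical distribution over candidate block sizes."""
    def __init__(self, n_features=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, len(CANDIDATE_SIZES)),
        )

    def forward(self, features):
        return torch.distributions.Categorical(logits=self.net(features))

def prefix_features(logits, prefix_len, max_len=1024):
    """Illustrative features: mean predictive entropy over the prefix and normalized length."""
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    return torch.stack([entropy, torch.tensor(prefix_len / max_len)])

# Usage: sample the next block size, then run the blockwise sampler with it.
policy = BlockSizePolicy()
feats = prefix_features(torch.randn(1, 16, 50257), prefix_len=16)
next_block_size = CANDIDATE_SIZES[policy(feats).sample().item()]
```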
5. Model Architecture and Implementation
BD3-LMs are implemented as Transformer-based architectures, with distinctive attention masks and inference workflows:
- Block-Causal Transformer with KV Caching: During generation, each new block attends to all previously generated blocks (via cached keys/values) and to the noisy tokens within the current block. Clean tokens use standard block-causal masks (a minimal mask construction is sketched after this list). Specialized vectorized attention enables batched denoising of all blocks in a single forward pass, improving throughput (e.g., ≈5× speedups over naive masking approaches) (Arriola et al., 12 Mar 2025).
- Parallelism and Flexibility: Once a token is unmasked during denoising, it remains so, ensuring efficient diffusion and inference. Blocks are generated until a sequence boundary token is encountered or low-entropy criteria are satisfied, supporting flexible-length sampling without special EOS/BOS handling (Arriola et al., 12 Mar 2025).
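A minimal construction of the block-causal attention pattern: each query position may attend bidirectionally within its own block and causally to all earlier blocks. The actual training-time layout (with noisy blocks also attending to a clean copy of the prefix) is more involved; this helper only illustrates the basic mask.

```python
import torch

def block_causal_mask(seq_len, block_size):
    """Boolean (seq_len, seq_len) mask: True where a query may attend to a key."""
    block_idx = torch.arange(seq_len) // block_size        # block id of every position
    # query in block i attends to key in block j iff j <= i (own block is fully visible)
    return block_idx.unsqueeze(1) >= block_idx.unsqueeze(0)

# Example: 8 tokens with block size 4 -> two bidirectional 4x4 blocks plus the lower-left quadrant.
print(block_causal_mask(8, 4).int())
```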
6. Empirical Performance and Benchmarking
Empirical studies across BD3-LM variants reveal several findings:
- Perplexity Benchmarks: On OpenWebText (context length 1024), AR models achieve PPL = 14.1, while pure masked diffusion models (e.g., SEDD, MDLM) yield PPL ≈ 46–52. BD3-LMs with block size $L' = 4$ achieve PPL = 25.7 (Arriola et al., 12 Mar 2025). The draft-then-refine protocol further reduces PPL to 21.9 (at 1.5K NFEs), using only 26% of the baseline's fine-tuning steps (Ma et al., 20 Jan 2026).
- Scaling and Ablations: The block size used for global refinement must be sufficiently large (on the order of the full 1024-token context) to yield gains, while overly aggressive remasking (too large a remasked fraction) degrades the structural fidelity of the draft samples (Ma et al., 20 Jan 2026).
- Training-Inference Alignment: Blockwise SFT consistently outperforms classical SFT, with improved accuracy on the GSM8K and MATH benchmarks and the largest gains when the training and inference block sizes match (Sun et al., 27 Aug 2025).
- Conditional and Dynamic Generation: RL-based dynamic block sizes yield 1–2% PPL reduction and ~20% fewer diffusion steps; classifier-guided sampling achieves attribute alignment improvements (+12 pp) with little PPL overhead (Huang et al., 20 May 2025).
| Model | OWT PPL (↓) | LM1B PPL (↓) | GSM8K Pass@1 (↑) |
|---|---|---|---|
| AR (GPT2-L) | 14.1 | 23.5 | – |
| SEDD | 52.0 | ≤32.68 | – |
| MDLM | 46.8 | ≤31.78 | – |
| BD3-LM (block 4) | 25.7 | 28.23 | – |
| BD3-LM + Draft-then-Refine (Ma et al., 20 Jan 2026) | 21.9 | – | – |
| CtrlDiff | 20.12 | 27.78 | – |
| Blockwise SFT | – | – | 76.0 |
7. Discussion, Trade-offs, and Future Directions
BD3-LMs offer strong practical advantages: high-quality generation that approaches AR models, scalable parallel sampling, efficient memory usage via KV caching, and flexible-length output. The design space of block size enables a trade-off between generation quality, inference parallelism, and sample diversity.
Extensions—including draft-refine pipelines, block-size adaptation, and classifier-guided sampling—greatly enhance the flexibility and controllability of these models. Notably, snapshot confidence remasking emerges as a crucial planning mechanism, outperforming random and post-hoc masking approaches.
Persistent challenges include tuning the mask-rate schedule, controlling gradient variance at small block sizes, and carefully aligning training and inference protocols. The broader relevance of BD3-LMs lies in their unification of the two paradigms, enabling new regimes of controllable, efficient, and high-fidelity language modeling across both research and application domains (Arriola et al., 12 Mar 2025, Huang et al., 20 May 2025, Sun et al., 27 Aug 2025, Ma et al., 20 Jan 2026).