Block Discrete Denoising Diffusion LMs
- BD3-LMs are discrete generative models that merge autoregressive and denoising diffusion methods by partitioning text into blocks for parallel sampling.
- They employ blockwise supervised fine-tuning and gradient-variance control to stabilize and speed up training while aligning the training objective with block-level generation at inference.
- A draft-then-refine strategy enhances global coherence and controllability, substantially reducing perplexity relative to one-stage blockwise diffusion baselines.
Block Discrete Denoising Diffusion Language Models (BD3-LMs) are a class of discrete generative language models that combine properties of autoregressive (AR) and denoising diffusion models to enable efficient, scalable, and parallelizable text generation. BD3-LMs partition sequences into blocks, performing parallel denoising within each block while maintaining semi-autoregressive dependencies across blocks. This approach advances controllable, flexible-length, high-quality generation, narrowing the longstanding gap between diffusion-based and autoregressive language modeling (Arriola et al., 12 Mar 2025, Ma et al., 20 Jan 2026).
1. Mathematical Formulation of Block Discrete Denoising Diffusion
BD3-LMs operate by dividing a length-$L$ token sequence $x$ into $B$ contiguous, non-overlapping blocks of size $L'$, denoted $x^{(1)}, \dots, x^{(B)}$ with $L = B \cdot L'$. The generative process in each block proceeds in two stages:
- Forward Noising: For continuous time $t \in [0, 1]$, within each block every token is kept with probability $\alpha_t$ or replaced by the mask token $m$ with probability $1 - \alpha_t$:
$$q\!\left(x_t^{(b)} \mid x^{(b)}\right) = \prod_{\ell=1}^{L'} \operatorname{Cat}\!\left(x_{t,\ell}^{(b)};\ \alpha_t\, x_\ell^{(b)} + (1 - \alpha_t)\, m\right).$$
The masking schedule $\alpha_t$ decays monotonically from $\alpha_0 \approx 1$ to $\alpha_1 \approx 0$, ensuring a controlled progression from noise-free to fully masked sequences.
- Reverse Denoising: The model, parameterized by a neural network $p_\theta$, denoises blocks in an autoregressive blockwise order:
$$p_\theta(x) = \prod_{b=1}^{B} p_\theta\!\left(x^{(b)} \mid x^{(<b)}\right),$$
where $x^{(<b)}$ denotes the previously denoised blocks, and each conditional is modeled as a block-wise categorical (masked diffusion) distribution. Training minimizes a reweighted cross-entropy loss at each noise level using a continuous ELBO analogue:
$$\mathcal{L}(x; \theta) = \sum_{b=1}^{B} \mathbb{E}_{t \sim \mathcal{U}[0,1]}\, \mathbb{E}_{q}\!\left[\frac{\alpha_t'}{1 - \alpha_t}\, \log p_\theta\!\left(x^{(b)} \mid x_t^{(b)},\, x^{(<b)}\right)\right].$$
This blockwise decomposition supports scalable training, efficient inference via block-level KV caching, and enables parallel within-block sampling (Arriola et al., 12 Mar 2025, Ma et al., 20 Jan 2026).
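As a concrete illustration of the objective above, the following sketch implements the forward masking kernel and a per-block reweighted masked cross-entropy under a linear schedule $\alpha_t = 1 - t$. The `denoiser` interface (noisy block plus clean prefix in, per-token logits out), `MASK_ID`, and the block size are illustrative assumptions rather than the exact configuration of the cited models.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0      # illustrative mask-token id
BLOCK = 4        # illustrative block size L'

def forward_noise(block_tokens, alpha_t):
    """Forward kernel: keep each token with prob alpha_t, otherwise replace it with MASK."""
    keep = torch.rand(block_tokens.shape) < alpha_t
    return torch.where(keep, block_tokens, torch.full_like(block_tokens, MASK_ID))

def block_diffusion_loss(denoiser, x):
    """Continuous-time ELBO analogue: reweighted masked cross-entropy, summed over blocks."""
    bsz, seq_len = x.shape
    t = torch.rand(bsz, 1).clamp(min=1e-3)         # t ~ U(0,1]; clamp avoids division by zero
    alpha_t = 1.0 - t                              # linear masking schedule (illustrative)
    loss = torch.zeros(())
    for start in range(0, seq_len, BLOCK):
        clean_prefix = x[:, :start]                # x^{(<b)}, kept noise-free
        block = x[:, start:start + BLOCK]          # x^{(b)}
        noisy = forward_noise(block, alpha_t)      # x_t^{(b)}
        logits = denoiser(noisy, clean_prefix)     # (bsz, BLOCK, vocab)
        ce = F.cross_entropy(logits.transpose(1, 2), block, reduction="none")
        masked = noisy.eq(MASK_ID).float()         # loss only on masked positions
        # 1/t equals the |alpha_t'|/(1 - alpha_t) ELBO coefficient under the linear schedule
        loss = loss + ((ce * masked) / t).sum() / masked.sum().clamp(min=1.0)
    return loss
```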
2. Semi-Autoregressive Blockwise Generation and Decoding Paradigm
BD3-LMs sample sequences block by block: each block is generated by running a discrete denoising diffusion process over its positions in parallel, conditioned on all previously generated blocks. The model factorizes the total sequence probability as
$$p_\theta(x) = \prod_{b=1}^{B} p_\theta\!\left(x^{(b)} \mid x^{(<b)}\right),$$
where the conditional for each block is parameterized by the reverse diffusion model. This procedure interpolates between two extremes:
- Setting $L' = 1$ recovers a fully autoregressive model (one token per block).
- Setting $L' = L$ (a single block spanning the whole sequence) yields a standard masked denoising diffusion language model with bidirectional context but no autoregressive dependency (Arriola et al., 12 Mar 2025).
The blockwise approach allows BD3-LMs to benefit from parallel within-block sampling and AR-contextual information between blocks, enabling flexible-length, scalable, and memory-efficient generation with support for KV caching.
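A hedged sketch of this decoding loop follows: each block starts fully masked and is denoised in parallel over a small number of steps, conditioned on the clean tokens generated so far (the positions that would be served from the KV cache). The `denoiser` interface, `mask_id`, and the confidence-based unmasking rule are illustrative stand-ins, not the exact samplers of the cited works.

```python
import torch

def sample_blockwise(denoiser, num_blocks, block_size, steps=8, mask_id=0):
    """Generate num_blocks blocks left to right; tokens within a block are denoised in parallel."""
    prefix = torch.empty(1, 0, dtype=torch.long)              # clean tokens generated so far
    for _ in range(num_blocks):
        block = torch.full((1, block_size), mask_id)          # start fully masked
        for step in range(steps):
            still_masked = block.eq(mask_id)
            if not still_masked.any():
                break
            logits = denoiser(block, prefix)                  # (1, block_size, vocab)
            conf, pred = logits.softmax(-1).max(-1)           # per-token confidence and argmax
            conf = conf.masked_fill(~still_masked, -1.0)      # never reconsider committed tokens
            k = max(1, int(still_masked.sum()) // (steps - step))
            idx = conf.topk(k, dim=-1).indices                # most confident masked positions
            block.scatter_(1, idx, pred.gather(1, idx))       # commit them; they stay unmasked
        prefix = torch.cat([prefix, block], dim=1)            # append the clean block to the context
    return prefix
```

Setting `block_size=1` degenerates to token-by-token AR decoding, while a single block spanning the whole sequence recovers one bidirectional diffusion pass, mirroring the two limits listed above.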
3. Training and Optimization Strategies
Several key innovations in training regimes underpin the effectiveness of BD3-LMs:
- Blockwise Supervised Fine-Tuning (Blockwise SFT): Standard SFT, which randomly masks tokens across the whole sequence, induces a mismatch with blockwise decoding at inference (noisy prefixes and leaky suffixes). Blockwise SFT addresses this by sampling a single active block for masking at each training step, freezing all preceding tokens and fully masking all subsequent ones. Loss is computed only on the active block, aligning the granularity of training with that of inference (Sun et al., 27 Aug 2025); a minimal masking sketch follows this list.
- Gradient Variance Control: Blockwise token masking leads to higher stochastic gradient variance, particularly at small block sizes. Clipped, data-driven mask-rate schedules are introduced, sampling the mask rate from a clipped interval $1 - \alpha_t \sim \mathcal{U}[\beta, \omega]$ with $0 \le \beta < \omega \le 1$, and the clipping parameters are optimized via empirical variance minimization to stabilize training and improve likelihood (Arriola et al., 12 Mar 2025).
- Mix-Scale Training: BD3-LMs trained for draft-then-refine protocols (see below) use bimodal block-size distributions during training: most updates use small drafting blocks (e.g., $L' = 4$), and a minority use large global blocks (e.g., $L' = 1024$). This enhances both high-quality local drafting and global revision capability, improves robustness, and prevents overfitting to a single block size (Ma et al., 20 Jan 2026).
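To make the training/inference alignment of Blockwise SFT concrete, the sketch below builds one training example: the prompt and all response blocks before a randomly chosen active block stay clean, the active block receives diffusion-style masking at a sampled rate, and every later block is fully masked so no future tokens leak in. `MASK_ID` and the helper name are illustrative.

```python
import random
import torch

MASK_ID = 0   # illustrative mask-token id

def blockwise_sft_example(prompt_ids, response_ids, block_size):
    """Return (model input, loss mask) for one Blockwise SFT step."""
    n_blocks = (len(response_ids) + block_size - 1) // block_size
    b = random.randrange(n_blocks)                               # sample the active block
    start = b * block_size
    end = min(start + block_size, len(response_ids))

    noisy = torch.tensor(response_ids)
    rate = random.random()                                       # noise level for the active block
    for i in range(start, end):
        if random.random() < rate:
            noisy[i] = MASK_ID                                   # diffusion-style masking
    noisy[end:] = MASK_ID                                        # suffix fully masked: no leakage

    inputs = torch.cat([torch.tensor(prompt_ids), noisy])        # prompt + clean prefix stay frozen
    loss_mask = torch.zeros(len(inputs), dtype=torch.bool)
    loss_mask[len(prompt_ids) + start : len(prompt_ids) + end] = True   # loss only on active block
    return inputs, loss_mask
```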
4. Extensions: Draft-then-Refine and Controllable Generation
Draft-then-Refine ("Diffusion-in-Diffusion")
Traditional BD3-LMs suffer from irreversibility and limited global planning due to their strict unidirectional block dependencies. The Diffusion-in-Diffusion framework overcomes these issues by a two-stage pipeline:
- Draft Stage: Rapidly generate a sequence using blockwise diffusion with small blocks. Maintain per-token "snapshot confidence" scores, capturing the denoiser's predicted certainty at the sampling step where each token is unmasked.
- Refine Stage: Identify a fraction of lowest-confidence tokens and remask them. Employ a single, large block (or a few large blocks), enabling full-sequence bidirectional denoising to revise only these positions. This process corrects long-range inconsistencies and injects global coherence without redoing the entire sample (Ma et al., 20 Jan 2026).
The protocol is as follows:
| Stage | Block Size | Masking Strategy | Contextual Range |
|---|---|---|---|
| Draft | Small (e.g., 4) | Regular blockwise diffusion | Leftward AR |
| Refine | Large (e.g., 1024) | Remask lowest-confidence tokens | Bidirectional/global |
Empirical results show that with just 26% of the fine-tuning budget, the full draft-refine method lowers OpenWebText generative PPL from 25.7 (the one-stage BD3-LM baseline and the draft-only result) to 21.9, sharply narrowing the gap to AR models (PPL = 14.1) and improving on one-stage BD3-LMs by roughly 15% relative PPL (Ma et al., 20 Jan 2026).
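The two-stage pipeline can be summarized in code. The sketch below reuses the `sample_blockwise` helper from Section 2 for the draft, then remasks the lowest-confidence fraction `rho` of tokens and fills them with one full-sequence (single large block) denoising pass. The confidence shown is a rescoring proxy; the cited work instead records each token's snapshot confidence at the step it was unmasked and may refine iteratively rather than in one shot.

```python
import torch

def draft_then_refine(denoiser, length, draft_block=4, rho=0.25, mask_id=0):
    """Stage 1: blockwise draft with small blocks. Stage 2: global low-confidence refinement."""
    # --- draft: semi-autoregressive generation with small blocks ---
    seq = sample_blockwise(denoiser, length // draft_block, draft_block)

    # proxy confidence: probability the model assigns to each drafted token (see note above)
    logits = denoiser(seq, seq[:, :0])                       # empty prefix: whole sequence is one block
    conf = logits.softmax(-1).gather(-1, seq.unsqueeze(-1)).squeeze(-1)

    # --- refine: remask the rho fraction of lowest-confidence tokens ---
    k = int(rho * seq.size(1))
    low = conf.topk(k, largest=False).indices
    refined = seq.scatter(1, low, mask_id)

    # one bidirectional full-sequence denoising pass fills only the remasked positions
    pred = denoiser(refined, refined[:, :0]).argmax(-1)
    return torch.where(refined.eq(mask_id), pred, refined)
```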
Controllable and Dynamic Block Generation
CtrlDiff extends BD3-LM by supporting dynamic, context-sensitive block sizing (using reinforcement learning) and classifier-guided conditional generation:
- Block-Size Adaptation: The next block size is chosen adaptively from features of the preceding blocks (e.g., hidden states, token entropy), maximizing efficiency while preserving output quality; the policy is optimized with reinforcement learning (PPO) (Huang et al., 20 May 2025). A schematic policy is sketched after this list.
- Conditional Generation: Guidance from an external classifier is injected into the reverse diffusion process, aligning sampled outputs with target attributes (e.g., sentiment) via efficient Taylor approximations (Huang et al., 20 May 2025). This process enables post-hoc control without retraining.
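As an illustration of the block-size adaptation idea referenced above, a small policy head can map summary features of the generated prefix to a distribution over candidate block sizes. The feature set, candidate sizes, and two-layer network below are illustrative assumptions; the cited work trains such a policy with PPO rather than using the untrained stub shown here.

```python
import torch
import torch.nn as nn

CANDIDATE_SIZES = [4, 8, 16, 32]     # illustrative block-size choices

class BlockSizePolicy(nn.Module):
    """Maps prefix features to a categorical distribution over candidate block sizes."""
    def __init__(self, n_features=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, len(CANDIDATE_SIZES)),
        )

    def forward(self, features):
        return torch.distributions.Categorical(logits=self.net(features))

def prefix_features(logits, prefix_len, max_len=1024):
    """Illustrative features: mean predictive entropy over the prefix and normalized length."""
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    return torch.stack([entropy, torch.tensor(prefix_len / max_len)])

# Usage: sample the next block size, then run the blockwise sampler with it.
policy = BlockSizePolicy()
feats = prefix_features(torch.randn(1, 16, 50257), prefix_len=16)
next_block_size = CANDIDATE_SIZES[policy(feats).sample().item()]
```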
5. Model Architecture and Implementation
BD3-LMs are implemented as Transformer-based architectures, with distinctive attention masks and inference workflows:
- Block-Causal Transformer with KV Caching: During generation, each new block attends to all previously generated blocks (via cached keys/values) and to the noisy tokens within the current block. Clean tokens use standard block-causal masks (a minimal mask construction is sketched after this list). Specialized vectorized attention enables batched denoising of all blocks in a single forward pass, improving throughput (e.g., ≈5× speedups over naive masking approaches) (Arriola et al., 12 Mar 2025).
- Parallelism and Flexibility: Once a token is unmasked during denoising, it remains so, ensuring efficient diffusion and inference. Blocks are generated until a sequence boundary token is encountered or low-entropy criteria are satisfied, supporting flexible-length sampling without special EOS/BOS handling (Arriola et al., 12 Mar 2025).
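A minimal construction of the block-causal attention pattern: each query position may attend bidirectionally within its own block and causally to all earlier blocks. The actual training-time layout (with noisy blocks also attending to a clean copy of the prefix) is more involved; this helper only illustrates the basic mask.

```python
import torch

def block_causal_mask(seq_len, block_size):
    """Boolean (seq_len, seq_len) mask: True where a query may attend to a key."""
    block_idx = torch.arange(seq_len) // block_size        # block id of every position
    # query in block i attends to key in block j iff j <= i (own block is fully visible)
    return block_idx.unsqueeze(1) >= block_idx.unsqueeze(0)

# Example: 8 tokens with block size 4 -> two bidirectional 4x4 blocks plus the lower-left quadrant.
print(block_causal_mask(8, 4).int())
```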
6. Empirical Performance and Benchmarking
Empirical studies across BD3-LM variants reveal several findings:
- Perplexity Benchmarks: On OpenWebText (context length 1024), AR models achieve PPL = 14.1, while pure masked diffusion models (e.g., SEDD, MDLM) yield PPL ≈ 46–52. BD3-LMs with block size $L' = 4$ achieve PPL = 25.7 (Arriola et al., 12 Mar 2025). The draft-then-refine protocol further reduces PPL to 21.9 (at 1.5K NFEs), using only 26% of the baseline's fine-tuning steps (Ma et al., 20 Jan 2026).
- Scaling and Ablations: The block size used for global refinement must be sufficiently large (on the order of the full 1024-token context) to yield gains, while overly aggressive remasking (too large a remasked fraction) degrades the structural fidelity of the draft samples (Ma et al., 20 Jan 2026).
- Training-Inference Alignment: Blockwise SFT consistently outperforms classical SFT, with improved accuracy on the GSM8K and MATH benchmarks and the largest gains when the training and inference block sizes match (Sun et al., 27 Aug 2025).
- Conditional and Dynamic Generation: RL-based dynamic block sizes yield 1–2% PPL reduction and ~20% fewer diffusion steps; classifier-guided sampling achieves attribute alignment improvements (+12 pp) with little PPL overhead (Huang et al., 20 May 2025).
| Model | OWT PPL (↓) | LM1B PPL (↓) | GSM8K Pass@1 (↑) |
|---|---|---|---|
| AR (GPT2-L) | 14.1 | 23.5 | – |
| SEDD | 52.0 | ≤32.68 | – |
| MDLM | 46.8 | ≤31.78 | – |
| BD3-LM (block 4) | 25.7 | 28.23 | – |
| BD3-LM + Draft-then-Refine (Ma et al., 20 Jan 2026) | 21.9 | – | – |
| CtrlDiff | 20.12 | 27.78 | – |
| Blockwise SFT | – | – | 76.0 |
7. Discussion, Trade-offs, and Future Directions
BD3-LMs offer strong practical advantages: high-quality generation that approaches AR models, scalable parallel sampling, efficient memory usage via KV caching, and flexible-length output. The design space of block size enables a trade-off between generation quality, inference parallelism, and sample diversity.
Extensions—including draft-refine pipelines, block-size adaptation, and classifier-guided sampling—greatly enhance the flexibility and controllability of these models. Notably, snapshot confidence remasking emerges as a crucial planning mechanism, outperforming random and post-hoc masking approaches.
Persistent challenges include tuning the mask-rate schedule, controlling gradient variance at small block sizes, and carefully aligning training and inference protocols. The broader relevance of BD3-LMs lies in their unification of the two paradigms, enabling new regimes of controllable, efficient, and high-fidelity language modeling across both research and application domains (Arriola et al., 12 Mar 2025, Huang et al., 20 May 2025, Sun et al., 27 Aug 2025, Ma et al., 20 Jan 2026).