
Auto-Regressive Masked Diffusion (ARMD)

Updated 26 January 2026
  • ARMD is a discrete generative modeling framework that unifies autoregressive methods with iterative masked diffusion, enabling order-agnostic and blockwise sampling.
  • It employs an ELBO-based training objective with random masking orders and learned scheduling to balance computational efficiency with generation quality.
  • ARMD has demonstrated success across language, vision, and video domains by achieving faster inference and lower perplexity through adaptive blockwise generation.

Auto-Regressive Masked Diffusion (ARMD) refers to a class of discrete generative models that combine the flexibility and per-token modeling strength of auto-regressive models (ARMs) with the parallelism and iterative refinement of masked diffusion. ARMD generalizes and unifies several paradigms, including order-agnostic autoregressive modeling and discrete absorbing-state diffusion; it supports learned or prescribed generation orders and enables blockwise or parallel sampling, closing the performance gap between ARMs and diffusion models while permitting strict control over the computation-quality trade-off (Hoogeboom et al., 2021, Karami et al., 23 Jan 2026, Garg et al., 24 Nov 2025, Lavenant et al., 29 Oct 2025, Kong et al., 21 Jan 2026, Gu et al., 19 Nov 2025, Weng et al., 2023).

1. Formal Foundations and Model Structure

The ARMD framework is based on learning the reverse of a forward “corruption” process that iteratively destroys (masks or noises) the input data. Specifically, for a $D$-dimensional discrete variable $x$ (e.g., a token sequence), the forward process produces a sequence of partial observations $(x^{(0)}, x^{(1)}, \dots, x^{(D)})$ in which, at each step, one coordinate is masked into a special absorbing state. This forward chain is strictly Markov and order-agnostic: the order in which tokens are masked can be fixed, random, or even learned.
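The forward absorbing chain described above can be sketched as follows, using a toy token list and a placeholder mask symbol (both illustrative assumptions, not the paper's notation):

```python
import random

MASK = "<mask>"  # illustrative absorbing state

def forward_chain(x, rng):
    """Forward corruption process: mask one coordinate per step in a random
    order, yielding the partial observations x^(0), ..., x^(D)."""
    x = list(x)
    order = list(range(len(x)))
    rng.shuffle(order)            # masking order: random here; could be fixed or learned
    chain = [list(x)]             # x^(0) is the clean data
    for i in order:
        x[i] = MASK               # absorb coordinate i
        chain.append(list(x))
    return chain

chain = forward_chain(["a", "b", "c"], random.Random(0))
# D + 1 = 4 partial observations; the last is fully masked
```

Each intermediate state differs from its predecessor in exactly one coordinate, which is what makes the chain strictly Markov.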

The generative model parameterizes the reverse (denoising) process, recovering the original data by iteratively predicting masked values from the available context. Formally, for a random permutation $\sigma \sim \mathrm{Uniform}(S_D)$ indexing the masking order, each step conditions only on the visible (unmasked) coordinates:

$$p_\theta(x_1, \dots, x_D) = \prod_{i=1}^{D} p_\theta\big(x_{\sigma(i)} \mid x_{\sigma(<i)}\big).$$

This factorization is enforced by the masking structure, not by causal masking within the architecture. All ARMD variants ultimately minimize a variational lower bound (ELBO) on $\log p_\theta(x)$, which simplifies to a single-step expected cross-entropy over (possibly random) orders (Hoogeboom et al., 2021, Garg et al., 24 Nov 2025).
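A minimal sketch of order-agnostic generation under this factorization, with a hypothetical `predict_token` standing in for the learned conditional $p_\theta$:

```python
import random

MASK = "<mask>"  # illustrative absorbing state

def predict_token(x, i):
    """Stand-in for sampling from p_theta(x_i | visible context); a real
    model would condition on the unmasked entries of x."""
    return f"tok{i}"

def generate(D, rng):
    sigma = list(range(D))
    rng.shuffle(sigma)        # generation order sigma ~ Uniform(S_D)
    x = [MASK] * D            # start from the fully masked state
    for i in sigma:           # reverse process: unmask one coordinate per step
        x[i] = predict_token(x, i)
    return x

sample = generate(5, random.Random(0))
# every coordinate is denoised exactly once; no masks remain
```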

Critically, ARMD supports block-wise unmasking: at each decoding step, a block of tokens can be denoised in parallel assuming a factorized conditional, resulting in a trade-off between sampling steps and induced bias relative to a true ARM (Lavenant et al., 29 Oct 2025, Karami et al., 23 Jan 2026).

2. Training Objectives and Decoding Orders

The loss in ARMD is derived from the ELBO, typically reducing to an expectation over random maskings:

$$\mathbb{E}_{t,\sigma}\!\left[-\frac{D}{D-t+1} \sum_{k:\,\sigma(k)\ge t} \log p_\theta\big(x_k \mid x_{\sigma(<t)}\big)\right].$$

This connects ARMD tightly to order-agnostic and any-order autoregressive objectives: for left-to-right decoding, the sequential order is canonical; under random $\sigma$, the model learns to support a uniform mixture over all possible token orders (Hoogeboom et al., 2021, Karami et al., 23 Jan 2026, Garg et al., 24 Nov 2025).
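A single Monte Carlo draw of this loss can be sketched as below; the `log_prob` callback is a stand-in for the model's log-conditional. For a toy uniform-over-vocabulary model, the estimator is exact and equals $D \log V$ for every draw of $(t, \sigma)$:

```python
import math
import random

def any_order_loss(x, log_prob, rng):
    """One Monte Carlo sample of the order-agnostic ELBO loss for sequence x.
    log_prob(x, visible, k) stands in for log p_theta(x_k | x_visible)."""
    D = len(x)
    sigma = list(range(D))
    rng.shuffle(sigma)                  # random masking order
    t = rng.randint(1, D)               # uniform timestep t in {1, ..., D}
    visible = set(sigma[:t - 1])        # coordinates unmasked strictly before step t
    masked = [k for k in range(D) if k not in visible]   # |masked| = D - t + 1
    return D / (D - t + 1) * sum(-log_prob(x, visible, k) for k in masked)

# Toy conditional: uniform over a vocabulary of size V.
V = 50

def uniform_lp(x, visible, k):
    return -math.log(V)

loss = any_order_loss([1, 2, 3, 4], uniform_lp, random.Random(0))
# loss == 4 * math.log(50) up to floating point, independent of (t, sigma)
```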

A major development is the extension to learned-order ARMD via multivariate masking schedules: each token $\ell$ is assigned an individual, parameterized masking schedule $\alpha_\ell(t)$ that determines its masking time in the forward process. The continuous-time ELBO can be shown to decompose exactly as a weighted mixture of auto-regressive losses over orders, with the weights determined by the masking schedule parameters (Garg et al., 24 Nov 2025). Parameterizing $\alpha_\ell(t)$ with learnable exponents enables the model to optimize for beneficial decoding orders, breaking the permutation invariance of standard masked diffusion.
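One plausible polynomial form of such a schedule is sketched below; the specific parameterization $\alpha_\ell(t) = (1-t)^{w_\ell}$ is an illustrative assumption, not necessarily the one used by Garg et al.:

```python
def alpha(t, w):
    """Illustrative per-token masking schedule alpha_l(t) = (1 - t)**w_l with
    a learnable exponent w_l > 0 (assumed form). alpha_l(t) is read as the
    survival probability that token l is still unmasked at time t in [0, 1]."""
    return (1.0 - t) ** w

# At any fixed t, a larger exponent gives a smaller alpha, i.e. the token
# tends to be masked earlier in the forward process and hence unmasked
# later during generation.
vals = [alpha(0.5, w) for w in (0.5, 1.0, 2.0)]
# vals is strictly decreasing in w; alpha(0, w) == 1 and alpha(1, w) == 0
```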

For practical training, scheduling strategies interpolate between pure left-to-right AR (strong inductive bias) and fully random (diffusion-style) orders. Progressive curricula begin with left-to-right, then gradually inject random permuted tokens, converging to a generalized block-causal dependency structure (Karami et al., 23 Jan 2026).

3. Model Architectures and Parallel Generation

Recent ARMD models employ permutation-equivariant, strictly causal architectures to safeguard the autoregressive factorization without explicit internal causal masks (Karami et al., 23 Jan 2026). Strictly causal self-attention masks the queries so that each token can only attend to blocks of tokens unmasked strictly before its own block. A two-stream attention stack processes both block-causal and strictly causal contexts in parallel, enabling all conditionals to be computed in a single batched forward pass. The result is that block-level autoregressive dependencies are enforced strictly by the attention mask and permutation encoding, not by left-to-right structure.
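The block-level strictly causal mask can be sketched as follows; `block_ids` is an assumed encoding that maps each token to its block's position in the (possibly permuted) generation order:

```python
def strictly_causal_mask(block_ids):
    """Block-level strictly causal attention mask: query i may attend to
    key j only if j's block is unmasked strictly before i's block in the
    generation order. block_ids[i] is the generation step of token i's block."""
    D = len(block_ids)
    return [[block_ids[j] < block_ids[i] for j in range(D)]
            for i in range(D)]

# Tokens 0-1 form block 0, tokens 2-3 form block 1: block 1 attends to
# block 0, but no token attends within its own block (strict causality).
m = strictly_causal_mask([0, 0, 1, 1])
```

Permuting `block_ids` rather than the token positions is what keeps the architecture permutation-equivariant while still enforcing the autoregressive factorization.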

Sampling from ARMD is highly flexible. At one extreme, fully sequential generation matches a classic ARM. At the other, adaptive blockwise (or strided) parallel generation fills in multiple tokens per iteration, planning the schedule via dynamic programming or data-driven heuristics (Hoogeboom et al., 2021, Karami et al., 23 Jan 2026, Lavenant et al., 29 Oct 2025). The factorization bias introduced by blockwise sampling is analytically tractable and can be minimized by optimizing the schedule with respect to the data’s information profile, i.e., the average per-token conditional entropy as a function of the number of unmasked tokens (Lavenant et al., 29 Oct 2025). Closed-form, near-optimal schedules are derived by setting the block sizes $m_t \propto \sqrt{I_t}$, where $I_t$ is the information content at step $t$.
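A minimal sketch of the $m_t \propto \sqrt{I_t}$ rule, assuming an externally estimated information profile and simple rounding to make the blocks cover the sequence:

```python
import math

def sqrt_schedule(info_profile, D):
    """Block sizes m_t proportional to sqrt(I_t), rounded so they sum to D.
    info_profile[t] is an assumed, externally estimated information content
    I_t at step t; the rounding scheme here is a simplification."""
    w = [math.sqrt(i) for i in info_profile]
    total = sum(w)
    sizes = [max(1, round(D * x / total)) for x in w]
    sizes[-1] += D - sum(sizes)      # absorb rounding error in the last step
    return sizes

sizes = sqrt_schedule([4.0, 1.0, 1.0], D=8)
# sqrt weights 2:1:1 spread over 8 tokens -> sizes == [4, 2, 2]
```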

In empirical language modeling, strided parallel generation, in which sequences are decomposed into interleaved streams and multiple tokens are generated per step, achieves substantial speed-ups with minimal degradation in perplexity (Karami et al., 23 Jan 2026).
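The interleaved-stream decomposition can be illustrated with a short sketch (a schematic of the data layout only, not the full generation loop):

```python
def strided_streams(tokens, stride):
    """Decompose a sequence into `stride` interleaved streams; one token per
    stream can then be denoised at each parallel step."""
    return [tokens[s::stride] for s in range(stride)]

streams = strided_streams(list(range(10)), 3)
# streams == [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```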

4. Post-training, Model Conversion, and Mechanistic Shifts

A practical approach to ARMD is post-training: converting existing ARMs to masked diffusion models by removing causal masking, enabling bidirectional attention, and training under the ARMD objective. This procedure preserves the pre-trained parameters and architectural substrate, ensuring efficient adaptation. Post-training shifts the model from left-to-right sequential reasoning to iterative, context-integrating denoising (Kong et al., 21 Jan 2026).

Circuit and representation analyses reveal a systematic “mechanism shift” upon ARMD post-training. For local, causal tasks, ARMD reuses ARM circuits (e.g., induction heads in transformers), while for global planning tasks ARMD recruits distributed, front-loaded early-layer computation unobserved in ARMs. This indicates that ARMD does not merely repurpose existing heuristics but constructs new algorithmic pathways for bidirectional or global structured generation (Kong et al., 21 Jan 2026).

5. Theoretical Guarantees and Schedule Optimization

The trade-offs induced by blockwise ARMD sampling have been analyzed rigorously (Lavenant et al., 29 Oct 2025). The total Kullback-Leibler (KL) divergence between the ARMD and the true ARM decomposes into a learning error (approximating conditionals) and a factorization error:

$$\mathrm{KL}[\pi \,\|\, p] \le \mathbb{E}_{x,z}\!\left[\sum_k \sum_{i\in z_k} \log \frac{\pi(x_i \mid x_{z_{<k}})}{q(x_i \mid x_{z_{<k}})}\right] + \mathbb{E}_{x,z}\!\left[\sum_k \mathrm{TC}\big(x_{z_k} \mid x_{z_{<k}}\big)\right],$$

where $\mathrm{TC}$ is the conditional total correlation, the mutual dependence among block variables. If the denoising conditionals are perfect, this reduces purely to a block-factorization bias, which can be tightly bounded as a function of the block sizes and the data's mutual information structure.

Schedules that minimize this bias are derived from the data’s intrinsic “information profile”. In the continuous limit, the optimal schedule (block sizes across steps) solves a variational problem, with closed-form solutions available. Empirically, non-uniform, information-adaptive schedules can achieve a marked reduction in KL bias without increasing computational cost.
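A toy numeric instance of the total-correlation term, for a block of two binary variables given a 2x2 joint table (an illustrative setup, not the paper's experiments):

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

def total_correlation(joint):
    """TC(X, Y) = H(X) + H(Y) - H(X, Y) for a 2x2 joint table: the extra
    nats a factorized (parallel) block update pays for ignoring the
    dependence between the two variables."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_xy = entropy([joint[i][j] for i, j in product(range(2), repeat=2)])
    return entropy(px) + entropy(py) - h_xy

# Perfectly correlated bits: denoising both in one parallel step incurs
# log 2 nats of factorization bias; independent bits incur none.
tc_corr = total_correlation([[0.5, 0.0], [0.0, 0.5]])
tc_ind = total_correlation([[0.25, 0.25], [0.25, 0.25]])
```

This is exactly why an information-adaptive schedule assigns highly dependent tokens to different blocks (or smaller blocks) while grouping near-independent tokens for parallel denoising.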

6. Applications and Empirical Results

ARMD models have been deployed across language, vision, and video domains. In language modeling, ARMD architectures combine the low-step, high-quality generation of ARMs with parallel diffusion’s throughput, surpassing discrete diffusion (e.g., D3PM) at significantly reduced step counts and matching or outperforming AR baselines in perplexity and entropy (Karami et al., 23 Jan 2026, Hoogeboom et al., 2021).

In generative tabular modeling, ARMD with learnable scheduling demonstrates lower validation losses and more interpretable, task-aligned decoding orders than fixed or random schedulers, while remaining competitive with and at times outperforming established masked diffusion and tabular-diffusion baselines (Garg et al., 24 Nov 2025).

For image and video synthesis, hybrid ARMD models—such as those realized in ART·V—leverage framewise auto-regressive masked denoising to mitigate drift, preserve appearance, and support long-term consistency in video generation, outperforming comparable one-shot or unconditional methods (Weng et al., 2023). In image domains, distillation-based approaches (MARVAL) compress the ARMD denoising trajectory into a single AR step, enabling practical reinforcement learning post-training with fast inference and preference alignment, while retaining the compositional strengths of the underlying diffusion formulations (Gu et al., 19 Nov 2025).

7. Limitations, Variants, and Future Directions

While ARMD overcomes many computational and modeling barriers, it does introduce new modeling and algorithmic trade-offs:

  • Inference speed remains below that of a single-pass ARM if full sequential decoding is required, although strided or blockwise generation partially closes this gap.
  • Blockwise decoding’s factorization bias must be carefully managed via schedule optimization; information-adaptive schedules offer provable gains, but schedule selection remains data-dependent (Lavenant et al., 29 Oct 2025).
  • Mechanistic analyses suggest that while local ARM behaviors are preserved, global circuit rewiring may not always be desirable, and generalization to further reasoning tasks awaits validation (Kong et al., 21 Jan 2026).

Further research explores adaptive masking, hybrid continuous-discrete diffusions, reinforcement learning fine-tuning, curriculum design for schedule optimization, and extending ARMD’s strictly-causal attention to richer domains (e.g., multi-modal, program synthesis, multi-hop reasoning).

ARMD thus constitutes a foundational modeling framework unifying and generalizing autoregressive and masked diffusion methods, enabling flexible, efficient, and high-fidelity discrete generative modeling across modalities and application requirements.
