Supervised framework for where-to-unmask decisions in MDLMs

Develop an efficient supervised training framework that directly leverages ground-truth target sequences to learn the where-to-unmask decision policy (i.e., which positions to unmask at each reverse step) for Masked Diffusion Language Models, providing a practical alternative both to heuristic confidence measures and to reinforcement learning with on-policy rollouts.

Background

Masked Diffusion Language Models (MDLMs) generate sequences by iteratively unmasking positions, requiring two coupled choices at each step: where-to-unmask (which positions to reveal) and what-to-unmask (which tokens to place there). Standard training explicitly optimizes token prediction (what-to-unmask), while the unmasking order (where-to-unmask) is typically left to inference-time heuristics based on model confidence.
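The confidence heuristic described above can be sketched concretely. The following is a minimal illustrative implementation (not taken from the paper): given the model's per-position token distributions, it reveals the k masked positions whose distributions are most peaked, filling each with its argmax token. All names here are assumptions for illustration.

```python
import numpy as np

def confidence_unmask_step(probs, mask, k=1):
    """One heuristic where-to-unmask step (illustrative sketch).

    probs: (seq_len, vocab) array of predicted token distributions,
           assumed to come from a single forward pass of the MDLM.
    mask:  boolean array of shape (seq_len,), True where still masked.
    Returns (positions, tokens): the k most confident masked positions
    and the argmax token to place at each.
    """
    conf = probs.max(axis=-1)                 # confidence = max probability
    conf = np.where(mask, conf, -np.inf)      # ignore already-revealed positions
    k = min(k, int(mask.sum()))               # never unmask more than remain
    positions = np.argsort(-conf)[:k]         # most confident positions first
    tokens = probs[positions].argmax(axis=-1) # what-to-unmask: argmax token
    return positions, tokens
```

In a full sampler this step would be repeated, re-running the model after each reveal, until no masked positions remain.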

Recent approaches use reinforcement learning to optimize where-to-unmask, but this entails expensive on-policy rollouts. The paper notes that, even though ground-truth sequences are available during training, no established efficient supervised framework exists for directly training the where-to-unmask decision, which it frames as a concrete open challenge.
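To make the supervised alternative concrete, here is one hypothetical labeling scheme (an assumption for illustration, not the paper's method): since the ground-truth sequence is known, a positive "unmask here" label can be assigned to the masked positions where the model already places the most probability on the correct token, and a separate position scorer could then be trained against these labels with an ordinary binary cross-entropy loss, avoiding rollouts entirely.

```python
import numpy as np

def groundtruth_unmask_labels(probs, targets, mask, k=1):
    """Hypothetical supervised where-to-unmask labels (illustrative only).

    probs:   (seq_len, vocab) model token distributions.
    targets: (seq_len,) ground-truth token ids, available at training time.
    mask:    boolean (seq_len,), True where still masked.
    Returns a float label vector: 1.0 at the k masked positions where the
    ground-truth token is most likely under the model, else 0.0.
    """
    gt_prob = probs[np.arange(len(targets)), targets]  # p(correct token | pos)
    gt_prob = np.where(mask, gt_prob, -np.inf)         # only masked positions
    k = min(k, int(mask.sum()))
    chosen = np.argsort(-gt_prob)[:k]                  # easiest-first ordering
    labels = np.zeros(len(targets), dtype=np.float32)
    labels[chosen] = 1.0
    return labels
```

These labels cost one forward pass per training step, which is the efficiency contrast with on-policy RL rollouts that the section draws.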

References

Thus, an efficient supervised framework that directly leverages ground-truth sequences for training the where-to-unmask decision remains an open challenge.

Where-to-Unmask: Ground-Truth-Guided Unmasking Order Learning for Masked Diffusion Language Models (2602.09501 - Asano et al., 10 Feb 2026) in Introduction (Section 1)