SAE Feature Dynamics Predict Decoding-Order Performance

Establish whether the dynamics of sparse autoencoder (SAE) features during denoising in diffusion language models provide a useful signal that correlates with task performance across remasking-based decoding orders: ORIGIN (random order), TOPK-MARGIN (highest-margin first), and ENTROPY (lowest-entropy first).
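The three orders differ only in how the next masked position is chosen at each denoising step. A minimal sketch of that per-step choice, assuming access to per-position logits and a boolean mask (the function and scoring details here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def next_position(logits, masked, strategy, rng=None):
    """Pick the next masked position to decode under a given order.

    logits: (seq_len, vocab) array of per-position logits.
    masked: boolean array, True where the token is still masked.
    strategy: 'origin' | 'topk_margin' | 'entropy' (order names from the
    paper; the exact scoring used there is assumed, not confirmed).
    """
    idx = np.flatnonzero(masked)
    if strategy == "origin":  # random order
        if rng is None:
            rng = np.random.default_rng()
        return int(rng.choice(idx))
    # softmax over the vocabulary for each still-masked position
    z = logits[idx] - logits[idx].max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    if strategy == "topk_margin":  # highest margin first
        top2 = np.sort(p, axis=-1)[:, -2:]   # [second-largest, largest]
        score = top2[:, 1] - top2[:, 0]      # p(top1) - p(top2)
        return int(idx[score.argmax()])
    if strategy == "entropy":  # lowest predictive entropy first
        ent = -(p * np.log(p + 1e-12)).sum(axis=-1)
        return int(idx[ent.argmin()])
    raise ValueError(f"unknown strategy: {strategy}")
```

With a confident position 0 and a uniform position 2, both confidence-based orders select position 0, while ORIGIN samples uniformly among the masked positions.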

Background

The paper tracks sparse autoencoder (SAE) features across denoising steps to analyze how different decoding-order strategies in diffusion LLMs (ORIGIN, TOPK-MARGIN, and ENTROPY) affect internal representations. It introduces metrics for the pre-mask stability and post-decode drift of the top-K SAE features, and reports that the confidence-based orders (TOPK-MARGIN and ENTROPY) tend to show larger early feature shifts for masked tokens and continued adjustment in deeper layers after a token is decoded, whereas random ordering (ORIGIN) exhibits less change.
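One plausible way to instantiate such stability and drift metrics is as the set overlap of the top-K active SAE feature indices for a token at consecutive denoising steps: high overlap before a token is unmasked indicates pre-mask stability, and low overlap after decoding indicates post-decode drift. The paper's exact definitions may differ; Jaccard overlap is an assumption used here for illustration:

```python
import numpy as np

def topk_overlap(acts_t, acts_t1, k=32):
    """Jaccard overlap of top-K SAE feature sets at two denoising steps.

    acts_t, acts_t1: 1-D arrays of SAE feature activations for one token
    at steps t and t+1. Returns a value in [0, 1]; 1 means the top-K
    feature set is unchanged. (Illustrative metric, assumed rather than
    taken verbatim from the paper.)
    """
    a = set(np.argsort(acts_t)[-k:].tolist())
    b = set(np.argsort(acts_t1)[-k:].tolist())
    return len(a & b) / len(a | b)
```

Averaging this quantity over masked tokens before their decode step, or over tokens in the steps after they are decoded, yields per-strategy stability and drift curves that can then be compared against task performance.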

Based on these observations, the authors conjecture that such SAE-based representation dynamics can serve as a predictive or correlational signal for task performance across decoding orders. Resolving this conjecture would clarify whether monitoring SAE feature trajectories can systematically inform or optimize remasking strategies in diffusion LLMs.

References

"We conjecture that these SAE-based dynamics provide a useful signal that correlates with task performance across decoding orders."

DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders (2602.05859 - Wang et al., 5 Feb 2026), Section 5.2 (Feature Dynamics Across Decoding Strategies)