Explaining the near-equivalence of masking granularities via diagonal preconditioning limitations

Determine whether the near-equivalence in validation perplexity that SkipUpdate, a masked variant of RMSProp that randomly skips parameter-block updates while keeping its moment updates dense, exhibits across masking granularities (element-wise, column-wise, and block-wise) arises from the limited ability of diagonal preconditioners, such as RMSProp's per-parameter diag(v_t)^{-1/2}, to exploit dense within-block curvature. If so, finer-grained masking would offer only marginal practical benefit in transformer pre-training.
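The update rule described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `skipupdate_step`, the `keep_prob` parameter, and the Bernoulli keep-mask construction are all assumptions; only the overall structure (dense second-moment update, masked parameter update, diagonal 1/sqrt(v) preconditioning) follows the description in the problem statement.

```python
import numpy as np

def skipupdate_step(w, grad, v, lr=1e-3, beta=0.99, eps=1e-8,
                    granularity="block", keep_prob=0.5, rng=None):
    """One RMSProp step with SkipUpdate-style masking (illustrative sketch).

    The second-moment estimate v is always updated densely; only the
    parameter update itself is masked at the chosen granularity.
    Assumes a 2-D weight matrix w; keep_prob and the Bernoulli mask
    are hypothetical details, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Dense moment update (never masked).
    v = beta * v + (1 - beta) * grad**2

    # Bernoulli keep-mask at the requested granularity.
    if granularity == "element":
        mask = rng.random(w.shape) < keep_prob
    elif granularity == "column":
        mask = np.broadcast_to(rng.random((1, w.shape[1])) < keep_prob, w.shape)
    elif granularity == "block":  # the whole matrix is kept or skipped
        mask = np.full(w.shape, rng.random() < keep_prob)
    else:
        raise ValueError(f"unknown granularity: {granularity}")

    # Diagonal preconditioning: per-parameter scaling by diag(v)^{-1/2}.
    update = lr * grad / (np.sqrt(v) + eps)
    return w - mask * update, v
```

Note that even a fully skipped block (mask all zeros) still advances its moment estimate, which is the distinguishing feature of SkipUpdate relative to simply freezing the block.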

Background

The paper reports that, in 130M-parameter Llama pre-training on the C4 dataset, SkipUpdate achieves very similar validation perplexities across the three masking granularities (element-wise, column-wise, and block-wise), while all three variants substantially outperform the RMSProp baseline.

To interpret this phenomenon, the authors conjecture that the diagonal preconditioning used by widely adopted adaptive optimizers does not sufficiently leverage dense within-block curvature, which would explain why refining the masking granularity beyond blocks yields little practical gain. Validating or refuting this explanation would clarify when structured masking is most beneficial and would guide optimizer design choices.
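A toy illustration of why a diagonal preconditioner cannot exploit dense within-block curvature (this example is mine, not the paper's): on a quadratic objective whose Hessian has large off-diagonal entries, a full (Newton) preconditioner uses the cross-coordinate curvature and reaches the minimum in one step, whereas a diagonal preconditioner sees only per-coordinate curvature and leaves a residual error.

```python
import numpy as np

# Quadratic f(w) = 0.5 * w^T H w with strong cross-coordinate
# (off-diagonal) curvature; the minimum is at w = 0.
H = np.array([[2.0, 0.9],
              [0.9, 1.0]])
w = np.array([1.0, 1.0])
grad = H @ w

# Full preconditioning (Newton step): uses the dense curvature,
# lands exactly on the minimum.
w_full = w - np.linalg.solve(H, grad)

# Diagonal preconditioning: scales each coordinate by its own
# curvature only; the off-diagonal terms are ignored.
w_diag = w - grad / np.diag(H)
```

Here `w_full` is exactly zero while `w_diag` overshoots, because the diagonal step cannot account for how the two coordinates interact. Under the conjecture, this is the regime RMSProp-style optimizers are already in, so masking at a granularity finer than the block changes little.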

References

"We conjecture that this near-equivalence reflects the limited ability of diagonal preconditioning to exploit dense within-block curvature, rendering finer-grained masking of marginal practical benefit."

On Surprising Effectiveness of Masking Updates in Adaptive Optimizers (2602.15322 - Joo et al., 17 Feb 2026), Section 2, Impacts of structured masking