Explaining the near-equivalence of masking granularities via diagonal preconditioning limitations
Determine whether the observed near-equivalence in validation perplexity across masking granularities (element-wise, column-wise, and block-wise) for SkipUpdate—a masked variant of RMSProp that randomly skips parameter-block updates while maintaining dense moment updates—arises from the limited ability of diagonal preconditioners (e.g., the per-parameter diag(v_t)^{-1/2} used by RMSProp) to exploit dense within-block curvature. If so, this would imply that finer-grained masking provides only marginal practical benefit in transformer pre-training.
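To make the setup concrete, here is a minimal NumPy sketch of one SkipUpdate-style RMSProp step as we read the description above: the second-moment accumulator v is updated densely on every step, while the parameter update itself is gated by a mask at one of the three granularities. The function and mask names are ours, for illustration only, and hyperparameters are placeholders.

```python
import numpy as np

def skipupdate_rmsprop_step(w, g, v, mask, lr=1e-3, beta=0.99, eps=1e-8):
    """One masked RMSProp step (illustrative sketch, not the reference code).

    The moment update is dense: every coordinate of v sees the gradient.
    The parameter update is masked: entries with mask == 0 are skipped.
    """
    v_new = beta * v + (1 - beta) * g**2
    w_new = w - lr * mask * g / (np.sqrt(v_new) + eps)
    return w_new, v_new

rng = np.random.default_rng(0)
d = 8
w = rng.normal(size=(d, d))
g = rng.normal(size=(d, d))
v = np.ones((d, d))

# Three masking granularities over the same (d, d) parameter block,
# each keeping roughly half of the entries:
elementwise = (rng.random((d, d)) < 0.5).astype(float)
columnwise = np.tile((rng.random((1, d)) < 0.5).astype(float), (d, 1))
blockwise = np.zeros((d, d))
blockwise[:, :d // 2] = 1.0  # update the left half-block, skip the right

for name, mask in [("element", elementwise), ("column", columnwise), ("block", blockwise)]:
    w_new, v_new = skipupdate_rmsprop_step(w, g, v, mask)
    # Skipped parameters are untouched; v is updated everywhere regardless of mask.
    assert np.allclose(w_new[mask == 0], w[mask == 0])
```

Because the preconditioner diag(v_t)^{-1/2} is diagonal, each coordinate's step depends only on its own gradient history; under this reading, which coordinates happen to share a mask (element, column, or block) cannot interact through the preconditioner, which is the mechanism the conjecture points to.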
We conjecture that this near-equivalence reflects the limited ability of diagonal preconditioning to exploit dense within-block curvature, making finer-grained masking of only marginal practical benefit.