Mechanistic differences between matrix and scalar/vector training dynamics

Characterize the mechanistic differences between the noise-dominated dynamics of matrix parameters under weight decay and the signal-dominated dynamics of scalar and vector parameters in large language model training. The goal is to explain why learnable multipliers (scalars and vectors) adapt their scale while matrix weights remain trapped at the noise–weight-decay equilibrium.

Background

The paper argues that weight decay (WD) and stochastic gradient noise induce an equilibrium norm for matrix parameters, which limits how much their scale can be learned. In contrast, scalar and vector parameters (learnable multipliers) appear signal-dominated, adapt their scale freely, and improve performance. Understanding the mechanistic distinction is necessary to formalize when and why multipliers escape the noise–WD equilibrium while matrices do not.
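As a rough illustration of what a noise–weight-decay equilibrium can look like, the sketch below simulates a deliberately simplified toy model. It is not the paper's model, and every name in it is hypothetical. It assumes plain SGD with coupled weight decay, w <- (1 - lr*wd) * w - lr * g, where the gradient g is pure isotropic Gaussian noise with no signal component. Under those assumptions the squared norm settles at lr * noise_std^2 * dim / (wd * (2 - lr*wd)) in expectation, regardless of the initial scale. Real LLM training typically uses adaptive optimizers and gradients that carry signal, so this shows only the qualitative effect.

```python
import numpy as np


def simulate_norm_sq(dim, lr, wd, noise_std, steps, init_scale, seed=0):
    """Run the toy update w <- (1 - lr*wd) * w - lr * g with pure-noise g.

    Assumption: g ~ N(0, noise_std^2 * I) at every step (no signal at all).
    Returns the squared norm of w after `steps` updates.
    """
    rng = np.random.default_rng(seed)
    w = init_scale * rng.standard_normal(dim)
    for _ in range(steps):
        grad = noise_std * rng.standard_normal(dim)  # pure noise, zero mean
        w = (1.0 - lr * wd) * w - lr * grad          # SGD with coupled weight decay
    return float(w @ w)


def predicted_norm_sq(dim, lr, wd, noise_std):
    """Stationary E[||w||^2] of the toy recursion above.

    Follows from E||w'||^2 = (1 - lr*wd)^2 * E||w||^2 + lr^2 * noise_std^2 * dim,
    solved for the fixed point.
    """
    return lr * noise_std**2 * dim / (wd * (2.0 - lr * wd))


if __name__ == "__main__":
    cfg = dict(dim=4096, lr=0.1, wd=0.1, noise_std=1.0)
    print("predicted:", predicted_norm_sq(**cfg))
    for init_scale in (0.01, 10.0):
        sim = simulate_norm_sq(steps=3000, init_scale=init_scale, **cfg)
        print(f"init_scale={init_scale}: simulated {sim:.1f}")
```

In this toy setting the final norm depends on lr, wd, and the noise level but not on the initialization, which is the sense in which a noise-dominated parameter cannot choose its own scale. Whether and how this carries over to matrix layers in real training is exactly the open question.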

References

Yet, many questions are left open. Hence, an interesting direction for future work is to mechanistically understand the difference between matrix and scalar/vector dynamics, find an empirically measurable indicator of the noise level, or build a minimal mathematical model exhibiting both training regimes.

— Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers (2601.04890 - Velikanov et al., 8 Jan 2026) in Section 6: Conclusion and discussion
