Approximate linearity of two-layer MLP training dynamics under small initialization

Establish that, for the two-layer MLP f_{W_1}(X) = W_2 φ(W_1 X), trained by gradient descent on the squared-error loss with a fixed second-layer matrix W_2, a smooth activation φ with derivative φ′(0) at zero, whitened inputs (X X^T = I_d), and a small semi-orthogonal initialization of W_1, the network remains approximately linear throughout training, in the sense that W_2 φ(W_1(t) X) ≈ φ′(0) · W_2 W_1(t) X and, consequently, the first-layer gradient G_1(t) is close to the gradient of the corresponding linearized model. Quantify the approximation error and identify conditions (on the initialization scale, step size, and training duration) under which this approximation holds uniformly over training iterations.
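As a quick numerical sanity check of the claimed approximation (a minimal sketch, not the paper's experiment: the dimensions, tanh as the smooth activation with φ′(0) = 1, the target scale, the initialization scale α, the step size, and the iteration count are all illustrative choices), one can train such a network and track the relative forward linearization error ‖W_2 φ(W_1(t) X) − φ′(0) W_2 W_1(t) X‖_F / ‖φ′(0) W_2 W_1(t) X‖_F:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m = 8, 4, 6                  # input dim, hidden width, output dim (illustrative)
alpha, lr, steps = 1e-2, 0.1, 200  # small init scale, step size, duration (illustrative)

# Whitened inputs X X^T = I_d (take n = d and X orthogonal) and small targets.
X, _ = np.linalg.qr(rng.standard_normal((d, d)))
Y = 0.1 * rng.standard_normal((m, d))

# Fixed second layer W2; small semi-orthogonal initialization of W1.
W2, _ = np.linalg.qr(rng.standard_normal((m, k)))  # W2^T W2 = I_k
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
W1 = alpha * Q.T                                   # W1 W1^T = alpha^2 I_k

phi = np.tanh                                      # smooth activation, phi'(0) = 1
dphi = lambda z: 1.0 - np.tanh(z) ** 2

rel_errs = []
for t in range(steps):
    H = W1 @ X                                     # pre-activations W1(t) X
    lin = W2 @ H                                   # linearized output, phi'(0) = 1
    rel_errs.append(np.linalg.norm(W2 @ phi(H) - lin) / np.linalg.norm(lin))
    R = W2 @ phi(H) - Y                            # residual
    G1 = (W2.T @ R * dphi(H)) @ X.T                # dL/dW1 for L = 0.5 ||R||_F^2
    W1 -= lr * G1

print(max(rel_errs))  # stays small over all iterations under small init
```

Under these (illustrative) scales the error stays small uniformly over training; increasing α or the target scale pushes the pre-activations out of the near-linear regime of tanh and the approximation degrades.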

Background

In the appendix, the authors empirically observe that the (K+1)-st singular value of the first-layer gradient remains small during training across several smooth activations, suggesting a persistent approximate low-rank structure.

To explain this observation, they hypothesize that under their setting (whitened inputs, squared-error loss, fixed second layer, smooth activation, small initialization), the network behaves approximately linearly during training, so that the nonlinear forward map is close to its first-order Taylor approximation around zero, and the resulting gradient resembles that of a linear model up to a small perturbation.
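The gradient side of this hypothesis can also be probed numerically (again a minimal sketch under illustrative assumptions: tanh activation with φ′(0) = 1, semi-orthogonal W_2, small targets): compare the true first-layer gradient G_1(t) with the gradient of the linearized model, whose difference is the perturbation term E.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, m = 8, 4, 6                  # illustrative dimensions
alpha, lr, steps = 1e-2, 0.1, 200  # illustrative scales

X, _ = np.linalg.qr(rng.standard_normal((d, d)))   # whitened: X X^T = I_d
Y = 0.1 * rng.standard_normal((m, d))
W2, _ = np.linalg.qr(rng.standard_normal((m, k)))  # fixed second layer
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
W1 = alpha * Q.T                                   # small semi-orthogonal init

phi = np.tanh
dphi = lambda z: 1.0 - np.tanh(z) ** 2             # phi'(0) = 1

E_norms, g0 = [], None
for t in range(steps):
    H = W1 @ X
    # True gradient of 0.5 ||W2 phi(W1 X) - Y||_F^2 w.r.t. W1.
    G1 = (W2.T @ (W2 @ phi(H) - Y) * dphi(H)) @ X.T
    # Gradient of the linearized model 0.5 ||phi'(0) W2 W1 X - Y||_F^2 (phi'(0) = 1).
    G1_lin = W2.T @ (W2 @ H - Y) @ X.T
    if g0 is None:
        g0 = np.linalg.norm(G1_lin)                # initial gradient scale
    E_norms.append(np.linalg.norm(G1 - G1_lin))    # perturbation term E
    W1 -= lr * G1

print(max(E_norms) / g0)  # E stays small relative to the initial gradient scale
```

Note that near convergence both gradients vanish, so E is best measured against a fixed reference scale (here the initial linearized gradient norm) rather than the instantaneous gradient norm.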

Formalizing and proving such an approximation would provide theoretical justification for the empirically observed gradient rank behavior and support the paper’s broader narrative on emergent low-rank training dynamics in smooth-activation MLPs.

References

We conjecture this is because under our setting, the network is approximately linear, i.e., \bm W_2 \phi\left( \bm W_1(t) \bm X \right) \approx \phi'(0) \cdot \bm W_2 \bm W_1(t) \bm X, and so \bm G_1(t) \approx \left( \nabla_{\bm W_1} \frac{1}{2} \left\| \phi'(0) \cdot \bm W_2 \bm W_1(t) \bm X - \bm Y \right\|_F^2 \right) + \bm E for some ``small'' perturbation term \bm E.

Emergent Low-Rank Training Dynamics in MLPs with Smooth Activations  (2602.06208 - Xu et al., 5 Feb 2026) in Appendix, Section 'Empirical Justifications for Assumption 5' (sec:smooth_assum_justify)