Approximate linearity of two-layer MLP training dynamics under small initialization
Establish that, for the two-layer MLP f_{W_1}(X) = W2 φ(W1 X) trained by gradient descent on the squared-error loss with fixed second-layer matrix W2, smooth activation φ with φ′(0) ≠ 0, whitened inputs (X X^T = I_d), and small semi-orthogonal initialization of W1, the network remains approximately linear throughout training, in the sense that W2 φ(W1(t) X) ≈ φ′(0) · W2 W1(t) X and, consequently, the first-layer gradient G1(t) is close to the gradient of the corresponding linearized model. Quantify the approximation error and identify conditions (in terms of the initialization scale, step size, and training duration) under which this approximation holds uniformly over all training iterations.
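The claimed linearization can be checked numerically. The sketch below (an illustration, not part of the problem statement) takes φ = tanh, so φ′(0) = 1, builds whitened inputs and an eps-scaled semi-orthogonal W1, and measures the relative gap between the network output and φ′(0) · W2 W1 X; all dimensions and the scale eps are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, m, n, eps = 32, 16, 8, 48, 1e-3  # dims and init scale (illustrative choices)

# Whitened inputs: X (d x n) with orthonormal rows, so X @ X.T = I_d.
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = Q.T
assert np.allclose(X @ X.T, np.eye(d))

# Small semi-orthogonal first-layer init: W1 = eps * (matrix with orthonormal rows),
# so W1 @ W1.T = eps^2 * I_h.
U, _ = np.linalg.qr(rng.standard_normal((d, h)))
W1 = eps * U.T                      # h x d
W2 = rng.standard_normal((m, h))    # fixed second layer

# phi = tanh is smooth with phi'(0) = 1; compare the network to its linearization.
out = W2 @ np.tanh(W1 @ X)
lin = 1.0 * W2 @ W1 @ X             # phi'(0) * W2 W1 X
rel_err = np.linalg.norm(out - lin) / np.linalg.norm(lin)
print(f"relative linearization error: {rel_err:.2e}")
```

Since tanh(z) = z − z³/3 + O(z⁵), the relative error scales like eps², which is why the approximation is only expected to hold at small initialization.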
We conjecture this is because, under our setting, the network is approximately linear, i.e., \bm W_2 \phi\left( \bm W_1(t) \bm X \right) \approx \phi'(0) \cdot \bm W_2 \bm W_1(t) \bm X, and so \bm G_1(t) \approx \left( \nabla_{\bm W_1} \frac{1}{2} \left\| \phi'(0) \cdot \bm W_2 \bm W_1(t) \bm X - \bm Y \right\|_F^2 \right) + \bm E for some ``small'' perturbation term \bm E.
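The gradient side of the conjecture can also be checked directly: compute the exact first-layer gradient of (1/2)‖W2 φ(W1 X) − Y‖_F² and compare it with the gradient of the linearized objective, reading off E as their difference. The sketch below again assumes φ = tanh (so φ′(0) = 1) and illustrative dimensions and init scale.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, m, n, eps = 32, 16, 8, 48, 1e-3  # illustrative dims and init scale

# Whitened inputs X (d x n, orthonormal rows) and arbitrary targets Y.
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = Q.T
Y = rng.standard_normal((m, n))

# Small semi-orthogonal W1 and fixed second layer W2.
U, _ = np.linalg.qr(rng.standard_normal((d, h)))
W1 = eps * U.T
W2 = rng.standard_normal((m, h)) / np.sqrt(h)

phi = np.tanh
dphi = lambda z: 1.0 - np.tanh(z) ** 2
dphi0 = 1.0  # phi'(0) for tanh

# Exact first-layer gradient of (1/2) ||W2 phi(W1 X) - Y||_F^2.
Z = W1 @ X
R = W2 @ phi(Z) - Y                       # residual of the true network
G1 = (dphi(Z) * (W2.T @ R)) @ X.T

# Gradient of the linearized objective (1/2) ||phi'(0) W2 W1 X - Y||_F^2.
R_lin = dphi0 * W2 @ W1 @ X - Y
G1_lin = dphi0 * W2.T @ R_lin @ X.T

# The perturbation E = G1 - G1_lin should be small relative to G1.
E = G1 - G1_lin
rel = np.linalg.norm(E) / np.linalg.norm(G1)
print(f"relative size of E: {rel:.2e}")
```

At this scale the relative size of E is tiny, consistent with the conjecture that the first-layer dynamics track the linearized model's gradient flow at small initialization.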