
Dynamic Two-Layer MLPs

Updated 16 February 2026
  • Dynamic two-layer MLPs are neural networks with smooth activations where gradient updates concentrate in a fixed low-dimensional subspace established at initialization.
  • This emergent subspace behavior supports low-rank training methods that reduce memory and compute demands while preserving model performance.
  • Theoretical analysis and empirical validation highlight that proper initialization and smooth nonlinearity are crucial for maintaining subspace invariance during training.

Dynamic two-layer multilayer perceptrons (MLPs) with smooth activations exhibit an emergent phenomenon whereby gradient-based training confines almost all weight change to a fixed low-dimensional subspace determined at initialization. This behavior, observed under both full-batch and stochastic training regimes, underlies the success of low-rank training, compression, and adaptation methods. It can be exploited via explicit architectural parameterizations to yield substantial reductions in both memory and compute cost while achieving performance parity with fully parameterized models (Xu et al., 5 Feb 2026).

1. Mathematical Formulation and Model Setup

Consider a two-layer MLP of the form

$$f_{W_1}(X) = W_2\,\phi\big(W_1 X\big),$$

where $X \in \mathbb{R}^{d \times N}$ denotes whitened input data ($XX^\top = I_d$), $Y \in \mathbb{R}^{K \times N}$ the target labels ($K < d/2$), $W_1 \in \mathbb{R}^{m \times d}$ the learned first-layer weights, and $W_2 \in \mathbb{R}^{K \times m}$ a fixed, full-row-rank second-layer matrix. The entrywise nonlinearity $\phi: \mathbb{R} \to \mathbb{R}$ is assumed smooth ($|\phi'|\le\beta$, $|\phi''|\le\mu$). The loss function is

$$L(W_1) = \frac{1}{2}\,\big\|W_2\,\phi(W_1 X) - Y\big\|_F^2,$$

and training proceeds by gradient descent (GD) or its stochastic variants on W1W_1, with updates

$$W_1(t+1) = W_1(t) - \eta\,\nabla_{W_1} L\big(W_1(t)\big).$$
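As a concrete sketch of this setup, the following minimal NumPy implementation runs gradient descent on $W_1$ with a fixed second layer. The toy dimensions, the choice of `tanh` as the smooth activation, and the step size are illustrative assumptions of ours, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, K, N = 8, 16, 3, 8                             # toy sizes with K < d/2

# Whitened inputs (X X^T = I_d) and random targets.
X = np.linalg.qr(rng.standard_normal((N, d)))[0].T
Y = rng.standard_normal((K, N))

W2 = rng.standard_normal((K, m)) / np.sqrt(m)        # fixed second layer
eps = 1e-2
W1 = eps * np.linalg.qr(rng.standard_normal((m, d)))[0]  # small semi-orthogonal init

phi = np.tanh                                        # smooth: |phi'| <= 1, |phi''| bounded
dphi = lambda z: 1.0 - np.tanh(z) ** 2

def loss(W):
    return 0.5 * np.linalg.norm(W2 @ phi(W @ X) - Y, "fro") ** 2

def grad(W):
    Z = W @ X
    Delta2 = W2 @ phi(Z) - Y                         # output residual
    Delta1 = (W2.T @ Delta2) * dphi(Z)               # backprop through phi
    return Delta1 @ X.T                              # gradient w.r.t. W1

eta = 0.1
init_loss = loss(W1)
for _ in range(300):
    W1 = W1 - eta * grad(W1)
final_loss = loss(W1)
print(init_loss, final_loss)
```

Only $W_1$ is updated here; $W_2$ stays fixed throughout, matching the model setup above.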

2. Training Dynamics and Subspace Invariance

Backpropagation yields the gradients

$$\Delta_2(t) = W_2\,\phi\big(W_1(t)X\big) - Y, \qquad \Delta_1(t) = \big(W_2^\top \Delta_2(t)\big) \odot \phi'\big(W_1(t)X\big),$$

so

$$\nabla_{W_1} L\big(W_1(t)\big) = \Delta_1(t)\,X^\top.$$
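The backpropagation formulas above can be spot-checked against a central finite difference. This is a self-contained sketch with toy dimensions and `tanh` (our choices, for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, K, N = 6, 10, 2, 6
X = np.linalg.qr(rng.standard_normal((N, d)))[0].T   # whitened: X X^T = I_d
Y = rng.standard_normal((K, N))
W2 = rng.standard_normal((K, m)) / np.sqrt(m)
W1 = 0.1 * np.linalg.qr(rng.standard_normal((m, d)))[0]

phi, dphi = np.tanh, lambda z: 1 - np.tanh(z) ** 2

def L(W):
    return 0.5 * np.linalg.norm(W2 @ phi(W @ X) - Y, "fro") ** 2

# Analytic gradient from the displayed formulas.
Z = W1 @ X
Delta2 = W2 @ phi(Z) - Y
Delta1 = (W2.T @ Delta2) * dphi(Z)
G = Delta1 @ X.T

# Central finite difference on one entry as a spot check.
i, j, h = 2, 3, 1e-6
E = np.zeros_like(W1)
E[i, j] = h
num_grad = (L(W1 + E) - L(W1 - E)) / (2 * h)
print(num_grad, G[i, j])
```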

A key finding is that GD updates to $W_1$ concentrate in a fixed $2K$-dimensional subspace determined by the initialization and the initial gradient structure. Explicitly, let the SVD of the initial gradient be $G_1(0) = L_{1,1}(0)\,\Sigma_{1,1}(0)\,R_{1,1}(0)^\top + \ldots$, with the top-$K$ subspaces specified by $L_{1,1}(0)$ and $R_{1,1}(0)$. The complement, denoted $S_{\mathrm{small}}$, has dimension $p = d - 2K$ and admits bases $V$ and $U$ for the input and output spaces, with

$$W_1(0)V = \epsilon U, \qquad W_1(0)^\top U = \epsilon V, \qquad W_1(0)^\top W_1(0) = \epsilon^2 I_d.$$

The magnitude of the projected gradients and updates within $S_{\mathrm{small}}$ remains uniformly small throughout training: $\|G_1(t)V\|_F,\ \|G_1(t)^\top U\|_F \ll \|G_1(t)\|_F$. Thus the dynamics of $W_1$ are confined to the active subspace orthogonal to $S_{\mathrm{small}}$. Perturbation-theoretic arguments (e.g., Wedin's $\sin\Theta$ theorem) ensure that this subspace drifts only minimally during training.

3. Theoretical Conditions and Guarantees for Low-Rank Dynamics

Several conditions are required to guarantee the emergence and invariance of the low-dimensional subspace:

  • Input normalization: inputs must be whitened ($XX^\top = I_d$), with $K < d/2$.
  • Smooth nonlinearities: $|\phi'|\le\beta$, $|\phi''|\le\mu$.
  • Initialization: first-layer weights are small and semi-orthogonal ($W_1(0)^\top W_1(0) = \epsilon^2 I_d$), with $\epsilon$ suitably bounded.
  • Learning rate: the step size $\eta$ is at most $O\big(1/(\|W_2\|_1 + \sigma_1^2(W_2))\big)$.

Under these hypotheses:

  • The initial gradient is approximately rank-$K$.
  • The singular subspaces of $G_1(t)$ associated with large singular values change at most exponentially slowly ($O(e^{-ct})$).
  • Projected updates into $S_{\mathrm{small}}$ are $O(\epsilon)$ initially and decay as $O(\eta e^{-ct})$ throughout training.
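The first guarantee (approximate rank-$K$ of the initial gradient) is easy to verify numerically under the stated hypotheses; for small $\epsilon$, $\phi'(W_1(0)X) \approx \phi'(0)$, so $\Delta_1(0) \approx \phi'(0)\,W_2^\top \Delta_2(0)$, which has rank at most $K$. A minimal check (toy sizes and `tanh` are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, K, N = 16, 32, 2, 16
X = np.linalg.qr(rng.standard_normal((N, d)))[0].T   # whitened inputs
Y = rng.standard_normal((K, N))
W2 = rng.standard_normal((K, m)) / np.sqrt(m)
eps = 1e-2
W1 = eps * np.linalg.qr(rng.standard_normal((m, d)))[0]

phi, dphi = np.tanh, lambda z: 1 - np.tanh(z) ** 2

# Initial gradient and its singular spectrum.
Z = W1 @ X
G0 = ((W2.T @ (W2 @ phi(Z) - Y)) * dphi(Z)) @ X.T
s = np.linalg.svd(G0, compute_uv=False)
spectral_gap = s[K] / s[0]
print(spectral_gap)                                  # tiny: G0 is approximately rank K
```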

4. Low-Rank Parameterization: Construction and Initialization

These findings motivate an explicit low-rank reparameterization of the two-layer MLP. If $V \in \mathbb{R}^{d\times r}$, $U \in \mathbb{R}^{m\times r}$ are orthonormal bases of the “active” $2K$-dimensional subspace, then

$$W_1 \longrightarrow U\,\widetilde W_1\,V^\top, \qquad W_2 \longrightarrow W_2,$$

where $\widetilde W_1 \in \mathbb{R}^{r\times r}$ contains the only learned parameters at this layer. This parameterization can be generalized to each intermediate layer of deeper MLPs.

The construction of $V, U$ leverages the initial gradient: a backward pass at $t=0$ identifies the top-$K$ subspaces of $\nabla_{W_1}L(0)$; the orthogonal complement of $S_{\mathrm{small}}$ is designated as $V$, and $U$ is set via $U = W_1(0)V/\epsilon$. Proper initialization within this subspace ($S_{\mathrm{big}}$) is crucial; random subspace initialization leads to training failure.
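The construction and training of the reparameterized layer can be sketched as follows, training only the $r \times r$ core $\widetilde W_1$ while $U$, $V$, and $W_2$ stay fixed. Dimensions, activation, and hyperparameters are illustrative assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, K, N = 16, 32, 2, 16
r = 2 * K
X = np.linalg.qr(rng.standard_normal((N, d)))[0].T   # whitened inputs
Y = rng.standard_normal((K, N))
W2 = rng.standard_normal((K, m)) / np.sqrt(m)
eps = 1e-2
Q = np.linalg.qr(rng.standard_normal((m, d)))[0]
W1 = eps * Q                                         # small semi-orthogonal init

phi, dphi = np.tanh, lambda z: 1 - np.tanh(z) ** 2

def grad_full(W):
    Z = W @ X
    return ((W2.T @ (W2 @ phi(Z) - Y)) * dphi(Z)) @ X.T

# One backward pass at t = 0 yields the active subspace bases.
Lmat, _, Rt = np.linalg.svd(grad_full(W1))
S_big = np.hstack([Rt[:K].T, Q.T @ Lmat[:, :K]])
Vr = np.linalg.qr(S_big)[0]                          # (d, r) input basis
Ur = Q @ Vr                                          # (m, r) output basis, = W1(0) Vr / eps

# Learn only the r x r core; the effective first layer is Ur @ core @ Vr.T.
core = eps * np.eye(r)                               # projection of W1(0) into S_big

def loss(core):
    return 0.5 * np.linalg.norm(W2 @ phi(Ur @ core @ Vr.T @ X) - Y, "fro") ** 2

init_loss = loss(core)
for _ in range(500):
    G = Ur.T @ grad_full(Ur @ core @ Vr.T) @ Vr      # chain rule onto the core
    core = core - 0.1 * G
final_loss = loss(core)
print(init_loss, final_loss)
```

Note that the core is initialized from the projection of $W_1(0)$ into $S_{\mathrm{big}}$, reflecting the point above that initialization within the subspace, not a random one, is what makes training succeed.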

5. Empirical Validation Across Architectures and Tasks

Extensive empirical results support the theoretical framework:

  • Synthetic two-layer networks ($d=32$, $K=4$): with smooth nonlinearities (ELU, GELU, SiLU), the orthogonal-complement subspaces ($d-2K=24$ dimensions) exhibit minimal drift (sub-degree rotation) and near-constant singular values after thousands of GD steps. In contrast, non-smooth activations (ReLU variants) lead to significant subspace and singular-value instability.
  • Deeper MLPs ($L=4$, $m=72$): intermediate-layer weight changes concentrate in the active subspace exactly as in the two-layer case.
  • Optimization variants: the phenomenon persists under minibatch SGD and Adam, with unwhitened data, and with cross-entropy loss.
  • Low-rank MLP on Fashion-MNIST ($m=d=784$, $K=10$): a low-rank MLP ($r=2K=20$) initialized via the prescribed method matches both the test loss and accuracy of the full-width model over 1500 epochs, whereas random-projection initialization fails to converge.
  • VGG-16 head on CIFAR-10: with the convolutional backbone frozen, a low-rank head ($r=2K$) matches full-head performance within ±0.5% accuracy under full fine-tuning; for classifier-only tuning, the gap narrows to ∼2% when $r$ is doubled to $4K$.

6. Architectural and Practical Implications

This subspace-concentration phenomenon enables practical architectural modifications:

  • Memory and compute reduction: wrapping each layer with low-rank factors that exclude the inactive (“dead”) directions reduces resource requirements and parameter counts without accuracy loss, provided the initialization is performed properly.
  • Compatibility with deep architectures: low-rank wrappers generalize to multi-layer MLP settings, with all intermediate weight changes remaining concentrated as in the two-layer case.
  • Fine-tuning and adaptation: provides a theoretical explanation for the empirical successes of low-rank fine-tuning techniques (e.g., LoRA), where restricting optimization to a small, precomputed subspace suffices for high performance.
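The fine-tuning connection can be illustrated with a minimal LoRA-style wrapper: a frozen dense weight plus a trainable rank-$r$ correction. This is a generic sketch in our own notation, not the paper's construction or any particular library's API:

```python
import numpy as np

class LowRankLinear:
    """Frozen dense weight plus a trainable rank-r correction (LoRA-style).

    Illustrative sketch; names and initialization scheme are our own.
    """
    def __init__(self, W, r, rng):
        self.W = W                                       # frozen (out_dim, in_dim)
        out_dim, in_dim = W.shape
        self.A = rng.standard_normal((r, in_dim)) / np.sqrt(in_dim)
        self.B = np.zeros((out_dim, r))                  # zero init: starts exactly at W

    def __call__(self, x):
        # Only A and B (r * (in_dim + out_dim) parameters) would receive updates.
        return self.W @ x + self.B @ (self.A @ x)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
layer = LowRankLinear(W, r=8, rng=rng)
x = rng.standard_normal(128)
y = layer(x)
```

With $r=8$ the trainable correction has $8 \times (128 + 64) = 1536$ parameters versus $8192$ in the frozen matrix; the subspace-concentration result suggests why so few directions can suffice.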

7. Open Problems and Research Directions

Several open theoretical and practical questions are prompted by these results:

  • The mechanism by which smooth activations stabilize the subspace, as opposed to non-smooth options (e.g., ReLU), warrants further characterization.
  • Relaxations of input whitening or small-initialization assumptions could broaden applicability.
  • Effects of additional forms of stochasticity (dropout, quantization, SGD noise) on subspace invariance remain to be systematically studied.
  • Connections to phenomena such as neural collapse, feature learning dynamics, and implicit bias in deep learning are not fully elucidated.
  • Extensions to architectures beyond MLPs—including convolutional networks, residual blocks, transformers—as well as online (incremental) subspace tracking methods are promising avenues for future research.

In summary, the analysis of dynamic two-layer MLPs with smooth nonlinearities reveals that the effective learning dynamics are confined to a sharply delimited, initialization-determined subspace. This behavior is preserved under typical training regimes and can be operationalized via low-rank parameterizations to achieve efficient, high-performing models across tasks and architectures (Xu et al., 5 Feb 2026).
