
Dynamic Two-Layer MLPs

Updated 16 February 2026
  • Dynamic two-layer MLPs are neural networks with smooth activations where gradient updates concentrate in a fixed low-dimensional subspace established at initialization.
  • This emergent subspace behavior supports low-rank training methods that reduce memory and compute demands while preserving model performance.
  • Theoretical analysis and empirical validation highlight that proper initialization and smooth nonlinearity are crucial for maintaining subspace invariance during training.

Dynamic two-layer multilayer perceptrons (MLPs) with smooth activations exhibit an emergent phenomenon whereby gradient-based training confines almost all weight change to a fixed low-dimensional subspace determined at initialization. This behavior, observed under both full-batch and stochastic training regimes, underlies the success of low-rank training, compression, and adaptation methods. It can be exploited via explicit architectural parameterizations to yield substantial reductions in both memory and compute cost while achieving performance parity with fully parameterized models (Xu et al., 5 Feb 2026).

1. Mathematical Formulation and Model Setup

Consider a two-layer MLP of the form

$$f_{W_1}(X) = W_2\,\phi\big(W_1 X\big),$$

where $X \in \mathbb{R}^{d \times N}$ denotes whitened input data ($XX^\top = I_d$), $Y \in \mathbb{R}^{K \times N}$ the target labels ($K < d/2$), $W_1 \in \mathbb{R}^{m \times d}$ the learned first-layer weights, and $W_2 \in \mathbb{R}^{K \times m}$ a fixed, full-row-rank second-layer matrix. The entrywise nonlinearity $\phi: \mathbb{R} \to \mathbb{R}$ is assumed smooth ($|\phi'|\le\beta$, $|\phi''|\le\mu$). The loss function is

$$L(W_1) = \frac{1}{2}\,\big\|W_2\,\phi(W_1 X) - Y\big\|_F^2,$$

and training proceeds by gradient descent (GD) or its stochastic variants on W1W_1, with updates

$$W_1(t+1) = W_1(t) - \eta\,\nabla_{W_1} L\big(W_1(t)\big).$$
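As a concrete sketch of this setup, the following minimal NumPy implementation runs gradient descent on $W_1$ with a fixed second layer. The toy dimensions, the choice of `tanh` as the smooth activation, and the step size are illustrative assumptions of ours, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, K, N = 8, 16, 3, 8                             # toy sizes with K < d/2

# Whitened inputs (X X^T = I_d) and random targets.
X = np.linalg.qr(rng.standard_normal((N, d)))[0].T
Y = rng.standard_normal((K, N))

W2 = rng.standard_normal((K, m)) / np.sqrt(m)        # fixed second layer
eps = 1e-2
W1 = eps * np.linalg.qr(rng.standard_normal((m, d)))[0]  # small semi-orthogonal init

phi = np.tanh                                        # smooth: |phi'| <= 1, |phi''| bounded
dphi = lambda z: 1.0 - np.tanh(z) ** 2

def loss(W):
    return 0.5 * np.linalg.norm(W2 @ phi(W @ X) - Y, "fro") ** 2

def grad(W):
    Z = W @ X
    Delta2 = W2 @ phi(Z) - Y                         # output residual
    Delta1 = (W2.T @ Delta2) * dphi(Z)               # backprop through phi
    return Delta1 @ X.T                              # gradient w.r.t. W1

eta = 0.1
init_loss = loss(W1)
for _ in range(300):
    W1 = W1 - eta * grad(W1)
final_loss = loss(W1)
print(init_loss, final_loss)
```

Only $W_1$ is updated here; $W_2$ stays fixed throughout, matching the model setup above.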

2. Training Dynamics and Subspace Invariance

Backpropagation yields the gradients

$$\Delta_2(t) = W_2\,\phi\big(W_1(t)X\big) - Y, \qquad \Delta_1(t) = \big(W_2^\top \Delta_2(t)\big) \odot \phi'\big(W_1(t)X\big),$$

so

$$\nabla_{W_1} L\big(W_1(t)\big) = \Delta_1(t)\,X^\top.$$
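The backpropagation formulas above can be spot-checked against a central finite difference. This is a self-contained sketch with toy dimensions and `tanh` (our choices, for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, K, N = 6, 10, 2, 6
X = np.linalg.qr(rng.standard_normal((N, d)))[0].T   # whitened: X X^T = I_d
Y = rng.standard_normal((K, N))
W2 = rng.standard_normal((K, m)) / np.sqrt(m)
W1 = 0.1 * np.linalg.qr(rng.standard_normal((m, d)))[0]

phi, dphi = np.tanh, lambda z: 1 - np.tanh(z) ** 2

def L(W):
    return 0.5 * np.linalg.norm(W2 @ phi(W @ X) - Y, "fro") ** 2

# Analytic gradient from the displayed formulas.
Z = W1 @ X
Delta2 = W2 @ phi(Z) - Y
Delta1 = (W2.T @ Delta2) * dphi(Z)
G = Delta1 @ X.T

# Central finite difference on one entry as a spot check.
i, j, h = 2, 3, 1e-6
E = np.zeros_like(W1)
E[i, j] = h
num_grad = (L(W1 + E) - L(W1 - E)) / (2 * h)
print(num_grad, G[i, j])
```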

A key finding is that GD updates to $W_1$ concentrate in a fixed $2K$-dimensional subspace determined by the initialization and the initial gradient structure. Explicitly, let the SVD of the initial gradient be $G_1(0) = L_{1,1}(0)\,\Sigma_{1,1}(0)\,R_{1,1}(0)^\top + \ldots$, with the top-$K$ subspaces specified by $L_{1,1}(0)$ and $R_{1,1}(0)$. The complement, denoted $S_{\mathrm{small}}$, has dimension $p = d - 2K$ and admits bases $V$ and $U$ for the input and output spaces, with

$$W_1(0)V = \epsilon U, \qquad W_1(0)^\top U = \epsilon V, \qquad W_1(0)^\top W_1(0) = \epsilon^2 I_d.$$

The magnitude of the projected gradients and updates within $S_{\mathrm{small}}$ remains uniformly small throughout training: $\|G_1(t)V\|_F,\ \|G_1(t)^\top U\|_F \ll \|G_1(t)\|_F$. Thus the dynamics of $W_1$ are confined to the active subspace orthogonal to $S_{\mathrm{small}}$. Perturbation-theoretic arguments (e.g., Wedin's $\sin\Theta$ theorem) ensure that this subspace drifts only minimally during training.

3. Theoretical Conditions and Guarantees for Low-Rank Dynamics

Several conditions are required to guarantee the emergence and invariance of the low-dimensional subspace:

  • Input normalization: inputs must be whitened ($XX^\top = I_d$), with $K < d/2$.
  • Smooth nonlinearities: $|\phi'|\le\beta$, $|\phi''|\le\mu$.
  • Initialization: first-layer weights are small and semi-orthogonal ($W_1(0)^\top W_1(0) = \epsilon^2 I_d$), with $\epsilon$ suitably bounded.
  • Learning rate: the step size $\eta$ is at most $O\big(1/(\|W_2\|_1 + \sigma_1^2(W_2))\big)$.

Under these hypotheses:

  • The initial gradient is approximately rank-$K$.
  • The singular subspaces of $G_1(t)$ associated with large singular values change at most exponentially slowly ($O(e^{-ct})$).
  • Projected updates into $S_{\mathrm{small}}$ are $O(\epsilon)$ initially and decay as $O(\eta e^{-ct})$ throughout training.
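The first guarantee (approximate rank-$K$ of the initial gradient) is easy to verify numerically under the stated hypotheses; for small $\epsilon$, $\phi'(W_1(0)X) \approx \phi'(0)$, so $\Delta_1(0) \approx \phi'(0)\,W_2^\top \Delta_2(0)$, which has rank at most $K$. A minimal check (toy sizes and `tanh` are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, K, N = 16, 32, 2, 16
X = np.linalg.qr(rng.standard_normal((N, d)))[0].T   # whitened inputs
Y = rng.standard_normal((K, N))
W2 = rng.standard_normal((K, m)) / np.sqrt(m)
eps = 1e-2
W1 = eps * np.linalg.qr(rng.standard_normal((m, d)))[0]

phi, dphi = np.tanh, lambda z: 1 - np.tanh(z) ** 2

# Initial gradient and its singular spectrum.
Z = W1 @ X
G0 = ((W2.T @ (W2 @ phi(Z) - Y)) * dphi(Z)) @ X.T
s = np.linalg.svd(G0, compute_uv=False)
spectral_gap = s[K] / s[0]
print(spectral_gap)                                  # tiny: G0 is approximately rank K
```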

4. Low-Rank Parameterization: Construction and Initialization

These findings motivate an explicit low-rank reparameterization of the two-layer MLP. If $V \in \mathbb{R}^{d\times r}$, $U \in \mathbb{R}^{m\times r}$ are orthonormal bases of the “active” $2K$-dimensional subspace, then

$$W_1 \longrightarrow U\,\widetilde W_1\,V^\top, \qquad W_2 \longrightarrow W_2,$$

where $\widetilde W_1 \in \mathbb{R}^{r\times r}$ contains the only learned parameters at this layer. This parameterization can be generalized to each intermediate layer of deeper MLPs.

The construction of $V, U$ leverages the initial gradient: a backward pass at $t=0$ identifies the top-$K$ subspaces of $\nabla_{W_1}L(0)$; the orthogonal complement of $S_{\mathrm{small}}$ is designated as $V$, and $U$ is set via $U = W_1(0)V/\epsilon$. Proper initialization within this subspace ($S_{\mathrm{big}}$) is crucial; random subspace initialization leads to training failure.
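The construction and training of the reparameterized layer can be sketched as follows, training only the $r \times r$ core $\widetilde W_1$ while $U$, $V$, and $W_2$ stay fixed. Dimensions, activation, and hyperparameters are illustrative assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, K, N = 16, 32, 2, 16
r = 2 * K
X = np.linalg.qr(rng.standard_normal((N, d)))[0].T   # whitened inputs
Y = rng.standard_normal((K, N))
W2 = rng.standard_normal((K, m)) / np.sqrt(m)
eps = 1e-2
Q = np.linalg.qr(rng.standard_normal((m, d)))[0]
W1 = eps * Q                                         # small semi-orthogonal init

phi, dphi = np.tanh, lambda z: 1 - np.tanh(z) ** 2

def grad_full(W):
    Z = W @ X
    return ((W2.T @ (W2 @ phi(Z) - Y)) * dphi(Z)) @ X.T

# One backward pass at t = 0 yields the active subspace bases.
Lmat, _, Rt = np.linalg.svd(grad_full(W1))
S_big = np.hstack([Rt[:K].T, Q.T @ Lmat[:, :K]])
Vr = np.linalg.qr(S_big)[0]                          # (d, r) input basis
Ur = Q @ Vr                                          # (m, r) output basis, = W1(0) Vr / eps

# Learn only the r x r core; the effective first layer is Ur @ core @ Vr.T.
core = eps * np.eye(r)                               # projection of W1(0) into S_big

def loss(core):
    return 0.5 * np.linalg.norm(W2 @ phi(Ur @ core @ Vr.T @ X) - Y, "fro") ** 2

init_loss = loss(core)
for _ in range(500):
    G = Ur.T @ grad_full(Ur @ core @ Vr.T) @ Vr      # chain rule onto the core
    core = core - 0.1 * G
final_loss = loss(core)
print(init_loss, final_loss)
```

Note that the core is initialized from the projection of $W_1(0)$ into $S_{\mathrm{big}}$, reflecting the point above that initialization within the subspace, not a random one, is what makes training succeed.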

5. Empirical Validation Across Architectures and Tasks

Extensive empirical results support the theoretical framework:

  • Synthetic two-layer networks ($d=32$, $K=4$): with smooth nonlinearities (ELU, GELU, SiLU), the orthogonal-complement subspaces ($d-2K=24$ dimensions) exhibit minimal drift (sub-degree rotation) and near-constant singular values after thousands of GD steps. In contrast, non-smooth activations (ReLU variants) lead to significant subspace and singular-value instability.
  • Deeper MLPs ($L=4$, $m=72$): intermediate-layer weight changes concentrate in the active subspace exactly as in the two-layer case.
  • Optimization variants: the phenomenon persists under minibatch SGD and Adam, with unwhitened data, and with cross-entropy loss.
  • Low-rank MLP on Fashion-MNIST ($m=d=784$, $K=10$): a low-rank MLP ($r=2K=20$) initialized via the prescribed method matches both the test loss and accuracy of the full-width model over 1500 epochs, whereas random-projection initialization fails to converge.
  • VGG-16 head on CIFAR-10: with the convolutional backbone frozen, a low-rank head ($r=2K$) matches full-head performance within ±0.5% accuracy under full fine-tuning; for classifier-only tuning, the gap narrows to ∼2% when $r$ is doubled to $4K$.

6. Architectural and Practical Implications

This subspace-concentration phenomenon enables practical architectural modifications:

  • Memory and compute reduction: wrapping each layer with low-rank factors that exclude the inactive (“dead”) directions reduces resource requirements and parameter counts without accuracy loss, provided the initialization is performed properly.
  • Compatibility with deep architectures: low-rank wrappers generalize to multi-layer MLP settings, with all intermediate weight changes remaining concentrated as in the two-layer case.
  • Fine-tuning and adaptation: provides a theoretical explanation for the empirical successes of low-rank fine-tuning techniques (e.g., LoRA), where restricting optimization to a small, precomputed subspace suffices for high performance.
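The fine-tuning connection can be illustrated with a minimal LoRA-style wrapper: a frozen dense weight plus a trainable rank-$r$ correction. This is a generic sketch in our own notation, not the paper's construction or any particular library's API:

```python
import numpy as np

class LowRankLinear:
    """Frozen dense weight plus a trainable rank-r correction (LoRA-style).

    Illustrative sketch; names and initialization scheme are our own.
    """
    def __init__(self, W, r, rng):
        self.W = W                                       # frozen (out_dim, in_dim)
        out_dim, in_dim = W.shape
        self.A = rng.standard_normal((r, in_dim)) / np.sqrt(in_dim)
        self.B = np.zeros((out_dim, r))                  # zero init: starts exactly at W

    def __call__(self, x):
        # Only A and B (r * (in_dim + out_dim) parameters) would receive updates.
        return self.W @ x + self.B @ (self.A @ x)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
layer = LowRankLinear(W, r=8, rng=rng)
x = rng.standard_normal(128)
y = layer(x)
```

With $r=8$ the trainable correction has $8 \times (128 + 64) = 1536$ parameters versus $8192$ in the frozen matrix; the subspace-concentration result suggests why so few directions can suffice.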

7. Open Problems and Research Directions

Several open theoretical and practical questions are prompted by these results:

  • The mechanism by which smooth activations stabilize the subspace, as opposed to non-smooth options (e.g., ReLU), warrants further characterization.
  • Relaxations of input whitening or small-initialization assumptions could broaden applicability.
  • Effects of additional forms of stochasticity (dropout, quantization, SGD noise) on subspace invariance remain to be systematically studied.
  • Connections to phenomena such as neural collapse, feature learning dynamics, and implicit bias in deep learning are not fully elucidated.
  • Extensions to architectures beyond MLPs—including convolutional networks, residual blocks, transformers—as well as online (incremental) subspace tracking methods are promising avenues for future research.

In summary, the analysis of dynamic two-layer MLPs with smooth nonlinearities reveals that the effective learning dynamics are confined to a sharply delimited, initialization-determined subspace. This behavior is preserved under typical training regimes and can be operationalized via low-rank parameterizations to achieve efficient, high-performing models across tasks and architectures (Xu et al., 5 Feb 2026).
