Nonlinear Dynamics in Deep Linear Networks
- Nonlinear dynamics in deep linear networks are emergent behaviors arising from their compositional parameterization, leading to nonlinear training dynamics despite a linear input-output map.
- The networks display distinct learning phases, with long plateaus followed by rapid transitions influenced by layer depth and initialization strategies.
- Implicit regularization drives the convergence to rank-1 weight factors and maximal margin solutions, shedding light on optimization and generalization in deep learning.
Nonlinear dynamics in deep linear networks refer to the emergent nonlinear behavior at the level of weight evolution, trajectory geometry, and implicit regularization during the training of neural architectures whose end-to-end map is strictly linear. Despite the linearity of the input-output function, the multi-layer, multiplicative parameterization induces highly nonlinear dynamics under gradient-based optimization. This nonlinearity manifests in the temporal evolution of the weights, the structure of critical points, the implicit bias towards low-complexity solutions, the phase structure of learning, and in the ability (or lack thereof) to globally embed nonlinear dynamical systems into (higher-dimensional) linear flows via deep neural parameterizations. These phenomena serve as an analytically tractable setting for exploring the foundations of generalization, optimization, and representation in deep learning.
1. Nonlinearity in the Learning Dynamics of Deep Linear Networks
The essential nonlinearity in deep linear networks (DLNs) arises not from the activation function, but from the compositional parameterization of the function class. An $L$-layer deep linear network computes $f(x) = W_L W_{L-1} \cdots W_1 x$, with the loss typically a sum over data points, e.g., mean squared error or logistic loss. The gradient flow (or gradient descent) updates, while yielding a strictly linear predictor at each moment, are governed by cubic interaction terms in the weights:

$$\tau \frac{dW_l}{dt} = \Big(\prod_{i=l+1}^{L} W_i\Big)^{\top} \Big(\Sigma^{yx} - \Big(\prod_{i=1}^{L} W_i\Big)\Sigma^{xx}\Big) \Big(\prod_{i=1}^{l-1} W_i\Big)^{\top},$$

where $\Sigma^{xx}$, $\Sigma^{yx}$ denote the input and input-output data covariances and $\tau$ is the inverse learning rate (Saxe et al., 2013).
This nonlinearity produces phase transitions, symmetry breaking, learning plateaus and bursts, as well as depth- and initialization-dependent learning speeds (Saxe et al., 2013, Basu et al., 2019).
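These cubic interactions can be observed directly in a small simulation. The sketch below (dimensions, target spectrum, and step counts are hypothetical choices, not from the cited papers) implements the covariance-form gradient for a three-layer linear network with whitened inputs ($\Sigma^{xx} = I$) and runs plain gradient descent from a small random initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 3                       # hypothetical width and depth
# Whitened inputs (Sigma_xx = I) and a target map with known singular values.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma_yx = U @ np.diag([2.5, 2.0, 1.5, 1.0]) @ V.T

Ws = [0.1 * rng.standard_normal((d, d)) for _ in range(L)]

def end_to_end(Ws):
    P = np.eye(d)
    for W in Ws:
        P = W @ P                 # composes W_L ... W_1
    return P

def grad(Ws, l):
    # dL/dW_l = (W_L...W_{l+1})^T (Wbar - Sigma_yx) (W_{l-1}...W_1)^T
    # for loss 0.5*||Wbar - Sigma_yx||_F^2: cubic in the weights.
    left = np.eye(d)
    for W in Ws[l + 1:]:
        left = W @ left
    right = np.eye(d)
    for W in Ws[:l]:
        right = W @ right
    return left.T @ (end_to_end(Ws) - Sigma_yx) @ right.T

eta, losses = 0.05, []
for _ in range(5000):
    gs = [grad(Ws, l) for l in range(L)]
    Ws = [W - eta * g for W, g in zip(Ws, gs)]
    losses.append(0.5 * np.sum((end_to_end(Ws) - Sigma_yx) ** 2))

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.2e}")
```

Although the predictor is linear in $x$ at every step, the loss trajectory is far from that of linear regression: it stalls while the product of small factors remains small, then drops sharply once the modes escape.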
2. Modal Decoupling, Plateaus, and Rapid Transitions
The nonlinear ODEs governing deep linear networks admit a reduction to mode-wise dynamics under a (typically attainable) common singular vector basis for all layers. For each mode $\alpha$, the scalar dynamics obey:

$$\tau \frac{da_l}{dt} = \Big(\prod_{i \neq l} a_i\Big)\Big(s - \prod_{i=1}^{L} a_i\Big),$$

where $a_l$ is the singular value of $W_l$ for mode $\alpha$, and $s$ is the corresponding data singular value (Saxe et al., 2013).
These equations reveal:
- Plateau phase: For small initial $a_l$, the network exhibits a long quasi-stationary period during which the composite mode strength $u = \prod_l a_l$ remains near zero.
- Explosive transition: Once $u$ leaves the neighborhood of zero and becomes comparable to $s$ (where $u$ is the product of the $a_l$ across layers), the mode is rapidly learned.
- Depth (in)dependence: For specific orthogonal or pretrained initializations that align modes across layers, learning speed is independent of depth; otherwise, it slows with increasing depth $L$ (Saxe et al., 2013).
This behavior mirrors the “learning plateaus followed by rapid transitions” observed in nonlinear networks (Saxe et al., 2013, Basu et al., 2019).
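A direct Euler integration of the scalar mode equations makes the plateau/burst structure visible. In this sketch the mode strength $s$, depth, step size, and initialization are hypothetical illustrative values:

```python
# Euler integration of the mode-wise dynamics
#   tau * da_l/dt = (prod_{i != l} a_i) * (s - prod_i a_i)
# for a single mode, with a small balanced initialization.
s, L, dt = 3.0, 3, 0.01          # data singular value, depth, step size
a = [0.01] * L                   # small balanced initial singular values
traj = []                        # composite mode strength u = prod_l a_l over time
for _ in range(20000):
    u = 1.0
    for ai in a:
        u *= ai
    traj.append(u)
    if u > 0.99 * s:
        break
    # gradient for a_l: (prod_{i != l} a_i) * (s - u) = (u / a_l) * (s - u)
    a = [ai + dt * (u / ai) * (s - u) for ai in a]

t_plateau = sum(1 for u in traj if u < 0.1 * s)           # steps below 10% of s
t_burst = sum(1 for u in traj if 0.1 * s <= u < 0.9 * s)  # 10% -> 90% transition
print(f"plateau: {t_plateau} steps, transition: {t_burst} steps")
```

With these values the quasi-stationary plateau occupies the overwhelming majority of training time, while the 10%-to-90% transition takes only a few dozen steps, reproducing the sigmoidal learning curves of Saxe et al.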
3. Implicit Regularization and Alignment Phenomena
Despite the existence of infinitely many global minima in parameter space, deep linear networks trained with gradient dynamics under strictly monotone loss converge to minima characterized by strong structural properties:
- Each weight matrix converges (in direction) to rank-1 factors, i.e., $W_l / \|W_l\| \to u_l v_l^\top$, and these factors become aligned across layers, $v_{l+1} \to u_l$ (Ji et al., 2018).
- The end-to-end map aligns with the maximum margin solution for suitable losses (e.g., logistic), enforcing maximal margin in prediction space, analogous to SVMs (Ji et al., 2018).
- Inter-layer “norm locking” ensures that all layers grow at matching rates, particularly under orthogonal or Glorot initialization, which prevents bottlenecks or runaway layers (Basu et al., 2019).
A summary of key invariant and convergence properties is presented below:
| Property | Description | Reference |
|---|---|---|
| Rank-1 factorization | Asymptotic reduction of weight matrices to rank-1 forms | (Ji et al., 2018) |
| Inter-layer alignment | Top singular vectors between adjacent layers aligned | (Ji et al., 2018) |
| Margin maximization | End-to-end predictor converges directionally to max-margin solution | (Ji et al., 2018) |
These effects are often termed “implicit regularization” and are hypothesized to underlie generalization in deep learning.
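These asymptotic properties already emerge measurably at small scale. The sketch below (data, width, and hyperparameters are hypothetical) trains a two-layer linear network with logistic loss on separable data and checks the rank-1 and inter-layer alignment tendencies:

```python
import numpy as np

rng = np.random.default_rng(1)
h = 4                                      # hypothetical hidden width
# Linearly separable 2D data with labels +-1, separable through the origin.
X = np.array([[2.0, 1.0], [3.0, 0.5], [2.5, 2.0],
              [-2.0, -1.0], [-3.0, -0.5], [-2.5, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

W = 0.1 * rng.standard_normal((h, 2))      # first layer
v = 0.1 * rng.standard_normal(h)           # second layer (scalar output)

def sv_ratio(W):
    s = np.linalg.svd(W, compute_uv=False)
    return s[1] / s[0]                     # 0 means exactly rank-1

ratio0 = sv_ratio(W)
lr = 0.1
for _ in range(20000):
    scores = X @ W.T @ v                   # end-to-end linear predictor
    g = -y / (1.0 + np.exp(y * scores))    # d(logistic loss)/d(score)
    gW = np.outer(v, g @ X) / len(y)       # note: a rank-one update to W
    gv = (W @ X.T) @ g / len(y)
    W -= lr * gW
    v -= lr * gv

loss = np.mean(np.log1p(np.exp(-y * (X @ W.T @ v))))
U, s, Vt = np.linalg.svd(W)
align = abs(v @ U[:, 0]) / np.linalg.norm(v)   # alignment of v with top factor of W
print(f"sv ratio: {ratio0:.3f} -> {sv_ratio(W):.4f}, alignment {align:.4f}")
```

As the norms diverge, growth concentrates in the top singular direction of $W$, the singular-value ratio collapses toward zero, and the second layer aligns with the first layer's top left singular vector, consistent with the table above.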
4. Critical Points, Saddle Dynamics, and Sparsity
The nonlinear geometry of DLN parameter space is characterized by a high-dimensional landscape filled with saddles, degenerate minima, and symmetry manifolds. The initialization scale determines the regime of training dynamics (Jacot et al., 2021):
- NTK regime (large variance): Initialization is close to global minima but far from saddles; learning is essentially linear.
- Saddle regime (small variance): Initialization is close to high-index saddles; training proceeds via a sequence of saddle-to-saddle escapes, incrementally increasing the rank of the realized map.
- Greedy low-rank path: Training in the saddle regime follows paths that successively increase the rank by one at each saddle passage, with long plateaus between rank transitions. This mechanism leads to highly sparse, low-rank solutions and embodies the greedy singular value pursuit.
Empirical results show convergent rank-jumps in the realized matrix at each loss plateau, and the sharpness of the regime transition increases with width (Jacot et al., 2021). These saddle-to-saddle dynamics are provably aligned with the top singular directions of the gradient at each escape, a property tightly linked with the symmetries and homogeneity of the DLN architecture.
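The greedy low-rank path can be reproduced in a small simulation. The sketch below (two layers for simplicity; target, initialization scale, and learning rate are hypothetical) uses a very small initialization and records the step at which each singular value of the realized map becomes half-learned:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
target = np.diag([3.0, 2.0, 1.0, 0.0])       # rank-3 target, distinct singular values
W1 = 1e-3 * rng.standard_normal((d, d))      # very small init -> saddle regime
W2 = 1e-3 * rng.standard_normal((d, d))

lr, half_times = 0.02, {}
svals = None
for step in range(4000):
    D = W2 @ W1 - target                     # residual of the end-to-end map
    W1, W2 = W1 - lr * (W2.T @ D), W2 - lr * (D @ W1.T)
    svals = np.linalg.svd(W2 @ W1, compute_uv=False)
    for i, s in enumerate([3.0, 2.0, 1.0]):
        if i not in half_times and svals[i] > 0.5 * s:
            half_times[i] = step             # first step mode i is half-learned

print("half-learned at steps:", half_times, "final svals:", np.round(svals, 3))
```

The singular values are picked up one at a time, largest first, with plateaus in between; the realized map passes near a sequence of saddles of increasing rank before converging to the rank-3 solution.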
5. Layerwise Growth, Synchronization, and Nonlinearization Effects
The nonlinear interdependence of layers manifests as synchronization of norm growth and distinct phase-wise learning dynamics (Basu et al., 2019). In detail:
- Layerwise norm growth is locked up to constants determined by the conserved difference invariants $W_{l+1}^\top W_{l+1} - W_l W_l^\top$. Under orthogonal or Glorot initialization, this lock is perfect.
- The learning trajectory passes through three phases: slow “drift,” rapid “explosive” transition, and final “saturation.”
- Overparameterization (increased depth at fixed width) can accelerate the initial slow phase and induce a sharper transition, though the explosive burst necessitates careful tuning of learning rates.
Upon linearization of ReLU (or gated) nonlinear networks on a per-sample basis, it is found that local symmetry of norm growth persists within samples but is broken globally by cross-sample gating effects. As training progresses and samples cluster by label, symmetry is conjectured to be partially restored, elucidating the effect of nonlinearity in deep network learning (Basu et al., 2019).
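The norm locking rests on quantities that gradient flow conserves exactly: for two layers, $W_2^\top W_2 - W_1 W_1^\top$ is such an invariant, and gradient descent preserves it up to discretization error of order the learning rate. A minimal numerical check (sizes, target, and learning rate are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
target = rng.standard_normal((d, d))
W1 = 0.3 * rng.standard_normal((d, d))
W2 = 0.3 * rng.standard_normal((d, d))

def balance(W1, W2):
    # Difference invariant: conserved exactly under gradient flow.
    return W2.T @ W2 - W1 @ W1.T

C0 = balance(W1, W2)
G0 = W2.T @ W2                     # track how much the Gram matrix itself moves
lr = 1e-3
for _ in range(2000):
    D = W2 @ W1 - target
    W1, W2 = W1 - lr * (W2.T @ D), W2 - lr * (D @ W1.T)

drift = np.linalg.norm(balance(W1, W2) - C0)   # invariant drift: discretization error
change = np.linalg.norm(W2.T @ W2 - G0)        # actual motion of the weights: O(1)
print(f"invariant drift {drift:.2e} vs Gram-matrix change {change:.2e}")
```

The weights move substantially while the difference invariant barely drifts, which is precisely the mechanism that forces all layers to grow at matching rates.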
6. Deep Linear Embeddings of Nonlinear Dynamical Systems
A separate but conceptually linked body of research concerns the use of deep networks to construct global linear representations of fundamentally nonlinear dynamical systems (Breunung et al., 2024, Ahmed et al., 2022). The central challenge is to find an invertible transformation, with encoder $\phi$ and decoder $\phi^{-1}$ parameterized as deep neural networks, such that:

$$\phi \circ F = A \circ \phi,$$

where $F$ is the nonlinear flow map and $A$ is a low-dimensional linear operator (Breunung et al., 2024).
Key findings include:
- For canonical classes of nonlinear systems—continuous families of periodic orbits, limit cycles, and systems with coexisting attractors—an explicit construction enables global linearization using a finite-dimensional linear system whose dimension is bounded in terms of the dimension of the original phase space (Breunung et al., 2024).
- The encoder and decoder maps can be effectively learned using small, feedforward, hyperbolic tangent-based architectures, trained to machine precision via Levenberg–Marquardt optimization.
- Empirical errors for such learned representations are consistently low on benchmark systems such as the nonlinear pendulum and the forced Duffing oscillator, and the approach generalizes the local scope of classical analytic linearization by leveraging global nonlinear phase space structure (Breunung et al., 2024, Ahmed et al., 2022).
- Extensions to higher-dimensional quasi-periodic tori or genuinely chaotic dynamics remain open, with expected need for additional coordinates per incommensurate frequency or infinite-dimensional (Koopman-style) lifts (Breunung et al., 2024).
In this context, DLNs function as nonlinear encoders/decoders for coordinate change, enabling global linear modeling in cases where analytic linearization is fundamentally limited.
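Although Breunung et al. learn the coordinate change with neural networks, the mechanism of $\phi \circ F = A \circ \phi$ can be illustrated with a classic hand-constructed polynomial lift from the Koopman literature (a toy example, not from the cited papers):

```python
# Toy global linearization of the nonlinear system
#   x1' = mu * x1
#   x2' = lam * (x2 - x1^2)
# The lift z = (x1, x2, x1^2) evolves under the exactly linear system
#   z' = A z,  A = [[mu, 0, 0], [0, lam, -lam], [0, 0, 2*mu]],
# since d(x1^2)/dt = 2*mu*x1^2.
mu, lam = -0.3, -1.0

def rk4(f, state, dt, steps):
    # Generic fixed-step 4th-order Runge-Kutta integrator.
    for _ in range(steps):
        k1 = f(state)
        k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)])
        k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)])
        k4 = f([s + dt * k for s, k in zip(state, k3)])
        state = [s + dt * (a + 2 * b + 2 * c + e) / 6
                 for s, a, b, c, e in zip(state, k1, k2, k3, k4)]
    return state

f_nonlin = lambda x: [mu * x[0], lam * (x[1] - x[0] ** 2)]
f_linear = lambda z: [mu * z[0], lam * z[1] - lam * z[2], 2 * mu * z[2]]

x0 = [1.0, 0.5]
xT = rk4(f_nonlin, x0, 0.01, 200)                           # nonlinear flow to T = 2
zT = rk4(f_linear, [x0[0], x0[1], x0[0] ** 2], 0.01, 200)   # linear flow of the lift
err = max(abs(xT[0] - zT[0]), abs(xT[1] - zT[1]), abs(xT[0] ** 2 - zT[2]))
print(f"max discrepancy between nonlinear flow and linear lift: {err:.2e}")
```

Here the encoder is the explicit polynomial map $(x_1, x_2) \mapsto (x_1, x_2, x_1^2)$; the neural approaches replace such hand-derived coordinates with learned ones for systems where no closed-form lift is known.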
7. Broader Implications and Limitations
Nonlinear dynamics in deep linear networks have far-reaching implications:
- They provide an exactly solvable foundation for analyzing training-phase structure, layerwise synchronization, implicit complexity control, and initialization effects in modern deep learning (Saxe et al., 2013, Ji et al., 2018, Basu et al., 2019, Jacot et al., 2021).
- Analytical tractability enables formal analysis of generalization dynamics, transfer learning benefits and negative interference, and the construction of non-gradient algorithms that can outperform traditional GD in certain regimes (Lampinen et al., 2018).
- The techniques for global linearization of nonlinear systems via learned deep encoders demonstrate a bridge between classical dynamical systems theory and modern deep learning architectures, with potential for broad application in control, identification, and scientific computing (Breunung et al., 2024, Ahmed et al., 2022).
However, several limitations remain:
- The structure of the loss landscape and the phase transitions induced by initialization and architecture are specific to the linear or piecewise-linear setting.
- Extensions to general nonlinearities, chaotic systems, or arbitrary global embeddings require fundamentally new mathematical approaches; many open questions remain, particularly concerning the tractability of high-dimensional, non-linearly gated or attention-based architectures.
- Most direct analytic results pertain to continuous-time and infinitely wide limits; finite-sample, finite-width corrections require more nuanced treatment (Li et al., 2022).
Nonlinear dynamics in deep linear networks thus epitomize the subtlety and complexity of deep learning, illustrating how simple function classes can engender rich, highly nonlinear optimization phenomena and serve as tractable proxies for unraveling the implicit mechanisms at the heart of contemporary AI.