
Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures

Published 23 Dec 2025 in cs.LG (arXiv:2512.20607v1)

Abstract: Neural networks trained with gradient descent often learn solutions of increasing complexity over time, a phenomenon known as simplicity bias. Despite being widely observed across architectures, existing theoretical treatments lack a unifying framework. We present a theoretical framework that explains a simplicity bias arising from saddle-to-saddle learning dynamics for a general class of neural networks, incorporating fully-connected, convolutional, and attention-based architectures. Here, simple means expressible with few hidden units, i.e., hidden neurons, convolutional kernels, or attention heads. Specifically, we show that linear networks learn solutions of increasing rank, ReLU networks learn solutions with an increasing number of kinks, convolutional networks learn solutions with an increasing number of convolutional kernels, and self-attention models learn solutions with an increasing number of attention heads. By analyzing fixed points, invariant manifolds, and dynamics of gradient descent learning, we show that saddle-to-saddle dynamics operates by iteratively evolving near an invariant manifold, approaching a saddle, and switching to another invariant manifold. Our analysis also illuminates the effects of data distribution and weight initialization on the duration and number of plateaus in learning, dissociating previously confounding factors. Overall, our theory offers a framework for understanding when and why gradient descent progressively learns increasingly complex solutions.

Summary

  • The paper reveals that saddle-to-saddle dynamics in loss landscapes explains how neural networks progress from simple to complex solutions via fixed point embeddings.
  • The paper demonstrates that invariant manifolds shape effective network width, causing stage-like training plateaus influenced by initialization and data spectra.
  • The paper unifies analyses across linear and nonlinear architectures, offering practical insights for optimizing design and training protocols.

Saddle-to-Saddle Dynamics and Simplicity Bias in Neural Network Training

Abstract and Context

This paper presents a comprehensive theoretical framework attributing the "simplicity bias" observed during neural network training to a universal mechanism termed saddle-to-saddle dynamics. By analyzing fixed points, invariant manifolds, and gradient descent trajectories for a general class of networks—including fully-connected, convolutional, and self-attention architectures—it elucidates why learning progresses toward increasingly complex solutions. The operative notion of simplicity here is tied to minimal expressivity: solutions that require only a few effective units (neurons, kernels, attention heads).

Fixed Point Embedding and Loss Landscape Structure

A principal result is that solutions learned by narrower networks are recursively embedded as fixed points (often saddles) in the loss landscapes of wider networks. Explicitly, Theorem 1 generalizes previous observations ([Fukumizu & Amari, 2000]) to show that for a large class of architectures, any fixed point of a width-(H-1) network gives rise to corresponding fixed points of a width-H network via natural weight extensions. The embedding constructions depend on activation properties (generic, zero map for certain weights, degree-1 homogeneity, linearity), but always preserve the functional map of the narrow network.
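A minimal sketch of one such embedding (the zero-output-weight extension; the two-layer ReLU setting and all variable names here are illustrative choices of ours, not taken from the paper):

```python
import numpy as np

def forward(W1, w2, X):
    """Two-layer ReLU network: f(x) = w2 . relu(W1 x)."""
    return np.maximum(X @ W1.T, 0) @ w2

rng = np.random.default_rng(0)
d, H = 3, 4
X = rng.normal(size=(10, d))

# Narrow network with H-1 hidden units.
W1 = rng.normal(size=(H - 1, d))
w2 = rng.normal(size=H - 1)

# Embed into a width-H network: append a unit whose output weight is zero
# (one natural weight extension; the new unit's input weights are free).
W1_wide = np.vstack([W1, rng.normal(size=(1, d))])
w2_wide = np.append(w2, 0.0)

# The functional map of the narrow network is preserved exactly.
print(np.allclose(forward(W1, w2, X), forward(W1_wide, w2_wide, X)))  # True
```

Because the extra unit contributes nothing to the output, every function (and in particular every fixed point) of the narrow network reappears inside the wider one's parameter space.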

This hierarchy produces nested saddles through which gradient descent trajectories pass sequentially, yielding the stage-like training plateaus and abrupt loss decreases commonly observed in practice. These stages correspond to solutions expressible by progressively wider subnetworks, thus giving a principled meaning to simplicity bias as the number of effective units required to implement the learned map.

Invariant Manifolds and Effective Network Width

The paper proves (Theorem 3) that invariant manifolds exist for this class of networks: subsets of parameter space on which the network implements a function expressible by a narrower subarchitecture. Such manifolds arise from constraints such as weight sharing, proportionality, or linear dependency among units. During training, the weights evolve along these invariant manifolds, maintaining lower effective functional complexity than the nominal width allows, until a transition is triggered that permits increased complexity.

This theoretical insight provides a geometric mechanism for incremental complexity acquisition in learning: by breaking constraints one-by-one, effective width increases, aligning with abrupt transitions in the loss landscape and corresponding jumps in expressivity.
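One concrete instance of such a manifold can be checked directly (a toy two-layer ReLU network of our own construction, not the paper's general setting): if two hidden units start as exact copies, gradient descent keeps them copies, so the network stays functionally equivalent to a narrower one.

```python
import numpy as np

rng = np.random.default_rng(1)
d, H, n = 3, 4, 20
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Initialize on an invariant manifold: hidden units 0 and 1 are exact copies.
W1 = rng.normal(size=(H, d))
a = rng.normal(size=H)
W1[1] = W1[0]
a[1] = a[0]

lr = 0.01
for _ in range(100):
    h = np.maximum(X @ W1.T, 0)                      # (n, H) hidden activations
    err = h @ a - y                                  # residuals
    grad_a = h.T @ err / n
    grad_W1 = ((err[:, None] * (h > 0)) * a).T @ X / n
    a -= lr * grad_a
    W1 -= lr * grad_W1

# Duplicated units receive identical gradients, so they remain copies:
# the width-H network still implements a width-(H-1) function.
print(np.allclose(W1[0], W1[1]) and np.allclose(a[0], a[1]))  # True
```

Escaping such a manifold requires a perturbation that breaks the constraint, which is exactly the transition event the theory associates with a jump in effective width.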

Mechanisms of Saddle-to-Saddle Dynamics

Linear Networks: Timescale Separation by Data Correlation

For networks linear in the weights, the dynamics in early training are governed by the singular values of the input-output data correlation matrix. Distinct singular values produce a timescale separation whereby some directions in weight space grow rapidly, aligning with dominant data directions, while others grow slowly. This timescale separation causes plateaus: the weights remain near invariant manifolds defined by low-rank (e.g., rank-one) solutions until a subdominant direction grows large enough to permit escape to higher-rank manifolds and more complex function classes. This mechanism is formalized in Theorem 4, which gives a precise description of the rate at which network weights align with the top singular vectors.
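A small simulation in the spirit of this analysis (a two-layer linear network fit to a correlation matrix with well-separated singular values; the setup and constants are ours, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3

# Target input-output correlation matrix with well-separated singular values.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
Sigma = U @ np.diag([3.0, 1.0, 0.3]) @ V.T

# Two-layer linear network W2 @ W1 with small initialization
# (inputs assumed whitened, so the loss is 0.5 * ||Sigma - W2 W1||_F^2).
scale = 1e-4
W1 = scale * rng.normal(size=(d, d))
W2 = scale * rng.normal(size=(d, d))

lr = 0.05
snapshots = {}
for t in range(1, 4001):
    E = Sigma - W2 @ W1
    dW1, dW2 = W2.T @ E, E @ W1.T
    W1 += lr * dW1
    W2 += lr * dW2
    if t in (100, 400, 4000):
        snapshots[t] = np.linalg.svd(W2 @ W1, compute_uv=False)

for t, sv in snapshots.items():
    print(t, np.round(sv, 3))
# Singular modes of W2 @ W1 are recovered one at a time, largest first:
# each plateau ends when the next mode's direction has grown enough to
# escape its saddle.
```

At the early snapshot only the dominant mode has been learned; the second and third modes escape their saddles later, in order of singular value, reproducing the stage-like loss curve.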

Quadratic and Higher-Degree Networks: Timescale Separation by Initialization

For architectures where the activation is a homogeneous polynomial of degree p ≥ 2 (e.g., quadratic, as in self-attention and certain polynomial networks), the relevant separation is between units, not directions. With small isotropic initialization, the unit with the largest initial norm grows much faster than the rest and comes to dominate, driving the network to operate as if it were narrower. Only as subsequent units break out of their initial constraints does the effective width increase. The timing and presence of plateaus depend on the spread of initial values. Proposition 5 details this "rich-get-richer" effect under quadratic dynamics.
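A toy illustration of the rich-get-richer effect (our own minimal model in the spirit of Proposition 5, not the paper's exact setting): each unit contributes a_i = b_i^2, so its gradient is proportional to b_i and growth is multiplicative, favoring the largest initialization.

```python
import numpy as np

# Each unit contributes a_i = b_i^2 >= 0; the summed output is fit to a
# scalar target y = 1 by gradient descent on 0.5 * (y - f)^2.
b = np.array([1e-2, 1e-3, 1e-4])   # small init with a spread of scales
y = 1.0
lr = 0.05

for _ in range(5000):
    f = np.sum(b**2)
    b += lr * 2 * b * (y - f)      # db_i/dt is proportional to b_i

a = b**2
print(np.round(a, 4))
# Growth is multiplicative, so the unit with the largest initialization
# reaches order 1 first; once it fits the target the residual vanishes
# and the remaining units freeze at a much smaller scale.
```

Because all units share the same multiplicative growth factor, final amplitudes are proportional to initial ones: the largest-initialized unit absorbs almost the entire target, and the network behaves as if it had a single effective unit.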

For general nonlinearity, a Taylor expansion about zero initialization identifies the lowest-order term as determining early dynamics, possibly resulting in either timescale separations or smooth convergence depending on architectural and initialization specifics.

Predictive Implications for Architecture and Training Protocols

The framework is validated and refined through both theoretical analysis and simulations:

  • Network Width: For linear networks, increasing width has negligible impact provided all required directions are covered; for quadratic/self-attention architectures, increasing width shortens plateaus because the initialization gap between the largest unit and the rest narrows.
  • Data Spectrum: Plateaus are governed by the singular value structure of the data. Power-law decay of singular values leads to identifiable plateau durations.
  • Initialization: Small or structured initialization (e.g., low-rank) promotes pronounced saddle-to-saddle dynamics; larger or more isotropic initialization shortens or eliminates plateaus and alters the landscape traversed by learning.
  • Deep Networks and Architectural Modifications: The embedding principle and invariant manifold arguments extend to multilayer networks, including cases with skip connections, which further modulate learning dynamics by effectively reducing network depth during certain training phases.
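As a back-of-envelope illustration of the power-law case (using the standard gradient-flow escape-time estimate t_k ≈ ln(s_k / a_0) / (2 s_k) for a two-layer linear network; the formula's exact form and the constants are our assumption, not quoted from the paper):

```python
import numpy as np

a0 = 1e-8                                          # initial mode strength (small init)
alpha = 1.0                                        # power-law exponent of the spectrum
s = np.array([1 / k**alpha for k in range(1, 6)])  # singular values s_k

# Estimated time for mode k to escape its saddle under gradient flow.
t_escape = np.log(s / a0) / (2 * s)
for k, (sk, tk) in enumerate(zip(s, t_escape), 1):
    print(f"mode {k}: s={sk:.3f}  escape time ~ {tk:.1f}")
# Smaller singular values mean longer plateaus: successive escape times
# grow roughly like k**alpha, yielding an identifiable plateau schedule.
```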

An unexpected result is that exponential loss decay in some regimes does not necessarily imply lazy learning; the theory explains such behavior in networks initialized directly on invariant manifolds of appropriate width.

Relation to Inductive Bias, Simplicity, and Generalization

The dynamical simplicity bias described here complements but is distinct from the stationary simplicity bias (the tendency for random weights or final training solutions to be simple because simple functions occupy a large volume of parameter space—see [Mingard et al., 2025]). The dynamical variant is process-dependent, governed by initialization, architecture, and data, leading to progressive recruitment of network expressivity. This bias is sometimes beneficial (favoring generalizable or sparse solutions), but it can also impede optimization or generalization when a more complex feature representation is required ([Shah et al., 2020], [Petrini et al., 2022]).

Furthermore, distinguishing distributed (polysemantic) from localized (monosemantic) feature learning is possible via the types of invariant manifolds and transition mechanisms the framework identifies.

Broader Theoretical and Practical Implications

The framework advances understanding of incremental learning dynamics, subsuming earlier results for individual architectures into a unified theory. It predicts when plateaued, stage-like learning should occur and provides diagnostic tools for assessing the effective width and complexity of solutions at each training phase.

The implications extend to the design of architectures and initialization schemes for improved sample efficiency, feature learning, and optimization stability. The underlying permutation symmetry exploited may also generalize to recurrent architectures, unsupervised learning, and alternative training rules.

Open questions concern the exhaustiveness of the classes of invariant manifolds and fixed points described, the Markovianity of saddle transitions, and the identification of new mechanisms in other architectures or learning regimes.

Conclusion

The paper presents a unified theoretical account for why gradient descent drives neural networks toward increasingly complex solutions via saddle-to-saddle transitions. By grounding simplicity bias in the geometry of loss landscapes—embedded fixed points and invariant manifolds—it enables precise prediction and control of learning dynamics across a range of neural architectures. The distinction between mechanisms for linear versus nonlinear architectures highlights the dependency on activation properties, data spectra, and initialization. The results suggest practical guidelines for architecture and training protocol design, and open rich avenues for further investigation into learning dynamics, representation formation, and inductive bias in deep learning.
