
How do simple rotations affect the implicit bias of Adam?

Published 27 Oct 2025 in cs.LG | (2510.23804v1)

Abstract: Adaptive gradient methods such as Adam and Adagrad are widely used in machine learning, yet their effect on the generalization of learned models -- relative to methods like gradient descent -- remains poorly understood. Prior work on binary classification suggests that Adam exhibits a ``richness bias,'' which can help it learn nonlinear decision boundaries closer to the Bayes-optimal decision boundary relative to gradient descent. However, the coordinate-wise preconditioning scheme employed by Adam renders the overall method sensitive to orthogonal transformations of feature space. We show that this sensitivity can manifest as a reversal of Adam's competitive advantage: even small rotations of the underlying data distribution can make Adam forfeit its richness bias and converge to a linear decision boundary that is farther from the Bayes-optimal decision boundary than the one learned by gradient descent. To alleviate this issue, we show that a recently proposed reparameterization method -- which applies an orthogonal transformation to the optimization objective -- endows any first-order method with equivariance to data rotations, and we empirically demonstrate its ability to restore Adam's bias towards rich decision boundaries.

Summary

  • The paper demonstrates that small orthogonal rotations can shift Adam’s bias from complex nonlinear to inferior linear decision boundaries.
  • The study employs both theoretical analysis and empirical experiments to reveal Adam’s vulnerability to data orientations compared to gradient descent.
  • The paper introduces an EGOP-based reparameterization that restores rotation equivariance, thereby improving Adam’s generalization performance.

The Effect of Simple Data Rotations on the Implicit Bias of Adam

Introduction

The paper "How do simple rotations affect the implicit bias of Adam?" (2510.23804) presents a detailed theoretical and empirical study on how orthogonal (rotation) transformations of the feature space impact the implicit bias of coordinate-wise adaptive gradient methods, particularly Adam, in the context of binary classification with two-layer ReLU neural networks. It extends prior analysis of optimizer-induced bias and reveals that Adam's well-documented bias towards nonlinear (richer) decision boundaries, claimed to often approach Bayes-optimality more closely than gradient descent (GD), is not invariant to seemingly innocuous data rotations. In fact, small rotations of the input can substantially degrade the generalization performance of Adam—often producing linear predictors with inferior risk relative to those found by GD, whose bias is orthogonally invariant. The paper further demonstrates that this fragility can be remedied by a data-driven reparameterization derived from the expected gradient outer product (EGOP) matrix, thereby restoring rotation equivariance to adaptive methods.

Implicit Bias of Adam and Sensitivity to Rotations

The "richness bias" of Adam and related adaptive methods has been characterized in recent theoretical work, showing that for binary classification tasks where the Bayes-optimal decision boundary is nonlinear, Adam optimizers converge to nonlinear predictors closer to the Bayes-optimal solution than those produced by standard (stochastic) gradient descent [vasudeva2025rich]. This effect is most pronounced in synthetic settings with carefully controlled data distributions, such as mixtures of labeled Gaussians, and network architectures with fixed second-layer weights.
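To make the setting concrete, the following is a minimal stand-in for this kind of synthetic experiment: labeled Gaussian mixture data whose Bayes-optimal boundary is nonlinear, classified by a two-layer ReLU network with fixed second-layer weights. The cluster centers, noise scale, and network width here are illustrative choices, not the paper's exact configuration.

```python
import random

def sample_xor_mixture(n, sigma=0.1, rng=None):
    """Toy mixture of labeled Gaussians whose Bayes-optimal decision
    boundary is nonlinear (XOR-like). Centers and sigma are illustrative,
    not the paper's exact values."""
    rng = rng or random.Random(0)
    centers = [((1, 1), +1), ((-1, -1), +1), ((1, -1), -1), ((-1, 1), -1)]
    data = []
    for _ in range(n):
        (cx, cy), y = rng.choice(centers)
        x = (cx + rng.gauss(0, sigma), cy + rng.gauss(0, sigma))
        data.append((x, y))
    return data

def two_layer_relu(W, a, x):
    """f(x) = sum_j a_j * relu(<w_j, x>), with the second-layer weights a
    held fixed, as in the setting described above."""
    return sum(a_j * max(0.0, w[0] * x[0] + w[1] * x[1])
               for w, a_j in zip(W, a))
```

Only the first-layer weights `W` are trained; holding `a` fixed matches the architectural restriction under which the richness bias was originally characterized.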

However, Adam's coordinate-wise adaptation is fundamentally non-equivariant to rotations in parameter or data space, due to its reliance on diagonal preconditioning. This lack of rotational equivariance means that the implicit solution found by Adam can change dramatically if the input is rotated, even by a small amount. The authors first replicate the canonical setup of [vasudeva2025rich], demonstrating that when γ = 0 (no rotation), Adam and SignGD (an Adam variant with β₁ = β₂ = 0) produce nonlinear boundaries approximating the Bayes predictor (Figure 1).

Figure 1: Data rotation γ = 0, learning with base algorithms. Adam and SignGD yield nonlinear, rich predictors compared to GD.
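The SignGD baseline used above (Adam with both momentum parameters set to zero) reduces, up to the ε term, to elementwise sign updates; a minimal sketch of the two update rules being compared, with illustrative vector-as-list parameters:

```python
def signgd_step(w, grad, lr):
    """SignGD: Adam with beta1 = beta2 = 0 reduces (up to epsilon) to
    w <- w - lr * sign(grad), applied coordinate-wise."""
    sgn = lambda g: (g > 0) - (g < 0)
    return [wi - lr * sgn(gi) for wi, gi in zip(w, grad)]

def gd_step(w, grad, lr):
    """Plain gradient descent for comparison; this update commutes with
    rotations of the coordinate system."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]
```

The coordinate-wise `sgn` in `signgd_step` is exactly the operation that ties the method to the axes of the input representation.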

Upon the introduction of a small rotation (γ = π/32), Adam's and SignGD's bias abruptly shifts: both yield linear decision boundaries (specifically, x₁ = −x₂), and the learned classifier generalizes worse than the boundary found by gradient descent, which remains optimal among linear predictors and rotates together with the data distribution (Figure 2).

Figure 2: Data rotation γ = π/32, Adam/SignGD, base algorithms. Rotations of the dataset cause Adam's richness bias to fail and a linear, sub-optimal decision boundary to emerge.
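The mechanism behind this shift can be seen in one line of arithmetic: the coordinate-wise sign map does not commute with rotation, so sign(Rg) and R·sign(g) disagree whenever a coordinate of the gradient sits near zero. A small numeric illustration (the gradient value and angle here are chosen for illustration, not taken from the paper):

```python
import math

def rotate(v, gamma):
    """Rotate a 2-D vector by angle gamma."""
    c, s = math.cos(gamma), math.sin(gamma)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def sign(v):
    """Coordinate-wise sign, as in SignGD."""
    sgn = lambda g: float((g > 0) - (g < 0))
    return (sgn(v[0]), sgn(v[1]))

# A gradient with one coordinate near zero: the small rotation gamma = pi/32
# flips that coordinate's sign, so sign(R g) != R sign(g).
g = (1.0, -0.01)
gamma = math.pi / 32
lhs = sign(rotate(g, gamma))   # sign of the rotated gradient: (1.0, 1.0)
rhs = rotate(sign(g), gamma)   # rotation of the sign: not a sign vector at all
```

Because the update direction changes discontinuously with the input basis, a rotation as small as π/32 is enough to redirect the entire optimization trajectory.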

These results are formalized via an explicit construction of a class of data distributions and a detailed analysis of the SignGD update rule. The paper shows analytically that the gradient's sign structure, under mild assumptions on the rotation and data parameters, forces convergence to an axis-aligned linear solution (see Theorem 1 and its supporting Lemmas).

Theoretical Results: Invariance and Sensitivity

The crucial theoretical contribution is twofold. First, it demonstrates that Adam's nonlinear decision bias is not invariant under orthogonal rotations: for any nontrivial rotation, there exists a data distribution for which Adam returns a linear classifier whose generalization is strictly inferior to that of the Bayes classifier and, in many instances, to that of GD. Second, the authors show that this vulnerability is not a mere artifact of the idealized setting but persists in practical neural architectures, including networks with bias parameters and trainable outer layers (see empirical figures and supplementary analysis). The lack of rotational equivariance in Adam and other coordinate-wise methods thus represents a substantial fragility in geometric robustness and implicit regularization.

EGOP-Reparameterization: Rotation-Equivariant Adaptive Optimization

To address this issue, the authors propose leveraging a linear reparameterization of the model parameters informed by the EGOP matrix—a covariance-type object capturing average gradient outer products under the training data distribution. This EGOP-based transformation orthogonalizes the optimization landscape, enabling adaptive methods to operate in a parameter basis where rotations in data space do not affect the trajectory or implicit bias of the optimizer. The reparameterization framework is also theoretically justified: any deterministic first-order method, when applied to the EGOP-reparameterized problem, becomes equivariant to rotations. This restoration of rotational equivariance translates, empirically, to Adam recovering its richness bias for all rotation angles γ (Figure 3 and corresponding generalization bounds).

Figure 3: Nonlinear boundaries returned by EGOP-reparameterized Adam are equivariant under rotation and robustly approximate the Bayes-optimal solution across rotation angles.
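A minimal 2-D sketch of the idea, under the assumption that the EGOP is estimated by a Monte-Carlo average of sampled gradient outer products and that optimization then proceeds in its eigenbasis (the function names and the 2×2 restriction are illustrative, not the paper's implementation):

```python
import math

def egop(grads):
    """Monte-Carlo estimate of the EGOP matrix M = E[g g^T] in 2-D,
    from a list of sampled gradients. Returns the entries (a, b, c)
    of the symmetric matrix [[a, b], [b, c]]."""
    n = len(grads)
    a = sum(g[0] * g[0] for g in grads) / n
    b = sum(g[0] * g[1] for g in grads) / n
    c = sum(g[1] * g[1] for g in grads) / n
    return a, b, c

def egop_basis(a, b, c):
    """Eigenbasis U of the symmetric 2x2 EGOP, via the closed-form
    rotation angle for 2x2 symmetric matrices. Running a coordinate-wise
    optimizer on v = U^T w (i.e., on the objective w -> f(U v)) is what
    makes the overall method equivariant to data rotations."""
    theta = 0.5 * math.atan2(2 * b, a - c)
    return ((math.cos(theta), -math.sin(theta)),
            (math.sin(theta),  math.cos(theta)))
```

Because the EGOP itself rotates with the data, the basis `U` tracks any rotation applied to the inputs, which is the source of the claimed equivariance.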

Related algorithms including Shampoo and SOAP, which employ richer preconditioners, also fall within this invariance framework, as evidenced in the extended discussion and appendices.

Empirical Validation and Practical Observations

Numerical experiments corroborate the theoretical analysis. The robustness of the failure modes and the corrective power of EGOP-reparameterization is confirmed across a range of architectural and data-modeling details, including networks with bias, different hidden dimensions, and estimates of the EGOP matrix from finite samples. Notably, increasing Adam's ε parameter (intended to improve numerical stability) does not, by itself, restore the richness bias in the absence of reparameterization, as shown in further figures and ablation studies.

Figure 4: Empirical demonstration that varying ε in Adam/SignGD without reparameterization fails to restore rotation invariance or nonlinearity in learned decision boundaries.
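One way to see why tuning ε alone cannot help: in a momentum-free Adam-style step the ε-smoothed update g / (|g| + ε) interpolates between sign(g) (small ε) and a rescaled g (large ε), but it is applied coordinate-wise either way, so the dependence on the input axes never disappears. This is an illustrative simplification of the update, not the full Adam recursion with moment estimates:

```python
def adam_like_step(w, g, lr, eps):
    """One Adam-style step with m = g and v = g^2 (no momentum averaging).
    The update g / (|g| + eps) tends to sign(g) as eps -> 0 and to g / eps
    as eps grows, but remains coordinate-wise for every eps, so no choice
    of eps makes the method equivariant to rotations."""
    return [wi - lr * gi / (abs(gi) + eps) for wi, gi in zip(w, g)]
```

With tiny ε the step below behaves like SignGD; with huge ε it behaves like heavily damped gradient descent, yet both regimes are computed axis by axis.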

Broader Implications and Future Directions

This paper highlights that Adam's celebrated ability to find complex, highly expressive predictors is highly contingent on the coordinate system in which data is presented. In practical learning scenarios where real-world data is rarely axis-aligned, this sensitivity represents a theoretical and applied limitation for coordinate-wise adaptive optimization. The results motivate further scrutiny of optimizer–data interaction beyond the narrow context of SGD vs. Adam, pushing toward a richer geometric understanding and the use of structure-aware or data-adaptive preconditioners for robust generalization.

Rotation-equivariant variants such as EGOP-reparameterized Adam (and analogously, Shampoo, SOAP) suggest a general procedure for removing undesirable artifacts induced by coordinate system choices. Future research directions include extending such equivariant frameworks to more general classes of transformations beyond rotations (e.g., adversarially chosen input representations), investigating their effects in overparametrized or highly corrupted regimes, and understanding the trade-offs with optimization efficiency and expressivity in large-scale neural architectures.

Conclusion

The work rigorously demonstrates that Adam's implicit bias towards complex, non-linear decision boundaries is not a generic property; rather, it is a fragile artifact of input axis alignment. Even minor data rotations can provoke substantial failures in generalization, contradicting prior assumptions about the optimizer's suitability for general machine learning scenarios. EGOP-based reparameterization and similar structure-aware approaches provide a principled and practical route to curing this sensitivity, equipping adaptive gradient methods with geometric invariance and preserving their desirable inductive biases. The results demand careful consideration of both algorithmic design and data representation in modern neural network training.
