Orthogonal-Gradient AdamW Optimizer
- The method decouples correlated gradient updates by projecting new gradients onto the orthogonal complement of prior gradient directions.
- It employs a radial-tangential decomposition that separates parameter norm growth from feature learning to reduce destabilizing oscillations.
- Empirical results show substantial performance gains in streaming video and high-correlation scenarios compared to standard AdamW.
Orthogonal-Gradient AdamW refers to a class of optimizer modifications that extend AdamW by decorrelating or decomposing the update direction along geometric subspaces defined by the structure of recent gradients or parameter-space orientation. Two principal and independently developed approaches exemplify this design: the orthogonal-gradient method proposed for streaming video learning (Han et al., 2 Apr 2025) and the decoupled radial-tangential dynamics in AdamO (Chen et al., 4 Feb 2026). Both address fundamental limitations of standard AdamW (redundant, correlated gradients in the first case; entangled norm and direction dynamics in the second) via distinct projection frameworks.
1. Geometric Motivation and Setting
AdamW, as with other adaptive optimizers, assumes successive mini-batch gradients are sufficiently independent, an assumption violated in streaming and sequential-data regimes. In these contexts, adjacent batches are highly correlated, yielding redundant gradient directions that cause inefficiency and, in some cases, the collapse of representation learning performance. The critical insight motivating orthogonal-gradient modifications is that the serial correlation of updates can be systematically attenuated by projecting the new gradient onto the orthogonal complement of the dominant subspace spanned by recent trajectories.
The radial-tangential decomposition presented in AdamO formalizes a complementary geometric principle. In deep networks, raw gradients often exhibit a significant component aligned with the current parameter vector, i.e., a radial direction responsible for norm growth, while the remaining tangential part governs feature learning. In AdamW, isotropic weight decay and unstructured preconditioning induce a "radial tug-of-war," leading to destabilizing oscillations that contaminate the variance estimate and hinder effective learning (Chen et al., 4 Feb 2026). Orthogonal projections enable a strict separation of these dynamics.
2. Orthogonal Gradient Construction: Mathematical Derivation
Let $\theta_t$ denote the model parameters at iteration $t$. Given a mini-batch $B_t$, compute the raw gradient $g_t = \nabla_\theta \mathcal{L}(\theta_t; B_t)$. In streaming data, $g_t$ aligns closely with prior gradients. The orthogonal-gradient approach introduces an Exponential Moving Average (EMA) of gradients:

$$c_t = \beta \, c_{t-1} + (1 - \beta) \, g_t$$

The update is then projected off $c_t$:

$$u_t = g_t - \frac{\langle g_t, c_t \rangle}{\|c_t\|^2} \, c_t$$
- If $g_t$ is orthogonal to $c_t$, then $u_t = g_t$ (the projection is a no-op).
- If $g_t$ is collinear with $c_t$, then $u_t = 0$ (the fully redundant update is suppressed).
In the AdamO scheme, gradients are instead decomposed relative to the parameter vector $\theta_t$ as $g_t = g_t^{\parallel} + g_t^{\perp}$, where $g_t^{\parallel} = \frac{\langle g_t, \theta_t \rangle}{\|\theta_t\|^2} \, \theta_t$ is the radial component and $g_t^{\perp} = g_t - g_t^{\parallel}$ the tangential component, with adaptive step and moment statistics computed separately for each subspace (Chen et al., 4 Feb 2026).
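Both projections above are a few lines of code each. A minimal NumPy sketch of the radial-tangential split (variable names are illustrative, not from the paper):

```python
import numpy as np

def radial_tangential(g, theta, eps=1e-12):
    """Split a gradient into a radial part (aligned with theta, driving
    norm growth) and a tangential part (orthogonal to theta, driving
    feature learning)."""
    scale = np.dot(g, theta) / (np.dot(theta, theta) + eps)
    g_radial = scale * theta
    g_tangential = g - g_radial
    return g_radial, g_tangential

theta = np.array([3.0, 4.0])
g = np.array([1.0, 2.0])
g_r, g_t = radial_tangential(g, theta)
assert np.allclose(g_r + g_t, g)                 # parts recompose g
assert abs(float(np.dot(g_t, theta))) < 1e-6     # parts are orthogonal
```

The same pattern with $\theta_t$ replaced by the gradient EMA $c_t$ yields the streaming orthogonal-gradient projection $u_t$.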
3. Modified AdamW Update Rules
Orthogonal-Gradient AdamW modifies the standard update sequence by integrating orthogonalization as described above. The standard AdamW per-step procedure is:
- Moment estimates: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ and $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
- Bias correction: $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$
- Decoupled weight decay: $\theta_t \leftarrow \theta_t - \eta \lambda \theta_t$
- Update: $\theta_{t+1} = \theta_t - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
In the orthogonal-gradient variant (Han et al., 2 Apr 2025):
- The raw gradient $g_t$ in the moment estimates is replaced by the projected gradient $u_t$.
- The gradient EMA $c_t$ is always updated with the raw $g_t$, not $u_t$, so the estimate of the dominant gradient direction is not biased by the projection itself.
In AdamO (Chen et al., 4 Feb 2026), updates operate on the decomposed subspaces:
- Separate first-moment EMAs for the radial and tangential components ($m_t^{\parallel}$ and $m_t^{\perp}$)
- Second-moment estimate and Adam-style preconditioning only for the tangential component $g_t^{\perp}$
- SGD-style or curvature-adapted step size for the radial component $g_t^{\parallel}$
- Pure radial decay: weight decay contracts the parameters only along the radial direction $\theta_t / \|\theta_t\|$.
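Under these rules, one AdamO-style step might be sketched as follows. This is an illustrative reading of the bullets above, not the paper's exact algorithm: bias correction is omitted, and the function and state names are assumptions made here.

```python
import numpy as np

def adamo_like_step(theta, g, state, eta=1e-3, eta_r=1e-3,
                    beta1=0.9, beta2=0.999, lam=1e-2, eps=1e-8):
    """Illustrative decoupled radial/tangential update (not the paper's
    exact algorithm; bias correction is omitted for brevity)."""
    r = theta / (np.linalg.norm(theta) + eps)    # radial unit vector
    g_rad = np.dot(g, r) * r                     # norm-growth component
    g_tan = g - g_rad                            # feature-learning component

    # Separate first moments per subspace; preconditioning only tangentially.
    state["m_rad"] = beta1 * state["m_rad"] + (1 - beta1) * g_rad
    state["m_tan"] = beta1 * state["m_tan"] + (1 - beta1) * g_tan
    state["v_tan"] = beta2 * state["v_tan"] + (1 - beta2) * g_tan ** 2

    theta = theta - eta * state["m_tan"] / (np.sqrt(state["v_tan"]) + eps)
    theta = theta - eta_r * state["m_rad"]       # plain SGD-like radial step
    # Weight decay contracts the parameter vector itself, which (to first
    # order) lies along r, so the decay is purely radial.
    theta = theta - eta_r * lam * theta
    return theta, state
```

A usage sketch: initialize `state` with `m_rad`, `m_tan`, `v_tan` as zero vectors and call `adamo_like_step` once per mini-batch.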
A summarized pseudocode for the approach in (Han et al., 2 Apr 2025):
```python
# State: c (gradient EMA), m, v are vectors initialized to zeros_like(theta).
for t in range(1, T + 1):
    g = gradient(theta)                      # raw mini-batch gradient
    c = beta * c + (1 - beta) * g            # EMA is updated with the raw g
    if np.dot(c, c) > 1e-12:                 # skip projection while c is ~zero
        g = g - (np.dot(g, c) / np.dot(c, c)) * c
    m = beta1 * m + (1 - beta1) * g          # first moment uses projected g
    v = beta2 * v + (1 - beta2) * (g * g)    # second moment uses projected g
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * lam * theta               # decoupled weight decay
    theta -= eta * m_hat / (np.sqrt(v_hat) + epsilon)
```
4. Theoretical Properties and Geometric Interpretation
The orthogonal-gradient mechanism serves as a dynamic whitening operation, continually projecting out the redundant signal represented by the dominant subspace of past gradients. For streaming video, this mitigates the adverse effects of non-independent mini-batches, forcing updates to focus on "novel" directions, thus preserving the learning signal necessary for generalization. The approach degrades gracefully to conventional AdamW in IID conditions since the projection has negligible effect when gradient correlation is weak.
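The graceful-degradation claim can be checked numerically: projecting each gradient off an EMA of its predecessors removes almost nothing from an IID stream but most of the norm from a strongly correlated one. This is a toy check with synthetic gradients, not an experiment from the paper:

```python
import numpy as np

def projected_fraction_removed(grads, beta=0.9):
    """Average fraction of gradient norm removed by projecting each
    gradient off the running EMA of the previous gradients."""
    c = np.zeros_like(grads[0])
    removed = []
    for g in grads:
        if np.dot(c, c) > 1e-12:
            u = g - (np.dot(g, c) / np.dot(c, c)) * c
            removed.append(1.0 - np.linalg.norm(u) / np.linalg.norm(g))
        c = beta * c + (1 - beta) * g
    return float(np.mean(removed))

rng = np.random.default_rng(0)
d, n = 1000, 200
iid = rng.normal(size=(n, d))                 # uncorrelated stream
base = rng.normal(size=d)
corr = base + 0.1 * rng.normal(size=(n, d))   # strongly correlated stream

assert projected_fraction_removed(iid) < 0.05   # near-identity projection
assert projected_fraction_removed(corr) > 0.5   # most of the signal is redundant
```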
The AdamO framework applies similar geometric reasoning at the parameter level, eliminating radial oscillations that contaminate tangential preconditioning. Because radial and tangential directions are treated independently, noisy norm fluctuations no longer suppress genuine feature learning updates in the tangential subspace, yielding smoother and more effective learning dynamics (Chen et al., 4 Feb 2026).
5. Empirical Performance, Trade-Offs, and Overhead
Empirical results demonstrate that Orthogonal-Gradient AdamW provides substantial gains in settings characterized by strong temporal correlation between batches:
- DoRA single-video pretraining: Standard AdamW fails (ImageNet top-1 6%, kNN 1.8%), while Orthogonal-Gradient AdamW achieves ImageNet linear-probe 64.5% and kNN 51.8% (Han et al., 2 Apr 2025).
- VideoMAE on multi-video datasets: Orthogonal-Gradient AdamW outperforms AdamW across both shuffled and temporally sequential training (e.g., on Something-Something-V2, sequential: AdamW 16.4, Orthogonal-AdamW 18.4).
- Future-frame prediction: Consistent improvements in MSE and PSNR over AdamW and RMSProp.
The computational overhead consists of a single additional vector for the gradient EMA and a per-step projection, which is insignificant relative to the cost of forward and backward passes. No replay-buffer or sample shuffling is required, and the framework integrates easily with existing optimizer code. Similar efficiency is reported for AdamO, where only a small number of additional moment statistics and projections are maintained.
6. Extensions: Decoupled Orthogonal Dynamics and Adaptive Radial Stepping
AdamO (Chen et al., 4 Feb 2026) generalizes the orthogonal-gradient paradigm with further refinements:
- Adaptive step sizing in the radial direction via a curvature proxy that tracks the squared change in gradients, slowing learning in high-curvature regions.
- Architecture-aware modifications handle low-dimensional or scale-invariant parameters by defaulting to standard Adam behavior where geometry-aware updates are unnecessary.
- Broader tuning insensitivity: AdamO's high-performing hyperparameter region is substantially wider than AdamW's.
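The curvature proxy in the first bullet can be sketched as an EMA of squared gradient changes that throttles the radial learning rate. The exact form used by AdamO may differ; the names and the normalization below are illustrative choices made here:

```python
import numpy as np

def radial_step_size(g, g_prev, state, eta_r=1e-3, beta=0.99):
    """Illustrative curvature-adapted radial step size: an EMA of the
    squared change in gradients acts as a cheap curvature proxy, and
    the radial learning rate shrinks where the proxy is large."""
    diff = g - g_prev
    state["h"] = beta * state["h"] + (1 - beta) * float(np.dot(diff, diff))
    # The +1 keeps the step bounded by eta_r in flat regions (a choice
    # made here for illustration, not taken from the paper).
    return eta_r / (1.0 + np.sqrt(state["h"]))
```

With unchanged gradients the proxy stays near zero and the step stays near `eta_r`; large step-to-step gradient changes shrink it, slowing radial learning in high-curvature regions.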
Experiments confirm enhanced generalization and stability in both vision (e.g., CIFAR-100) and algorithmic tasks (grokking on modular addition), with lower gradient-norm fluctuations and smoother loss trajectories, demonstrating the importance of decoupled orthogonal dynamics for regularizing adaptive optimizers (Chen et al., 4 Feb 2026).
7. Relation to Broader Optimizer Design
The incorporation of orthogonal-gradient projections situates these variants within a broader trend toward geometry-aware optimization. These techniques reflect increasing recognition that both the redundancy in gradient statistics (as in streaming domains) and the entanglement of parameter norm and direction (as in deep representations) undermine effective learning when left unchecked. Methods such as Orthogonal-Gradient AdamW and AdamO bypass these obstacles via principled subspace projections and decoupling, offering a tractable and broadly applicable enhancement to the class of adaptive optimizers built around AdamW. Their empirical and theoretical properties suggest that similar geometric interventions may benefit a range of applications where gradient statistics deviate from the IID assumption, or where expressive feature learning contends with destabilizing dynamics in parameter norms (Han et al., 2 Apr 2025, Chen et al., 4 Feb 2026).