
Orthogonal-Gradient AdamW Optimizer

Updated 19 February 2026
  • The method decouples correlated gradient updates by projecting new gradients onto the orthogonal complement of prior gradient directions.
  • It employs a radial-tangential decomposition that separates parameter norm growth from feature learning to reduce destabilizing oscillations.
  • Empirical results show substantial performance gains in streaming video and high-correlation scenarios compared to standard AdamW.

Orthogonal-Gradient AdamW refers to a class of optimizer modifications that extend AdamW by decorrelating or decomposing the update direction along geometric subspaces defined by the structure of recent gradients or parameter space orientation. Two principal and independently developed approaches exemplify this design: the orthogonal-gradient method proposed for streaming video learning (Han et al., 2 Apr 2025) and the decoupled radial-tangential dynamics in AdamO (Chen et al., 4 Feb 2026). Both address the fundamental limitations of standard AdamW in high-correlation settings but via distinct projection frameworks.

1. Geometric Motivation and Setting

AdamW, as with other adaptive optimizers, assumes successive mini-batch gradients are sufficiently independent, an assumption violated in streaming and sequential-data regimes. In these contexts, adjacent batches are highly correlated, yielding redundant gradient directions that cause inefficiency and, in some cases, the collapse of representation learning performance. The critical insight motivating orthogonal-gradient modifications is that the serial correlation of updates can be systematically attenuated by projecting the new gradient onto the orthogonal complement of the dominant subspace spanned by recent trajectories.

The radial-tangential decomposition presented in AdamO formalizes a complementary geometric principle. In deep networks, raw gradients often exhibit a significant component aligned with the current parameter vector, i.e., a radial direction responsible for norm growth, while the remaining tangential part governs feature learning. In AdamW, isotropic weight decay and unstructured preconditioning induce a "radial tug-of-war," leading to destabilizing oscillations that contaminate the variance estimate and hinder effective learning (Chen et al., 4 Feb 2026). Orthogonal projections enable a strict separation of these dynamics.

2. Orthogonal Gradient Construction: Mathematical Derivation

Let $\theta_t \in \mathbb{R}^d$ denote the model parameters at iteration $t$. Given a mini-batch $B_t$, compute the raw gradient $g_t = \nabla_\theta L(\theta_{t-1}; B_t)$. In streaming data, $g_t$ aligns closely with prior gradients. The orthogonal-gradient approach introduces an exponential moving average (EMA) of gradients:

$$c_t = \beta c_{t-1} + (1-\beta)\, g_t, \qquad c_0 = 0, \qquad 0 \leq \beta < 1.$$

The update is then projected off $c_{t-1}$:

$$u_t = g_t - \frac{g_t^\top c_{t-1}}{\|c_{t-1}\|^2}\, c_{t-1}.$$

  • If $g_t$ is orthogonal to $c_{t-1}$, then $u_t \approx g_t$.
  • If $g_t$ is collinear with $c_{t-1}$, then $u_t \approx 0$.
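The two limiting cases above can be checked numerically. The sketch below is illustrative (the function name and the cold-start guard are this sketch's conventions, not the paper's):

```python
import numpy as np

def project_off(g, c):
    """Remove from g its component along the EMA direction c (u_t above)."""
    denom = np.dot(c, c)
    if denom == 0.0:  # cold start: c_0 = 0, pass the gradient through unchanged
        return g
    return g - (np.dot(g, c) / denom) * c

c = np.array([1.0, 0.0, 0.0])       # stands in for c_{t-1}

g_orth = np.array([0.0, 2.0, 1.0])  # orthogonal to c -> passes through unchanged
g_coll = np.array([3.0, 0.0, 0.0])  # collinear with c -> fully removed

print(project_off(g_orth, c))  # [0. 2. 1.]
print(project_off(g_coll, c))  # [0. 0. 0.]
```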

In the AdamO scheme, gradients are instead decomposed relative to the parameter vector $w$:

$$g^\rho = \frac{\langle g, w \rangle}{\langle w, w \rangle}\, w, \qquad g^\theta = g - g^\rho,$$

with adaptive step and moment statistics computed separately for each subspace (Chen et al., 4 Feb 2026).
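This decomposition has two properties worth verifying directly: the two parts reconstruct $g$ exactly, and $g^\theta$ is orthogonal to $w$. A minimal check (variable names are illustrative):

```python
import numpy as np

def radial_tangential(g, w):
    """Split g into a radial part along w and a tangential remainder."""
    g_rho = (np.dot(g, w) / np.dot(w, w)) * w  # component along the parameter vector
    g_theta = g - g_rho                        # orthogonal complement
    return g_rho, g_theta

rng = np.random.default_rng(0)
w = rng.normal(size=8)
g = rng.normal(size=8)

g_rho, g_theta = radial_tangential(g, w)
assert np.allclose(g_rho + g_theta, g)  # exact reconstruction of g
assert abs(np.dot(g_theta, w)) < 1e-9   # tangential part is orthogonal to w
```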

3. Modified AdamW Update Rules

Orthogonal-Gradient AdamW modifies the standard update sequence by integrating orthogonalization as described above. The standard AdamW per-step procedure is:

  1. $g_t = \nabla_\theta L(\theta_{t-1}; B_t)$
  2. $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$
  3. $v_t = \beta_2 v_{t-1} + (1-\beta_2)\, (g_t \circ g_t)$
  4. Bias correction: $\hat m_t = m_t / (1 - \beta_1^t)$, $\hat v_t = v_t / (1 - \beta_2^t)$
  5. Weight decay: $\theta_{t-1} \gets \theta_{t-1} - \eta\lambda\, \theta_{t-1}$
  6. Update: $\theta_t = \theta_{t-1} - \eta\, \dfrac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$
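For reference, the six steps above can be condensed into a self-contained NumPy sketch of one AdamW iteration (a simplification: real implementations maintain these states per tensor, and hyperparameter defaults here are conventional, not from the papers):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999,
               lam=1e-2, eps=1e-8):
    """One decoupled-weight-decay Adam (AdamW) step, following steps 1-6 above."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (step 2)
    v = beta2 * v + (1 - beta2) * grad * grad     # second moment (step 3)
    m_hat = m / (1 - beta1 ** t)                  # bias correction (step 4)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * lam * theta             # decoupled weight decay (step 5)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # update (step 6)
    return theta, m, v

theta, m, v = np.ones(4), np.zeros(4), np.zeros(4)
theta, m, v = adamw_step(theta, np.full(4, 0.5), m, v, t=1)
```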

In the orthogonal-gradient variant (Han et al., 2 Apr 2025):

  • $g_t$ in steps (2)–(4) is replaced by $u_t$.
  • The EMA $c_t$ is always updated with the raw $g_t$, not $u_t$.

In AdamO (Chen et al., 4 Feb 2026), updates operate on the decomposed subspaces:

  • Separate first-moment EMAs for $g^\rho$ and $g^\theta$ ($m_t^\rho$, $m_t^\theta$)
  • Second-moment estimation and Adam-style preconditioning only for $g^\theta$
  • SGD or a curvature-adapted step size for $g^\rho$
  • Pure radial decay: weight decay contracts only along $w$.
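The bullet points above can be sketched as a single update. This is an illustrative reading of the scheme, not AdamO's reference implementation; the exact radial step rule and hyperparameters are assumptions of this sketch:

```python
import numpy as np

def adamo_like_step(w, g, m_rho, m_theta, v_theta, t, eta=1e-3,
                    beta1=0.9, beta2=0.999, lam=1e-2, eps=1e-8):
    """Illustrative radial/tangential decoupled step (details assumed, not AdamO's)."""
    # Decompose the gradient relative to the parameter vector w.
    g_rho = (np.dot(g, w) / np.dot(w, w)) * w
    g_theta = g - g_rho
    # Separate first-moment EMAs per subspace.
    m_rho = beta1 * m_rho + (1 - beta1) * g_rho
    m_theta = beta1 * m_theta + (1 - beta1) * g_theta
    # Second moment and Adam-style preconditioning only for the tangential part.
    v_theta = beta2 * v_theta + (1 - beta2) * g_theta * g_theta
    m_theta_hat = m_theta / (1 - beta1 ** t)
    v_theta_hat = v_theta / (1 - beta2 ** t)
    w = (w
         - eta * m_rho                                        # plain SGD radial step
         - eta * lam * w                                      # pure radial decay along w
         - eta * m_theta_hat / (np.sqrt(v_theta_hat) + eps))  # tangential Adam step
    return w, m_rho, m_theta, v_theta
```

Note that the decay term $-\eta\lambda w$ is purely radial by construction, so it contracts the norm without perturbing the tangential (feature-learning) direction.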

Summarized pseudocode for the approach of (Han et al., 2 Apr 2025):

for t in range(1, T + 1):
    g = gradient(theta)
    # Project off the previous EMA c_{t-1}; at the cold start (c = 0), u = g.
    if dot(c, c) > 0:
        u = g - (dot(g, c) / dot(c, c)) * c
    else:
        u = g
    c = beta * c + (1 - beta) * g      # EMA is updated with the raw gradient, not u
    m = beta1 * m + (1 - beta1) * u
    v = beta2 * v + (1 - beta2) * (u * u)
    m_hat, v_hat = bias_correction(m, v, t)
    theta -= eta * lam * theta         # decoupled weight decay (lam = lambda)
    theta -= eta * m_hat / (sqrt(v_hat) + epsilon)
(Han et al., 2 Apr 2025).

4. Theoretical Properties and Geometric Interpretation

The orthogonal-gradient mechanism serves as a dynamic whitening operation, continually projecting out the redundant signal represented by the dominant subspace of past gradients. For streaming video, this mitigates the adverse effects of non-independent mini-batches, forcing updates to focus on "novel" directions, thus preserving the learning signal necessary for generalization. The approach degrades gracefully to conventional AdamW in IID conditions since the projection has negligible effect when gradient correlation is weak.
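The graceful degradation to AdamW can be illustrated numerically: for independent high-dimensional Gaussian vectors, the component the projection removes is a vanishing fraction of the gradient norm, on the order of $1/\sqrt{d}$ (a toy check, not an experiment from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 10_000
c = rng.normal(size=d)   # stands in for the gradient EMA c_{t-1}
g = rng.normal(size=d)   # an independent ("novel") gradient, i.e. weak correlation

removed = (np.dot(g, c) / np.dot(c, c)) * c   # component projected out
ratio = np.linalg.norm(removed) / np.linalg.norm(g)
print(f"fraction of gradient norm removed: {ratio:.4f}")  # roughly 1/sqrt(d)
```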

The AdamO framework applies similar geometric reasoning at the parameter level, eliminating radial oscillations that contaminate tangential preconditioning. Because radial and tangential directions are treated independently, noisy norm fluctuations no longer suppress genuine feature learning updates in the tangential subspace, yielding smoother and more effective learning dynamics (Chen et al., 4 Feb 2026).

5. Empirical Performance, Trade-Offs, and Overhead

Empirical results demonstrate that Orthogonal-Gradient AdamW provides substantial gains in settings characterized by strong temporal correlation between batches:

  • DoRA single-video pretraining: standard AdamW fails (ImageNet top-1 ≈ 6%, kNN ≈ 1.8%), while Orthogonal-Gradient AdamW achieves ImageNet linear-probe ≈ 64.5% and kNN ≈ 51.8% (Han et al., 2 Apr 2025).
  • VideoMAE on multi-video datasets: Orthogonal-Gradient AdamW outperforms AdamW across both shuffled and temporally sequential training (e.g., on Something-Something-V2, sequential: AdamW ≈ 16.4, Orthogonal-AdamW ≈ 18.4).
  • Future-frame prediction: consistent improvements in MSE and PSNR (up to +0.8 dB PSNR over AdamW and RMSProp).

The computational overhead consists of a single additional vector for the gradient EMA and a per-step projection, which is insignificant relative to the cost of forward and backward passes. No replay-buffer or sample shuffling is required, and the framework integrates easily with existing optimizer code. Similar efficiency is reported for AdamO, where only a small number of additional moment statistics and projections are maintained.

6. Extensions: Decoupled Orthogonal Dynamics and Adaptive Radial Stepping

AdamO (Chen et al., 4 Feb 2026) generalizes the orthogonal-gradient paradigm with further refinements:

  • Adaptive step sizing in the radial direction via a curvature proxy that tracks the squared change in gradients, slowing learning in high-curvature regions.
  • Architecture-aware modifications handle low-dimensional or scale-invariant parameters by defaulting to standard Adam behavior where geometry-aware updates are unnecessary.
  • Broader tuning insensitivity: AdamO's high-performing hyperparameter region is substantially wider than AdamW's.
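One plausible form of such a curvature proxy, sketched below, tracks an EMA of the squared change in successive gradients and shrinks the radial step where the proxy is large. This is an illustrative construction; the functional forms, names, and the $\eta/(1+\sqrt{\kappa})$ schedule are assumptions of this sketch, not the paper's exact rule:

```python
import numpy as np

def update_curvature_proxy(kappa, g, g_prev, beta=0.99):
    """EMA of the squared change in successive gradients (a cheap curvature proxy)."""
    return beta * kappa + (1 - beta) * float(np.dot(g - g_prev, g - g_prev))

def radial_step_size(eta, kappa):
    """Shrink the radial step where the curvature proxy is large (assumed schedule)."""
    return eta / (1.0 + np.sqrt(kappa))

kappa, g_prev = 0.0, np.zeros(3)
for g in [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]:
    kappa = update_curvature_proxy(kappa, g, g_prev)
    g_prev = g

print(radial_step_size(1e-3, kappa))  # smaller than eta when gradients change fast
```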

Experiments confirm enhanced generalization and stability in both vision (e.g., CIFAR-100) and algorithmic tasks (modulo addition grokking), with lower gradient norm fluctuations and better loss-trajectory smoothness, demonstrating the importance of orthogonal dynamics decoupling for regularizing adaptive network optimizers (Chen et al., 4 Feb 2026).

7. Relation to Broader Optimizer Design

The incorporation of orthogonal-gradient projections situates these variants within a broader trend toward geometry-aware optimization. These techniques reflect increasing recognition that both the redundancy in gradient statistics (as in streaming domains) and the entanglement of parameter norm and direction (as in deep representations) undermine effective learning when left unchecked. Methods such as Orthogonal-Gradient AdamW and AdamO bypass these obstacles via principled subspace projections and decoupling, offering a tractable and broadly applicable enhancement to the class of adaptive optimizers built around AdamW. Their empirical and theoretical properties suggest that similar geometric interventions may benefit a range of applications where gradient statistics deviate from the IID assumption, or where expressive feature learning contends with destabilizing dynamics in parameter norms (Han et al., 2 Apr 2025, Chen et al., 4 Feb 2026).
