Arc Gradient Descent (ArcGD) Optimization

Updated 30 December 2025
  • Arc Gradient Descent is a phase-aware optimization framework that leverages arc-length geometry to compute bounded and controlled gradient updates.
  • It adjusts update dynamics using saturation, transition, and floor terms to mitigate exploding and vanishing gradients in high-dimensional settings.
  • Empirical tests on benchmarks like the Rosenbrock function and CIFAR image classification demonstrate its robust convergence and superior stability compared to traditional optimizers.

Arc Gradient Descent (ArcGD) is an optimization framework formulated to provide mathematically controlled, phase-aware step sizes in gradient-based minimization. Its key innovations include explicit elementwise update bounds and user-tunable step-size dynamics that address both exploding and vanishing gradients, thereby improving stability and generalization for highly non-convex and high-dimensional objectives, particularly in deep learning and geometric stress-test settings (Verma et al., 7 Dec 2025, Mishra et al., 2023).

1. Mathematical Foundations and Update Rule

ArcGD defines the step update by considering the arc-length geometry of the objective function. In scalar form, let f(x) be a differentiable function with g_x = f'(x). The method utilizes the approximation

ds = \sqrt{1 + (f'(x))^2}\,dx,

leading to the fundamental normalised gradient

T_x = \frac{g_x}{\sqrt{1+g_x^2}} \in (-1, 1).

The most elementary update is then

\Delta x = -a\,T_x,

where a is a positive ceiling, enforcing |\Delta x| \le a.
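This bound is easy to verify numerically. A minimal sketch (the function name `arcgd_step_basic` is illustrative, not from the paper):

```python
import math

def arcgd_step_basic(g, a):
    """Most elementary ArcGD update: arc-length-normalize the scalar
    gradient g, then scale by the ceiling a, so |dx| < a always."""
    T = g / math.sqrt(1.0 + g * g)   # T in (-1, 1) for any finite g
    return -a * T

# The step saturates near the ceiling for huge gradients and
# shrinks smoothly with the gradient for tiny ones:
print(arcgd_step_basic(1e6, 0.01))   # ~ -0.01 (bounded)
print(arcgd_step_basic(1e-6, 0.01))  # ~ -1e-8 (vanishes with g)
```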

To robustly traverse distinct optimization regimes (“phases”), ArcGD introduces two additional terms: a transition coefficient b and a floor c,

\Delta x = -a\,T_x \;-\; b\,T_x\,(1-|T_x|) \;-\; c\,\mathrm{sign}(T_x)\,(1-|T_x|).

Consequently:

  • The saturation term -a\,T_x dominates for |T_x| \approx 1, enforcing upper-bounded steps in steep, unstable regions.
  • The transition term -b\,T_x\,(1-|T_x|) smooths the step size in intermediate-gradient regions, giving more linear, GD-like behavior.
  • The floor term -c\,\mathrm{sign}(T_x)\,(1-|T_x|) guarantees nonzero progress even when gradients largely vanish.

By parametric control of (a, b, c), users directly govern the ceiling, transition acceleration, and minimum update. These terms are interpreted as controlling the “high”, “transition”, and “vanishing” gradient phases, respectively. The composite coefficient (a + b - c) acts as the effective learning rate \alpha_{\text{eff}} (Verma et al., 7 Dec 2025).

For vector-valued objectives and high-dimensional x \in \mathbb{R}^n, the update is applied elementwise,

\Delta x_i = -a\,T_i - b\,T_i\,(1-|T_i|) - c\,\mathrm{sign}(T_i)\,(1-|T_i|),

with T_i = g_i/\sqrt{1+g_i^2}. Optionally, a moving-average filtered gradient m_t = \beta\,m_{t-1} + (1-\beta)\,g_t may be used in place of g_t for robustness in noisy settings.
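A one-step elementwise sketch with the optional moving-average filter (function name and test values are illustrative):

```python
import numpy as np

def arcgd_update(g, m, a, b, c, beta=0.9):
    """One elementwise ArcGD step on an EMA-filtered gradient.
    Returns (dx, updated filter state m); shapes follow g."""
    m = beta * m + (1.0 - beta) * g          # m_t = beta*m_{t-1} + (1-beta)*g_t
    T = m / np.sqrt(1.0 + m**2)              # per-coordinate T_i in (-1, 1)
    absT = np.abs(T)
    dx = -a * T - b * T * (1 - absT) - c * np.sign(T) * (1 - absT)
    return dx, m

# Steep, transitional, and vanishing coordinates in one vector:
g = np.array([1e4, 0.5, 1e-8])
dx, m = arcgd_update(g, np.zeros(3), a=0.01, b=0.001, c=0.0001)
print(dx)  # every |dx_i| <= a + b + c; the vanishing coordinate still moves ~ c
```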

2. Adaptive Dynamics and Angle-Based Learning Integration

An alternative ArcGD scheme defines learning-rate adaptation via the local geometry of gradient directions. Specifically, at each iterate the method:

  1. At x_t, computes the descent direction g_t = -\nabla f(x_t).
  2. Probes at x_t' = x_t - h\,p_t along a direction p_t orthogonal to g_t.
  3. Calculates the gradient at x_t', denoted g_t^\perp.
  4. Computes the angle \theta_t between g_t and g_t^\perp:

\theta_t = \arccos\left(\frac{g_t \cdot g_t^\perp}{\|g_t\|\,\|g_t^\perp\|}\right) + \epsilon.

  5. Sets d_t = h\cot(\theta_t) as the candidate step size.
  6. Smooths d_t with an exponential moving average to yield s_t.
  7. If d_t < s_t, applies the boosting rule d_t \leftarrow 2 d_t.
  8. Applies the final update:

x_{t+1} = x_t + d_t\,\frac{g_t}{\|g_t\|}.

This approach enables direct geometric sensitivity to local loss surface curvature, adapting per-step magnitudes without explicit gradient norm scaling (Mishra et al., 2023).
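The steps above can be sketched as a self-contained routine (the random Gram–Schmidt choice of p_t, the \epsilon placements, and all names are assumptions for illustration, not the paper's exact implementation):

```python
import numpy as np

def angle_arcgd_step(grad, x, h, s_prev, rho=0.9, eps=1e-8):
    """One angle-based ArcGD step: probe orthogonally to the descent
    direction, measure the angle between the two gradients, and step
    d_t = h*cot(theta_t) along the normalized descent direction.
    s_prev is the running EMA of candidate step sizes."""
    g = -grad(x)                                     # descent direction g_t
    r = np.random.default_rng(0).normal(size=x.shape)
    p = r - (r @ g) / (g @ g + eps) * g              # Gram-Schmidt: p orthogonal to g
    p = p / (np.linalg.norm(p) + eps)
    g_perp = -grad(x - h * p)                        # gradient at the probe point
    cos_t = (g @ g_perp) / (np.linalg.norm(g) * np.linalg.norm(g_perp) + eps)
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0)) + eps
    d = h / np.tan(theta)                            # candidate step h*cot(theta)
    s = rho * s_prev + (1.0 - rho) * d               # EMA smoothing -> s_t
    if d < s:
        d = 2.0 * d                                  # boosting rule
    return x + d * g / (np.linalg.norm(g) + eps), s

# On f(x) = ||x||^2 the probe angle encodes the curvature, and a
# single step lands essentially at the minimum:
x_new, s = angle_arcgd_step(lambda z: 2.0 * z, np.array([1.0, 0.0]), 0.1, 0.0)
print(x_new)
```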

3. Algorithmic Procedure and Variants

The following pseudocode summarizes n-dimensional ArcGD (phase-aware formulation):

import numpy as np

def arcgd(grad, x0, a, b, c, T_max, beta=0.9, use_momentum=False, eps=1e-12):
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for t in range(T_max):
        g = grad(x)
        if use_momentum:
            m = beta * m + (1 - beta) * g
            g_eff = m
        else:
            g_eff = g
        T = g_eff / np.sqrt(1.0 + g_eff**2)                  # elementwise T_i in (-1, 1)
        eta_mid = b * (1.0 - np.abs(T)) * T                  # transition term
        c_adapt = np.minimum(c, c * np.abs(T) / (1.0 - np.abs(T) + eps))  # adaptive floor
        eta_low = c_adapt * np.sign(T) * (1.0 - np.abs(T))   # floor term
        x = x - a * T - eta_mid - eta_low
    return x

Setting b = 0 transforms ArcGD into a variant matching the Lion optimizer’s structure, explicitly relating ArcGD’s update decomposition to sign-based adaptive methods (Verma et al., 7 Dec 2025).
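With the fixed (non-adaptive) floor, the b = 0, a = c = \gamma case collapses algebraically to a pure sign update, since -\gamma T - \gamma\,\mathrm{sign}(T)(1-|T|) = -\gamma\,\mathrm{sign}(T)(|T| + 1 - |T|) = -\gamma\,\mathrm{sign}(g). A quick numeric check (function name illustrative):

```python
import numpy as np

def arcgd_dx(g, a, b, c):
    """Elementwise phase-aware ArcGD update for a gradient vector g."""
    T = g / np.sqrt(1.0 + g**2)
    absT = np.abs(T)
    return -a * T - b * T * (1 - absT) - c * np.sign(T) * (1 - absT)

gamma = 0.01
g = np.array([3.0, -0.2, 1e-7])
print(arcgd_dx(g, a=gamma, b=0.0, c=gamma))  # == -gamma * sign(g) elementwise
```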

4. Theoretical Guarantees and Properties

In angle-based ArcGD, convergence is proven under standard convexity and smoothness conditions. If f is convex and differentiable and \nabla f is L-Lipschitz, then for any probe step h_t satisfying

h_t \leq \frac{\|\nabla f(x_t)\|}{L\,\cot\theta_t},

each iteration yields

f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\,\|\nabla f(x_t)\|^2,

establishing monotonic decrease and satisfaction of the Armijo (first Wolfe) sufficient-decrease condition with c_1 = 1/(2L). This supports O(1/\epsilon) convergence to an \epsilon-optimal solution under convexity (Mishra et al., 2023). For the bounded-update, phase-aware ArcGD, formal global non-convex convergence proofs remain open, but the explicit magnitude control and minimum-step enforcement aim to ensure stability in classic high-curvature pitfalls.
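The per-iteration inequality follows the standard descent-lemma pattern; a sketch, assuming the realized step is exactly -\tfrac{1}{L}\nabla f(x_t) (which the bound on h_t is designed to enforce):

```latex
\begin{aligned}
f(x_{t+1}) &\le f(x_t) + \nabla f(x_t)^\top (x_{t+1}-x_t) + \tfrac{L}{2}\,\|x_{t+1}-x_t\|^2
  && \text{($L$-smoothness)} \\
&= f(x_t) - \tfrac{1}{L}\,\|\nabla f(x_t)\|^2 + \tfrac{1}{2L}\,\|\nabla f(x_t)\|^2
  && \text{with } x_{t+1}-x_t = -\tfrac{1}{L}\,\nabla f(x_t) \\
&= f(x_t) - \tfrac{1}{2L}\,\|\nabla f(x_t)\|^2 .
\end{aligned}
```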

5. Empirical Performance Across Domains

ArcGD has been comprehensively evaluated across geometric and practical benchmarks:

  • Rosenbrock function (up to 50,000 dimensions): On the stochastic Rosenbrock with high curvature and extreme ill-conditioning, ArcGD consistently converged, outperforming Adam in all tested high-dimensional regimes for both matched and default learning rates. At n = 1000, ArcGD averaged 9,197 iterations (distance 1.11×10⁻⁶); Adam required 15,658 iterations (distance 2.71×10⁻⁴). Adam failed completely at 50,000D in some configurations, while ArcGD remained robust (Verma et al., 7 Dec 2025).
  • Image classification (CIFAR-10, CIFAR-100, mini-ImageNet): Across diverse MLPs (depth 1–5, up to 5.5 million parameters) and standard CNNs (ResNet, DenseNet, VGG, EfficientNet), ArcGD delivered superior or competitive test accuracy. On CIFAR-10 at 20,000 iterations, ArcGD achieved 50.7% test accuracy vs. Adam’s 46.8%, AdamW’s 46.6%, SGD’s 49.6%, and Lion’s 43.3%. ArcGD won or tied in 6 of 8 MLP architectures, continuing to improve where others regressed with extended training, thus showing intrinsic resistance to overfitting and robustness under long-horizon optimization (Verma et al., 7 Dec 2025, Mishra et al., 2023).
  • Empirical stability: On deep ResNet/DenseNet/EfficientNet, the angle-based variant achieved the highest top-1 accuracy in all early training windows and maintained or improved performance in long epochs, even as Adam-based methods plateaued.

6. Critical Strengths, Limitations, and Use Guidance

Strengths:

  • Explicit elementwise upper bounds (a) and minimum floors (c) protect against divergent or stagnant updates across all phases of the optimization landscape.
  • Phase-aware decomposition, via the parameters (a, b, c), accommodates both aggressive search and conservative convergence, with direct user control.
  • Empirical superiority in both synthetic and real-world, high-dimensional, and highly non-convex scenarios.

Limitations:

  • Introduces additional hyperparameters beyond classical schemes; careful selection of (a, b, c) is necessary for optimal performance.
  • Per-iteration computational cost is increased due to elementwise terms and, in angle-based variants, the requirement of two gradient computations per step.
  • No global non-convex convergence guarantees are currently available; theoretical analysis of such regimes remains future work (Verma et al., 7 Dec 2025).

Recommended settings: (a, b, c) = (0.01, 0.001, 0.0001) and \beta = 0.9 serve as robust defaults. For speed-critical or Lion-style use, set b = 0. Adaptive c is advised when tackling extremely vanishing gradients.
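As an end-to-end illustration of these defaults, here is a standalone sketch (the helper name `arcgd_minimize` is ours; momentum filtering with the fixed floor is used for brevity) applied to the 2-D Rosenbrock function from the benchmarks above:

```python
import numpy as np

def arcgd_minimize(grad, x0, a=0.01, b=0.001, c=0.0001, beta=0.9, steps=20000):
    """Run phase-aware ArcGD with the recommended default (a, b, c, beta)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        m = beta * m + (1.0 - beta) * grad(x)    # EMA-filtered gradient
        T = m / np.sqrt(1.0 + m**2)
        x += -a * T - b * T * (1 - np.abs(T)) - c * np.sign(T) * (1 - np.abs(T))
    return x

# 2-D Rosenbrock gradient; the global minimum sits at (1, 1):
rosen_grad = lambda z: np.array([
    -2.0 * (1.0 - z[0]) - 400.0 * z[0] * (z[1] - z[0]**2),
    200.0 * (z[1] - z[0]**2),
])
print(arcgd_minimize(rosen_grad, [-1.2, 1.0]))
```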

ArcGD generalizes classical gradient descent and can specialize to a Lion-like optimizer under specific parameter constraints (b = 0, a = c = \gamma \ll 1). The connection arises by interpreting the signed, elementwise-normalized update as sign-based momentum with a linear bias, yielding

\Delta x = -\gamma\,\mathrm{sign}(\mathrm{EMA}(g)) + \gamma\,\mathrm{EMA}(g)

in the limit. Such relations elucidate ArcGD’s applicability as a parent class of controlled, phase-aware optimizers with tunable adaptivity. The potential for momentum and noise-robust variants, as well as further integration with curvature-sensitive adaptive schemes, suggests a broad and flexible optimization landscape (Verma et al., 7 Dec 2025).


Arc Gradient Descent represents a phase-aware, geometrically grounded reformulation of standard gradient methods. Its capacity to interpolate between safe, bounded step dynamics and aggressive, minimum-progress regimes, with both theoretical and empirical support, positions it as a significant and extensible contribution to modern optimization for deep learning and complex numerical landscapes (Verma et al., 7 Dec 2025, Mishra et al., 2023).
