
Arc Gradient Descent (ArcGD) Optimization

Updated 30 December 2025
  • Arc Gradient Descent is a phase-aware optimization framework that leverages arc-length geometry to compute bounded and controlled gradient updates.
  • It adjusts update dynamics using saturation, transition, and floor terms to mitigate exploding and vanishing gradients in high-dimensional settings.
  • Empirical tests on benchmarks like the Rosenbrock function and CIFAR image classification demonstrate its robust convergence and superior stability compared to traditional optimizers.

Arc Gradient Descent (ArcGD) is an optimization framework formulated to provide mathematically controlled, phase-aware step sizes in gradient-based minimization. Its key innovations include explicit elementwise update bounds and user-tunable step-size dynamics that address both exploding and vanishing gradients, thereby improving stability and generalization for highly non-convex and high-dimensional objectives, particularly in deep learning and geometric stress-test settings (Verma et al., 7 Dec 2025, Mishra et al., 2023).

1. Mathematical Foundations and Update Rule

ArcGD defines the step update by considering the arc-length geometry of the objective function. In scalar form, let $f(x)$ be a differentiable function with $g_x = f'(x)$. The method utilizes the approximation

$$ds = \sqrt{1 + (f'(x))^2}\,dx,$$

leading to the fundamental normalised gradient

$$T_x = \frac{g_x}{\sqrt{1+g_x^2}} \in (-1, 1).$$

The most elementary update is then

$$\Delta x = -a\,T_x,$$

where $a$ is a positive ceiling, enforcing $|\Delta x| \le a$.

To robustly traverse distinct optimization regimes (“phases”), ArcGD introduces two additional terms: a transition coefficient $b$ and a floor $c$,

$$\Delta x = -a\,T_x - b\,T_x\,(1-|T_x|) - c\,\mathrm{sign}(T_x)\,(1-|T_x|).$$

Consequently:

  • The saturation term $-a\,T_x$ dominates for $|g_x| \gg 1$, enforcing upper-bounded steps in steep, unstable regions.
  • The transition term $-b\,T_x\,(1-|T_x|)$ smooths the step-size in intermediate-gradient regions, giving more linear “GD-like” behavior.
  • The floor term $-c\,\mathrm{sign}(T_x)\,(1-|T_x|)$ guarantees nonzero progress even when gradients largely vanish.
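The three phases can be checked numerically with a minimal scalar sketch of the three-term rule. The function name `arc_step` and the default values of `a`, `b`, and `c` are illustrative assumptions, not settings taken from the source.

```python
import math

def arc_step(g, a=0.01, b=0.001, c=0.0001):
    """Three-term ArcGD step for a scalar gradient g (illustrative defaults)."""
    t = g / math.sqrt(1.0 + g * g)   # normalised gradient T_x, always in (-1, 1)
    slack = 1.0 - abs(t)             # 1 - |T_x|: near 1 for tiny gradients, near 0 for huge ones
    sign = (t > 0) - (t < 0)         # sign(T_x), with sign(0) = 0
    return -a * t - b * t * slack - c * sign * slack

# Vanishing, intermediate, and very steep gradients.
for g in (1e-6, 1.0, 1e6):
    print(f"g={g:g}  step={arc_step(g):.6g}")
```

For the very steep gradient the step magnitude approaches the ceiling `a`, while for the vanishing gradient it stays close to the floor `c` rather than collapsing toward zero.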

By parametric control of $(a, b, c)$, users directly govern the ceiling, transition acceleration, and minimum update. These terms are interpreted as controlling “high”, “transition”, and “vanishing” gradient phases, respectively. The composite coefficient $a + b$ acts as the effective learning rate when gradients are small, where $T_x \approx g_x$ (Verma et al., 7 Dec 2025).
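As a worked check, the three-term rule can be expanded in its two limiting regimes. This is a sketch derived directly from the update formula in this section, not an expression quoted from the cited paper.

```latex
% Small-gradient regime: |g_x| << 1, so T_x \approx g_x and 1 - |T_x| \approx 1
\Delta x \approx -(a + b)\,g_x - c\,\mathrm{sign}(g_x)

% Large-gradient regime: |g_x| >> 1, so |T_x| \to 1 and 1 - |T_x| \to 0
\Delta x \to -a\,\mathrm{sign}(g_x)
```

In the first regime the rule behaves like plain gradient descent with step size $a + b$ plus a constant-magnitude push of size $c$; in the second only the ceiling $a$ survives.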

For high-dimensional parameters $x \in \mathbb{R}^n$, the update is elementwise: $\Delta x_i = -a\,T_i - b\,T_i\,(1-|T_i|) - c\,\mathrm{sign}(T_i)\,(1-|T_i|)$ with $T_i = g_i / \sqrt{1+g_i^2}$, where $g_i$ is the $i$-th gradient component. Optionally, a moving-average filtered gradient may be used in place of $g_i$ for robustness in noisy settings.
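A minimal sketch of the elementwise update, with an optional exponential moving average of the gradient, might look as follows. The function name `arcgd_update`, the smoothing factor `beta`, and all default values are assumptions made for illustration; the source does not specify the exact filter.

```python
import math

def arcgd_update(x, grad, a=0.01, b=0.001, c=0.0001, ema=None, beta=0.9):
    """Return (new_x, new_ema) after one elementwise ArcGD step.

    If `ema` is a list, the step uses the moving-average filtered gradient
    instead of the raw gradient (assumed filter: simple exponential average).
    """
    if ema is not None:
        ema = [beta * m + (1.0 - beta) * g for m, g in zip(ema, grad)]
        grad = ema
    new_x = []
    for xi, gi in zip(x, grad):
        t = gi / math.sqrt(1.0 + gi * gi)   # per-coordinate normalised gradient
        slack = 1.0 - abs(t)
        sign = (t > 0) - (t < 0)
        new_x.append(xi - a * t - b * t * slack - c * sign * slack)
    return new_x, ema

# One coordinate with a huge gradient, one with a tiny gradient.
new_x, _ = arcgd_update([1.0, -2.0], [1e3, -1e-3])
print(new_x)
```

Each coordinate is treated independently, so a single exploding component is capped near `a` without shrinking the step taken along the other coordinates.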

2. Adaptive Dynamics and Angle-Based Learning Integration

An alternative ArcGD scheme defines learning rate adaptation via the local geometry of gradient directions. Specifically, the method performs:

  1. At the current iterate, computes the descent direction.
  2. Probes at a nearby point along a direction orthogonal to the descent direction.
  3. Calculates the gradient at the probe point.
  4. Computes the angle between the gradient at the current iterate and the gradient at the probe point.
  5. Sets a candidate step-size from this angle.
  6. Smooths the candidate step-size with an exponential moving average.
  7. If a boosting condition holds, applies a boosting rule to the smoothed step-size.
  8. Final update: steps from the current iterate along the descent direction using the resulting step-size.
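Two building blocks of the angle-based scheme, the angle between two gradient vectors and the exponential smoothing of a scalar, can be sketched as below. The helper names `angle_between` and `ema` and the factor `beta` are hypothetical; the mapping from angle to step-size and the boosting rule are not reproduced here.

```python
import math

def angle_between(u, v):
    """Angle in radians between two non-zero vectors u and v."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)

def ema(previous, value, beta=0.9):
    """Exponential moving average of a scalar quantity."""
    return beta * previous + (1.0 - beta) * value

print(angle_between([1.0, 0.0], [0.0, 1.0]))  # orthogonal gradients
```

A small angle indicates that the gradient barely changes direction between the iterate and the probe point, while a large angle signals strong local curvature.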

This approach enables direct geometric sensitivity to local loss surface curvature, adapting per-step magnitudes without explicit gradient norm scaling (Mishra et al., 2023).

3. Algorithmic Procedure and Variants

The n-dimensional, phase-aware ArcGD procedure applies the three-term update of Section 1 elementwise at each iteration: compute the gradient (optionally moving-average filtered), normalise each component, and apply the bounded step. Under a specific parameter setting, ArcGD reduces to a variant matching the Lion optimizer’s structure, explicitly relating ArcGD’s update decomposition to sign-based adaptive methods (Verma et al., 7 Dec 2025).
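The n-dimensional, phase-aware loop can be sketched in a few lines of Python. This is an illustrative reconstruction from the update rule in Section 1, not the authors' pseudocode; the function name `arcgd_minimize`, the quadratic test objective, and the values of `a`, `b`, `c`, and `steps` are all assumptions.

```python
import math

def arcgd_minimize(grad_fn, x0, a=0.05, b=0.01, c=0.0, steps=500):
    """Run elementwise ArcGD for a fixed number of steps.

    Returns the final iterate and the largest single-coordinate step taken.
    """
    x = list(x0)
    max_step = 0.0
    for _ in range(steps):
        g = grad_fn(x)                          # gradient at the current iterate
        for i, gi in enumerate(g):
            t = gi / math.sqrt(1.0 + gi * gi)   # normalised gradient in (-1, 1)
            slack = 1.0 - abs(t)
            sign = (t > 0) - (t < 0)
            dx = -a * t - b * t * slack - c * sign * slack
            max_step = max(max_step, abs(dx))
            x[i] += dx
    return x, max_step

# Hypothetical test objective: a mildly ill-conditioned quadratic.
def f(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad_f(x):
    return [x[0], 10.0 * x[1]]

x_final, max_step = arcgd_minimize(grad_f, [3.0, -2.0])
print(f(x_final), max_step)
```

The floor is set to `c = 0.0` here so the iterate can settle exactly at the minimum; with `c > 0` each coordinate keeps moving by roughly `c` per step near a stationary point, which is the intended minimum-progress behavior but prevents exact convergence in this toy setting.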

4. Theoretical Guarantees and Properties

In angle-based ArcGD, convergence is proven under standard convexity and smoothness conditions. If $f$ is convex and differentiable and $\nabla f$ is $L$-Lipschitz, then for any probe step satisfying the required step condition, each iteration yields a sufficient decrease in the objective, establishing monotonic decrease and satisfaction of the Armijo sufficient-decrease condition (the first Wolfe condition). This supports convergence to an $\epsilon$-optimal solution under convexity (Mishra et al., 2023). For the bounded-update, phase-aware ArcGD, formal global non-convex convergence proofs remain open, but the explicit magnitude control and minimum-step enforcement aim to ensure stability in classic high-curvature pitfalls.
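For reference, the textbook form of the Armijo condition and the standard descent lemma for $L$-smooth functions are given below. These are general results stated for a plain gradient step with step-size $\alpha$; they are not the specific inequalities or constants of the cited analysis.

```latex
% Armijo sufficient-decrease condition along -\nabla f(x), with 0 < c_1 < 1:
f\bigl(x - \alpha \nabla f(x)\bigr) \le f(x) - c_1\,\alpha\,\|\nabla f(x)\|^2

% Descent lemma: if \nabla f is L-Lipschitz and 0 < \alpha \le 1/L, then
f\bigl(x - \alpha \nabla f(x)\bigr) \le f(x) - \tfrac{\alpha}{2}\,\|\nabla f(x)\|^2
```

The second inequality shows that any step-size $\alpha \le 1/L$ satisfies the Armijo condition with $c_1 = 1/2$, which is the usual route to monotonic decrease under smoothness.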

5. Empirical Performance Across Domains

ArcGD has been comprehensively evaluated across geometric and practical benchmarks:

  • Rosenbrock function (up to 50,000 dimensions): On the stochastic Rosenbrock function, with high curvature and extreme ill-conditioning, ArcGD consistently converged, outperforming Adam in all tested high-dimensional regimes for both matched and default learning rates. In one reported configuration, ArcGD averaged 9,197 iterations while Adam required 15,658. Adam failed completely at 50,000D in some configurations, while ArcGD remained robust (Verma et al., 7 Dec 2025).
  • Image Classification (CIFAR-10, CIFAR-100, mini-ImageNet): Across diverse MLPs (depth 1–5, with parameter counts in the millions) and standard CNNs (ResNet, DenseNet, VGG, EfficientNet), ArcGD delivered superior or competitive test accuracy. On CIFAR-10 at 20,000 iterations, its test accuracy was reported against Adam, AdamW, SGD, and Lion. ArcGD won or tied in 6 of 8 MLP architectures and continued to improve where others regressed with extended training, suggesting resistance to overfitting and robustness under long-horizon optimization (Verma et al., 7 Dec 2025, Mishra et al., 2023).
  • Empirical stability: On deep ResNet/DenseNet/EfficientNet, the angle-based variant achieved the highest top-1 accuracy in all early training windows and maintained or improved performance over long training runs, even as Adam-based methods plateaued.
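For readers who want to reproduce the geometric stress test, the standard deterministic n-dimensional Rosenbrock function and its gradient can be written as below. This is the common chained textbook form and is an assumption about the benchmark; the stochastic variant used in the cited experiments is not reproduced here.

```python
def rosenbrock(x):
    """Chained Rosenbrock function; global minimum 0 at x = (1, ..., 1)."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

def rosenbrock_grad(x):
    """Analytic gradient of the chained Rosenbrock function."""
    g = [0.0] * len(x)
    for i in range(len(x) - 1):
        r = x[i + 1] - x[i] ** 2                       # residual of the i-th valley term
        g[i] += -400.0 * x[i] * r - 2.0 * (1.0 - x[i])
        g[i + 1] += 200.0 * r
    return g

print(rosenbrock([0.0, 0.0]), rosenbrock_grad([0.0, 0.0]))
```

The factor of 100 on the valley term produces the narrow curved valley and the ill-conditioning that make this function a common test of step-size control.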

6. Critical Strengths, Limitations, and Use Guidance

Strengths:

  • Explicit elementwise upper bounds (the ceiling $a$) and minimum floors ($c$) protect against divergent or stagnant updates across all phases of the optimization landscape.
  • Phase-aware decomposition, via parameters $(a, b, c)$, accommodates both aggressive search and conservative convergence, with direct user control.
  • Empirical superiority in both synthetic and real-world, high-dimensional, and highly non-convex scenarios.

Limitations:

  • Introduces additional hyperparameters beyond classical schemes; careful selection of $(a, b, c)$ is necessary for optimal performance.
  • Per-iteration computational cost is increased due to elementwise terms and, in angle-based variants, the requirement of two gradient computations per step.
  • No global, non-convex theoretical convergence guarantees currently available; theoretical analysis of such regimes remains as future work (Verma et al., 7 Dec 2025).

Recommended settings: fixed parameter values serve as robust defaults, with a separate configuration for speed-critical or Lion-style use. An adaptive setting is advised when tackling extremely vanishing gradients.

ArcGD generalizes classical gradient descent and can specialize to a Lion-like optimizer under specific parameter constraints. The connection arises by interpreting the signed, elementwise-normalized update as sign-based momentum with linear bias in the limit. Such relations position ArcGD as a parent class of controlled, phase-aware optimizers with tunable adaptivity. Momentum and noise-robust variants, as well as further integration with curvature-sensitive adaptive schemes, are possible extensions (Verma et al., 7 Dec 2025).


Arc Gradient Descent represents a phase-aware, geometrically grounded reformulation of standard gradient methods. It interpolates between safe, bounded step dynamics and aggressive, minimum-progress regimes, and its reported empirical robustness positions it as an extensible contribution to modern optimization for deep learning and complex numerical landscapes (Verma et al., 7 Dec 2025, Mishra et al., 2023).
