Arc Gradient Descent (ArcGD) Optimization

Updated 30 December 2025
  • Arc Gradient Descent is a phase-aware optimization framework that leverages arc-length geometry to compute bounded and controlled gradient updates.
  • It adjusts update dynamics using saturation, transition, and floor terms to mitigate exploding and vanishing gradients in high-dimensional settings.
  • Empirical tests on benchmarks like the Rosenbrock function and CIFAR image classification demonstrate its robust convergence and superior stability compared to traditional optimizers.

Arc Gradient Descent (ArcGD) is an optimization framework formulated to provide mathematically controlled, phase-aware step sizes in gradient-based minimization. Its key innovations include explicit elementwise update bounds and user-tunable step-size dynamics that address both exploding and vanishing gradients, thereby improving stability and generalization for highly non-convex and high-dimensional objectives, particularly in deep learning and geometric stress-test settings (Verma et al., 7 Dec 2025, Mishra et al., 2023).

1. Mathematical Foundations and Update Rule

ArcGD defines the step update by considering the arc-length geometry of the objective function. In scalar form, let f(x) be a differentiable function with g_x = f'(x). The method utilizes the approximation

ds = \sqrt{1 + (f'(x))^2}\,dx,

leading to the fundamental normalised gradient

T_x = \frac{g_x}{\sqrt{1+g_x^2}} \in (-1, 1).

The most elementary update is then

\Delta x = -a\,T_x,

where a is a positive ceiling, enforcing |\Delta x| \le a.
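This bound is easy to verify numerically. A minimal sketch (the function name `arcgd_step_basic` is illustrative, not from the paper):

```python
import math

def arcgd_step_basic(g, a):
    """Most elementary ArcGD update: arc-length-normalize the scalar
    gradient g, then scale by the ceiling a, so |dx| < a always."""
    T = g / math.sqrt(1.0 + g * g)   # T in (-1, 1) for any finite g
    return -a * T

# The step saturates near the ceiling for huge gradients and
# shrinks smoothly with the gradient for tiny ones:
print(arcgd_step_basic(1e6, 0.01))   # ~ -0.01 (bounded)
print(arcgd_step_basic(1e-6, 0.01))  # ~ -1e-8 (vanishes with g)
```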

To robustly traverse distinct optimization regimes (“phases”), ArcGD introduces two additional terms: a transition coefficient b and a floor c,

\Delta x = -a\,T_x \;-\; b\,T_x\,(1-|T_x|) \;-\; c\,\mathrm{sign}(T_x)\,(1-|T_x|).

Consequently:

  • The saturation term -a\,T_x dominates for |T_x| \approx 1, enforcing upper-bounded steps in steep, unstable regions.
  • The transition term -b\,T_x\,(1-|T_x|) smooths the step size in intermediate-gradient regions, giving more linear, GD-like behavior.
  • The floor term -c\,\mathrm{sign}(T_x)\,(1-|T_x|) guarantees nonzero progress even when gradients largely vanish.

By parametric control of (a, b, c), users directly govern the ceiling, transition acceleration, and minimum update. These terms are interpreted as controlling the “high”, “transition”, and “vanishing” gradient phases, respectively. The composite coefficient (a + b - c) acts as the effective learning rate \alpha_{\text{eff}} (Verma et al., 7 Dec 2025).

For vector-valued objectives and high-dimensional x \in \mathbb{R}^n, the update is applied elementwise,

\Delta x_i = -a\,T_i - b\,T_i\,(1-|T_i|) - c\,\mathrm{sign}(T_i)\,(1-|T_i|),

with T_i = g_i/\sqrt{1+g_i^2}. Optionally, a moving-average filtered gradient m_t = \beta\,m_{t-1} + (1-\beta)\,g_t may be used in place of g_t for robustness in noisy settings.
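A one-step elementwise sketch with the optional moving-average filter (function name and test values are illustrative):

```python
import numpy as np

def arcgd_update(g, m, a, b, c, beta=0.9):
    """One elementwise ArcGD step on an EMA-filtered gradient.
    Returns (dx, updated filter state m); shapes follow g."""
    m = beta * m + (1.0 - beta) * g          # m_t = beta*m_{t-1} + (1-beta)*g_t
    T = m / np.sqrt(1.0 + m**2)              # per-coordinate T_i in (-1, 1)
    absT = np.abs(T)
    dx = -a * T - b * T * (1 - absT) - c * np.sign(T) * (1 - absT)
    return dx, m

# Steep, transitional, and vanishing coordinates in one vector:
g = np.array([1e4, 0.5, 1e-8])
dx, m = arcgd_update(g, np.zeros(3), a=0.01, b=0.001, c=0.0001)
print(dx)  # every |dx_i| <= a + b + c; the vanishing coordinate still moves ~ c
```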

2. Adaptive Dynamics and Angle-Based Learning Integration

An alternative ArcGD scheme defines learning-rate adaptation via the local geometry of gradient directions. Specifically, at each iterate the method:

  1. At x_t, computes the descent direction g_t = -\nabla f(x_t).
  2. Probes at x_t' = x_t - h\,p_t along a direction p_t orthogonal to g_t.
  3. Calculates the gradient at x_t', denoted g_t^\perp.
  4. Computes the angle \theta_t between g_t and g_t^\perp:

\theta_t = \arccos\left(\frac{g_t \cdot g_t^\perp}{\|g_t\|\,\|g_t^\perp\|}\right) + \epsilon.

  5. Sets d_t = h\cot(\theta_t) as the candidate step size.
  6. Smooths d_t with an exponential moving average to yield s_t.
  7. If d_t < s_t, applies the boosting rule d_t \leftarrow 2 d_t.
  8. Applies the final update:

x_{t+1} = x_t + d_t\,\frac{g_t}{\|g_t\|}.

This approach enables direct geometric sensitivity to local loss surface curvature, adapting per-step magnitudes without explicit gradient norm scaling (Mishra et al., 2023).
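The steps above can be sketched as a self-contained routine (the random Gram–Schmidt choice of p_t, the \epsilon placements, and all names are assumptions for illustration, not the paper's exact implementation):

```python
import numpy as np

def angle_arcgd_step(grad, x, h, s_prev, rho=0.9, eps=1e-8):
    """One angle-based ArcGD step: probe orthogonally to the descent
    direction, measure the angle between the two gradients, and step
    d_t = h*cot(theta_t) along the normalized descent direction.
    s_prev is the running EMA of candidate step sizes."""
    g = -grad(x)                                     # descent direction g_t
    r = np.random.default_rng(0).normal(size=x.shape)
    p = r - (r @ g) / (g @ g + eps) * g              # Gram-Schmidt: p orthogonal to g
    p = p / (np.linalg.norm(p) + eps)
    g_perp = -grad(x - h * p)                        # gradient at the probe point
    cos_t = (g @ g_perp) / (np.linalg.norm(g) * np.linalg.norm(g_perp) + eps)
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0)) + eps
    d = h / np.tan(theta)                            # candidate step h*cot(theta)
    s = rho * s_prev + (1.0 - rho) * d               # EMA smoothing -> s_t
    if d < s:
        d = 2.0 * d                                  # boosting rule
    return x + d * g / (np.linalg.norm(g) + eps), s

# On f(x) = ||x||^2 the probe angle encodes the curvature, and a
# single step lands essentially at the minimum:
x_new, s = angle_arcgd_step(lambda z: 2.0 * z, np.array([1.0, 0.0]), 0.1, 0.0)
print(x_new)
```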

3. Algorithmic Procedure and Variants

The following pseudocode summarizes n-dimensional ArcGD (phase-aware formulation):

import numpy as np

def arcgd(grad, x0, a, b, c, T_max, beta=0.9, use_momentum=False, eps=1e-12):
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for t in range(T_max):
        g = grad(x)
        if use_momentum:
            m = beta * m + (1 - beta) * g
            g_eff = m
        else:
            g_eff = g
        T = g_eff / np.sqrt(1.0 + g_eff**2)                  # elementwise T_i in (-1, 1)
        eta_mid = b * (1.0 - np.abs(T)) * T                  # transition term
        c_adapt = np.minimum(c, c * np.abs(T) / (1.0 - np.abs(T) + eps))  # adaptive floor
        eta_low = c_adapt * np.sign(T) * (1.0 - np.abs(T))   # floor term
        x = x - a * T - eta_mid - eta_low
    return x

Setting b = 0 transforms ArcGD into a variant matching the Lion optimizer’s structure, explicitly relating ArcGD’s update decomposition to sign-based adaptive methods (Verma et al., 7 Dec 2025).
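With the fixed (non-adaptive) floor, the b = 0, a = c = \gamma case collapses algebraically to a pure sign update, since -\gamma T - \gamma\,\mathrm{sign}(T)(1-|T|) = -\gamma\,\mathrm{sign}(T)(|T| + 1 - |T|) = -\gamma\,\mathrm{sign}(g). A quick numeric check (function name illustrative):

```python
import numpy as np

def arcgd_dx(g, a, b, c):
    """Elementwise phase-aware ArcGD update for a gradient vector g."""
    T = g / np.sqrt(1.0 + g**2)
    absT = np.abs(T)
    return -a * T - b * T * (1 - absT) - c * np.sign(T) * (1 - absT)

gamma = 0.01
g = np.array([3.0, -0.2, 1e-7])
print(arcgd_dx(g, a=gamma, b=0.0, c=gamma))  # == -gamma * sign(g) elementwise
```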

4. Theoretical Guarantees and Properties

In angle-based ArcGD, convergence is proven under standard convexity and smoothness conditions. If f is convex and differentiable and \nabla f is L-Lipschitz, then for any probe step h_t satisfying

h_t \leq \frac{\|\nabla f(x_t)\|}{L\,\cot\theta_t},

each iteration yields

f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\,\|\nabla f(x_t)\|^2,

establishing monotonic decrease and satisfaction of the Armijo (first Wolfe) sufficient-decrease condition with c_1 = 1/(2L). This supports O(1/\epsilon) convergence to an \epsilon-optimal solution under convexity (Mishra et al., 2023). For the bounded-update, phase-aware ArcGD, formal global non-convex convergence proofs remain open, but the explicit magnitude control and minimum-step enforcement aim to ensure stability in classic high-curvature pitfalls.
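The per-iteration inequality follows the standard descent-lemma pattern; a sketch, assuming the realized step is exactly -\tfrac{1}{L}\nabla f(x_t) (which the bound on h_t is designed to enforce):

```latex
\begin{aligned}
f(x_{t+1}) &\le f(x_t) + \nabla f(x_t)^\top (x_{t+1}-x_t) + \tfrac{L}{2}\,\|x_{t+1}-x_t\|^2
  && \text{($L$-smoothness)} \\
&= f(x_t) - \tfrac{1}{L}\,\|\nabla f(x_t)\|^2 + \tfrac{1}{2L}\,\|\nabla f(x_t)\|^2
  && \text{with } x_{t+1}-x_t = -\tfrac{1}{L}\,\nabla f(x_t) \\
&= f(x_t) - \tfrac{1}{2L}\,\|\nabla f(x_t)\|^2 .
\end{aligned}
```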

5. Empirical Performance Across Domains

ArcGD has been comprehensively evaluated across geometric and practical benchmarks:

  • Rosenbrock function (up to 50,000 dimensions): On the stochastic Rosenbrock with high curvature and extreme ill-conditioning, ArcGD consistently converged, outperforming Adam in all tested high-dimensional regimes for both matched and default learning rates. At n = 1000, ArcGD averaged 9,197 iterations (distance 1.11×10⁻⁶); Adam required 15,658 iterations (distance 2.71×10⁻⁴). Adam failed completely at 50,000D in some configurations, while ArcGD remained robust (Verma et al., 7 Dec 2025).
  • Image classification (CIFAR-10, CIFAR-100, mini-ImageNet): Across diverse MLPs (depth 1–5, up to 5.5 million parameters) and standard CNNs (ResNet, DenseNet, VGG, EfficientNet), ArcGD delivered superior or competitive test accuracy. On CIFAR-10 at 20,000 iterations, ArcGD achieved 50.7% test accuracy vs. Adam’s 46.8%, AdamW’s 46.6%, SGD’s 49.6%, and Lion’s 43.3%. ArcGD won or tied in 6 of 8 MLP architectures, continuing to improve where others regressed with extended training, thus showing intrinsic resistance to overfitting and robustness under long-horizon optimization (Verma et al., 7 Dec 2025, Mishra et al., 2023).
  • Empirical stability: On deep ResNet/DenseNet/EfficientNet, the angle-based variant achieved the highest top-1 accuracy in all early training windows and maintained or improved performance in long epochs, even as Adam-based methods plateaued.

6. Critical Strengths, Limitations, and Use Guidance

Strengths:

  • Explicit elementwise upper bounds (a) and minimum floors (c) protect against divergent or stagnant updates across all phases of the optimization landscape.
  • Phase-aware decomposition, via the parameters (a, b, c), accommodates both aggressive search and conservative convergence, with direct user control.
  • Empirical superiority in both synthetic and real-world, high-dimensional, and highly non-convex scenarios.

Limitations:

  • Introduces additional hyperparameters beyond classical schemes; careful selection of (a, b, c) is necessary for optimal performance.
  • Per-iteration computational cost is increased due to elementwise terms and, in angle-based variants, the requirement of two gradient computations per step.
  • No global non-convex convergence guarantees are currently available; theoretical analysis of such regimes remains future work (Verma et al., 7 Dec 2025).

Recommended settings: (a, b, c) = (0.01, 0.001, 0.0001) and \beta = 0.9 serve as robust defaults. For speed-critical or Lion-style use, set b = 0. Adaptive c is advised when tackling extremely vanishing gradients.
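As an end-to-end illustration of these defaults, here is a standalone sketch (the helper name `arcgd_minimize` is ours; momentum filtering with the fixed floor is used for brevity) applied to the 2-D Rosenbrock function from the benchmarks above:

```python
import numpy as np

def arcgd_minimize(grad, x0, a=0.01, b=0.001, c=0.0001, beta=0.9, steps=20000):
    """Run phase-aware ArcGD with the recommended default (a, b, c, beta)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        m = beta * m + (1.0 - beta) * grad(x)    # EMA-filtered gradient
        T = m / np.sqrt(1.0 + m**2)
        x += -a * T - b * T * (1 - np.abs(T)) - c * np.sign(T) * (1 - np.abs(T))
    return x

# 2-D Rosenbrock gradient; the global minimum sits at (1, 1):
rosen_grad = lambda z: np.array([
    -2.0 * (1.0 - z[0]) - 400.0 * z[0] * (z[1] - z[0]**2),
    200.0 * (z[1] - z[0]**2),
])
print(arcgd_minimize(rosen_grad, [-1.2, 1.0]))
```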

ArcGD generalizes classical gradient descent and can specialize to a Lion-like optimizer under specific parameter constraints (b = 0, a = c = \gamma \ll 1). The connection arises by interpreting the signed, elementwise-normalized update as sign-based momentum with a linear bias, yielding

\Delta x = -\gamma\,\mathrm{sign}(\mathrm{EMA}(g)) + \gamma\,\mathrm{EMA}(g)

in the limit. Such relations elucidate ArcGD’s applicability as a parent class of controlled, phase-aware optimizers with tunable adaptivity. The potential for momentum and noise-robust variants, as well as further integration with curvature-sensitive adaptive schemes, suggests a broad and flexible optimization landscape (Verma et al., 7 Dec 2025).


Arc Gradient Descent represents a phase-aware, geometrically grounded reformulation of standard gradient methods. Its capacity to interpolate between safe, bounded step dynamics and aggressive, minimum-progress regimes, with both theoretical and empirical support, positions it as a significant and extensible contribution to modern optimization for deep learning and complex numerical landscapes (Verma et al., 7 Dec 2025, Mishra et al., 2023).
