Arc Gradient Descent (ArcGD) Optimization
- Arc Gradient Descent is a phase-aware optimization framework that leverages arc-length geometry to compute bounded and controlled gradient updates.
- It adjusts update dynamics using saturation, transition, and floor terms to mitigate exploding and vanishing gradients in high-dimensional settings.
- Empirical tests on benchmarks like the Rosenbrock function and CIFAR image classification demonstrate its robust convergence and superior stability compared to traditional optimizers.
Arc Gradient Descent (ArcGD) is an optimization framework formulated to provide mathematically controlled, phase-aware step sizes in gradient-based minimization. Its key innovations include explicit elementwise update bounds and user-tunable step-size dynamics that address both exploding and vanishing gradients, thereby improving stability and generalization for highly non-convex and high-dimensional objectives, particularly in deep learning and geometric stress-test settings (Verma et al., 7 Dec 2025, Mishra et al., 2023).
1. Mathematical Foundations and Update Rule
ArcGD defines the step update by considering the arc-length geometry of the objective function. In scalar form, let $f:\mathbb{R}\to\mathbb{R}$ be a differentiable function with gradient $g = f'(x)$. The method utilizes the arc-length approximation
$$ds = \sqrt{1 + g^2}\,dx,$$
leading to the fundamental normalised gradient
$$T = \frac{g}{\sqrt{1 + g^2}}, \qquad |T| < 1.$$
The most elementary update is then
$$x_{t+1} = x_t - a\,T_t,$$
where $a > 0$ is a positive ceiling, enforcing $|x_{t+1} - x_t| < a$.
To robustly traverse distinct optimization regimes (“phases”), ArcGD introduces two additional terms: a transition coefficient $b$ and a floor $c$,
$$\Delta x_t = -a\,T_t \;-\; b\,(1 - |T_t|)\,T_t \;-\; c_{\mathrm{adapt}}\,\mathrm{sign}(T_t)\,(1 - |T_t|), \qquad c_{\mathrm{adapt}} = \min\!\left(c,\; \frac{c\,|T_t|}{1 - |T_t|}\right).$$
Consequently:
- The saturation term $-a\,T_t$ dominates for $|T_t| \to 1$, enforcing upper-bounded steps in steep, unstable regions.
- The transition term $-b\,(1 - |T_t|)\,T_t$ smooths the step-size in intermediate-gradient regions, giving more linear “GD-like” behavior.
- The floor term $-c_{\mathrm{adapt}}\,\mathrm{sign}(T_t)\,(1 - |T_t|)$ guarantees nonzero progress even when gradients largely vanish.
By parametric control of $(a, b, c)$, users directly govern the ceiling, transition acceleration, and minimum update. These terms are interpreted as controlling “high”, “transition”, and “vanishing” gradient phases, respectively. The composite coefficient $a + b\,(1 - |T_t|)$ multiplying $T_t$ acts as the effective learning rate (Verma et al., 7 Dec 2025).
For vector-valued objectives and high-dimensional $x \in \mathbb{R}^n$, the update is applied elementwise, with $T_i = g_i / \sqrt{1 + g_i^2}$ for each coordinate $i$. Optionally, a moving-average filtered gradient $m_t = \beta m_{t-1} + (1 - \beta)\,g_t$ may be used in place of $g_t$ for robustness in noisy settings.
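A scalar sketch makes the three phases concrete. The coefficient values below are illustrative placeholders, not the paper's reported defaults:

```python
import numpy as np

a, b, c = 0.01, 0.005, 1e-4   # placeholder ceiling, transition, and floor values

def step_magnitude(g):
    """|dx| of the phase-aware update for a scalar gradient g."""
    T = g / np.sqrt(1.0 + g**2)                # normalised gradient, |T| < 1
    absT = abs(T)
    c_adapt = min(c, c * absT / max(1.0 - absT, 1e-12))
    return abs(-a * T - b * (1 - absT) * T - c_adapt * np.sign(T) * (1 - absT))

# Step size across the three gradient phases:
steep = step_magnitude(1e6)    # saturation phase: magnitude bounded near a
mid   = step_magnitude(1.0)    # transition phase: between the extremes
flat  = step_magnitude(1e-6)   # vanishing phase: small but strictly nonzero
```

The steep-phase step saturates near the ceiling $a$ regardless of how large the gradient grows, while the vanishing-phase step remains nonzero, which is the behavior the three terms are designed to produce.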
2. Adaptive Dynamics and Angle-Based Learning Integration
An alternative ArcGD scheme defines learning-rate adaptation via the local geometry of gradient directions. Specifically, the method performs:
- At $x_t$, computes the descent direction $d_t = -\nabla f(x_t)$.
- Probes at $x'_t = x_t + \delta\,u_t$, along a unit direction $u_t$ orthogonal to $d_t$, with small probe-step $\delta$.
- Calculates the gradient at $x'_t$ as $g'_t = \nabla f(x'_t)$.
- Computes the angle $\theta_t$ between $\nabla f(x_t)$ and $g'_t$:
$$\theta_t = \arccos\!\left(\frac{\nabla f(x_t)\cdot g'_t}{\|\nabla f(x_t)\|\,\|g'_t\|}\right).$$
- Derives a candidate step-size $\hat{\eta}_t$ from $\theta_t$.
- Smooths $\hat{\eta}_t$ with an exponential moving average to yield $\eta_t$.
- When a boosting condition is triggered, applies a boosting rule to $\eta_t$.
- Final update: $x_{t+1} = x_t + \eta_t\,d_t = x_t - \eta_t\,\nabla f(x_t)$.
This approach enables direct geometric sensitivity to local loss surface curvature, adapting per-step magnitudes without explicit gradient norm scaling (Mishra et al., 2023).
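The angle computation at the core of this scheme can be sketched as follows. The probe-step `delta` and the Gram-Schmidt construction of the orthogonal direction are illustrative assumptions; the paper's mapping from angle to step-size is not reproduced here:

```python
import numpy as np

def grad_angle(grad, x, delta=1e-3):
    """Angle between the gradient at x and at a probe point displaced
    orthogonally to the descent direction (assumes dimension >= 2)."""
    g = grad(x)
    d = -g                                    # descent direction
    # Build a unit vector orthogonal to d via one Gram-Schmidt step.
    u = np.zeros_like(d)
    u[int(np.argmin(np.abs(d)))] = 1.0
    u = u - (u @ d) / (d @ d) * d
    u /= np.linalg.norm(u)
    g_probe = grad(x + delta * u)             # gradient at the probe point
    cos_t = np.clip(g @ g_probe / (np.linalg.norm(g) * np.linalg.norm(g_probe)),
                    -1.0, 1.0)
    return np.arccos(cos_t)

# On the spherical quadratic f(x) = 0.5 * ||x||^2, grad is the identity map,
# so the measured angle should be roughly arctan(delta) ~ delta.
theta = grad_angle(lambda x: x, np.array([1.0, 0.0]))
```

On flat, isotropic regions the probe gradient barely rotates and the angle is small; in high-curvature regions it rotates more, which is the geometric signal this variant converts into a step-size.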
3. Algorithmic Procedure and Variants
The following pseudocode summarizes n-dimensional ArcGD (phase-aware formulation):

```
x = x0
if use_momentum:
    m = 0
for t in range(T_max):
    g = grad(f, x)
    if use_momentum:
        m = beta * m + (1 - beta) * g   # exponential moving average of the gradient
        g_eff = m
    else:
        g_eff = g
    for i in range(n):
        T_i = g_eff[i] / sqrt(1 + g_eff[i]**2)            # normalised gradient, |T_i| < 1
        eta_mid = b * (1 - abs(T_i)) * T_i                # transition term
        c_adapt = min(c, c * abs(T_i) / (1 - abs(T_i)))   # adaptive floor coefficient
        eta_low = c_adapt * sign(T_i) * (1 - abs(T_i))    # floor term
        dx[i] = -a * T_i - eta_mid - eta_low
    x = x + dx
return x
```
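The pseudocode can be made runnable in NumPy; the hyperparameter values below are illustrative placeholders, and a simple quadratic bowl stands in for a real objective:

```python
import numpy as np

def arcgd(grad, x0, a=0.01, b=0.005, c=1e-4, beta=0.9,
          T_max=5000, use_momentum=True):
    """Vectorized, runnable version of the phase-aware ArcGD loop
    (placeholder defaults, not the paper's tuned values)."""
    x = x0.astype(float).copy()
    m = np.zeros_like(x)
    for _ in range(T_max):
        g = grad(x)
        if use_momentum:
            m = beta * m + (1 - beta) * g     # filtered gradient
            g_eff = m
        else:
            g_eff = g
        T = g_eff / np.sqrt(1.0 + g_eff**2)   # elementwise normalised gradient
        absT = np.abs(T)
        c_adapt = np.minimum(c, c * absT / np.maximum(1.0 - absT, 1e-12))
        dx = -a * T - b * (1 - absT) * T - c_adapt * np.sign(T) * (1 - absT)
        x = x + dx
    return x

# Quadratic bowl f(x) = 0.5 * ||x - 3||^2 with minimum at (3, 3).
x_star = arcgd(lambda x: x - 3.0, np.array([0.0, 0.0]))
```

Note the `np.maximum(..., 1e-12)` guard: at saturation $|T_i| \to 1$ the adaptive-floor denominator vanishes, so a runnable implementation must clamp it.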
4. Theoretical Guarantees and Properties
In angle-based ArcGD, convergence is proven under standard convexity and smoothness conditions. If $f$ is convex and differentiable and $\nabla f$ is $L$-Lipschitz, then for any step-size satisfying
$$0 < \eta_t \le \frac{1}{L},$$
each iteration yields
$$f(x_{t+1}) \le f(x_t) - \frac{\eta_t}{2}\,\|\nabla f(x_t)\|^2,$$
establishing monotonic decrease and satisfaction of the Armijo (first Wolfe) sufficient decrease condition with $c_1 = \tfrac{1}{2}$. This supports convergence to an $\epsilon$-optimal solution under convexity (Mishra et al., 2023). For the bounded-update, phase-aware ArcGD, formal global non-convex convergence proofs remain open, but the explicit magnitude control and minimum-step enforcement aim to ensure stability in classic high-curvature pitfalls.
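The sufficient-decrease inequality can be sanity-checked numerically on a smooth convex quadratic; this is a standard check of the descent lemma, not a reproduction of the paper's proof:

```python
import numpy as np

# f(x) = 0.5 * x^T A x with A symmetric positive definite; its gradient
# A x is L-Lipschitz with L equal to the largest eigenvalue of A.
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])
L = np.linalg.eigvalsh(A).max()

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, -2.0])
eta = 1.0 / L                              # step-size inside the guarantee
x_next = x - eta * grad(x)

decrease = f(x) - f(x_next)                # actual decrease this iteration
bound = 0.5 * eta * np.linalg.norm(grad(x))**2   # guaranteed decrease
```

For any step-size $\eta \le 1/L$ the actual decrease must meet or exceed the $(\eta/2)\|\nabla f\|^2$ bound, and it does here.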
5. Empirical Performance Across Domains
ArcGD has been comprehensively evaluated across geometric and practical benchmarks:
- Rosenbrock function (up to 50,000 dimensions): On the stochastic Rosenbrock with high curvature and extreme ill-conditioning, ArcGD consistently converged, outperforming Adam in all tested high-dimensional regimes for both matched and default learning rates. In one high-dimensional configuration, ArcGD averaged 9,197 iterations to convergence versus 15,658 for Adam. Adam failed completely at 50,000D in some configurations, while ArcGD remained robust (Verma et al., 7 Dec 2025).
- Image Classification (CIFAR-10, CIFAR-100, mini-ImageNet): Across diverse MLPs (depth 1–5, up to $5.5$ million parameters) and standard CNNs (ResNet, DenseNet, VGG, EfficientNet), ArcGD delivered superior or competitive test accuracy. On CIFAR-10 at 20,000 iterations, ArcGD's test accuracy exceeded that of Adam, AdamW, SGD, and Lion. ArcGD won or tied in 6 of 8 MLP architectures, continuing to improve where others regressed with extended training, thus showing intrinsic resistance to overfitting and robustness under long-horizon optimization (Verma et al., 7 Dec 2025, Mishra et al., 2023).
- Empirical stability: On deep ResNet/DenseNet/EfficientNet, the angle-based variant achieved the highest top-1 accuracy in all early training windows and maintained or improved performance in long epochs, even as Adam-based methods plateaued.
6. Critical Strengths, Limitations, and Use Guidance
Strengths:
- Explicit elementwise upper bounds (via the ceiling $a$) and minimum floors (via $c$) protect against divergent or stagnant updates across all phases of the optimization landscape.
- Phase-aware decomposition, via the parameters $(a, b, c)$, accommodates both aggressive search and conservative convergence, with direct user control.
- Empirical superiority in both synthetic and real-world, high-dimensional, and highly non-convex scenarios.
Limitations:
- Introduces additional hyperparameters beyond classical schemes; careful selection of $(a, b, c)$ is necessary for optimal performance.
- Per-iteration computational cost is increased due to elementwise terms and, in angle-based variants, the requirement of two gradient computations per step.
- No global, non-convex theoretical convergence guarantees currently available; theoretical analysis of such regimes remains as future work (Verma et al., 7 Dec 2025).
Recommended settings: the authors' reported defaults for $(a, b, c)$ serve as robust starting points. For speed-critical or Lion-style use, a configuration dominated by the saturation term is appropriate. The adaptive floor $c_{\mathrm{adapt}}$ is advised when tackling extremely vanishing gradients.
7. Connections to Related Optimizers and Extensions
ArcGD generalizes classical gradient descent and can specialize to a Lion-like optimizer in the saturated regime. The connection arises by interpreting the signed, elementwise-normalized update as sign-based momentum with linear bias: since $T_i \to \mathrm{sign}(g_i)$ as $|g_i| \to \infty$, the transition and floor terms vanish and the update reduces to
$$\Delta x_i \approx -a\,\mathrm{sign}(g_i)$$
in the large-gradient limit. Such relations elucidate ArcGD's applicability as a parent class of controlled, phase-aware optimizers with tunable adaptivity. The potential for momentum and noise-robust variants, as well as further integration with curvature-sensitive adaptive schemes, suggests a broad and flexible optimization landscape (Verma et al., 7 Dec 2025).
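This sign-based limit follows directly from the update rule and can be checked numerically; the coefficients below are placeholders:

```python
import numpy as np

a, b, c = 0.01, 0.005, 1e-4   # placeholder values, not the paper's defaults

def arcgd_dx(g):
    """Elementwise phase-aware ArcGD step for a gradient vector g."""
    T = g / np.sqrt(1.0 + g**2)
    absT = np.abs(T)
    c_adapt = np.minimum(c, c * absT / np.maximum(1.0 - absT, 1e-12))
    return -a * T - b * (1 - absT) * T - c_adapt * np.sign(T) * (1 - absT)

# For very large gradient magnitudes, |T| -> 1, the transition and floor
# terms vanish, and the step collapses to -a * sign(g) elementwise.
g = np.array([1e8, -1e8])
dx = arcgd_dx(g)
```

In this regime the update carries no magnitude information at all, only the gradient's sign scaled by the ceiling $a$, which is exactly the structure of a Lion-style sign update.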
Arc Gradient Descent represents a phase-aware, geometrically grounded reformulation of standard gradient methods. Its capacity to interpolate between safe, bounded step dynamics and guaranteed minimum-progress regimes, with demonstrated theoretical and empirical robustness, positions it as a significant and extensible contribution to modern optimization for deep learning and complex numerical landscapes (Verma et al., 7 Dec 2025, Mishra et al., 2023).