Adaptive Sharpness Surrogate Gradient (ASSG)
- Adaptive Sharpness Surrogate Gradient (ASSG) is a technique that dynamically modulates the surrogate gradient's sharpness to better approximate true gradients in models with nondifferentiable activations.
- It adjusts the sharpness parameter using local input statistics and second-order information, ensuring effective gradient support and mitigating gradient vanishing in spiking neural networks.
- Empirical results demonstrate that ASSG boosts adversarial attack success and training efficiency in both spiking neural networks and sharpness-aware minimization schemes.
Adaptive Sharpness Surrogate Gradient (ASSG) is a family of techniques that dynamically modulate the "sharpness" of surrogate gradients to faithfully approximate gradients for learning and adversarial optimization in models involving nondifferentiable operations or objectives. ASSG has been introduced and systematically studied in the context of spiking neural networks (SNNs) to overcome gradient vanishing associated with piecewise constant or discontinuous spike activations, as well as within sharpness-aware minimization schemes in standard deep networks to improve computational efficiency and accuracy. The core principle is the continuous adaptation of surrogate gradient parameters according to local input statistics, network dynamics, or second-order geometry, thus maintaining backpropagation fidelity and efficient optimization.
1. Mathematical Formulation and Motivation
In SNNs, neuron activation is typically driven by the Heaviside step function $\Theta(u - \vartheta)$, which is discontinuous at the threshold $\vartheta$ and constant everywhere else. Direct differentiation yields the zero function almost everywhere, precluding gradient-based training or analysis. Surrogate gradient (SG) methods address this by substituting $\Theta'$ with a smooth function $h_k$ on the backward pass. Standard choices for $h_k$ include rectangular, exponential, or arctangent profiles, parameterized by a “sharpness” coefficient $k$ governing the effective support; the arctangent surrogate, for example, is
$$h_k(u) = \frac{k/2}{1 + \left(\frac{\pi k}{2}(u - \vartheta)\right)^{2}}.$$
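As a concrete illustration, the rectangular and arctangent surrogate derivatives can be written in a few lines of NumPy (function names and default parameters here are illustrative, not taken from the papers):

```python
import numpy as np

def rect_sg(u, theta=1.0, k=4.0):
    """Rectangular surrogate derivative: height k/2 on the band |u - theta| < 1/k."""
    return np.where(np.abs(u - theta) < 1.0 / k, k / 2.0, 0.0)

def arctan_sg(u, theta=1.0, k=4.0):
    """Arctangent surrogate derivative: d/du [(1/pi) arctan(pi*k*(u-theta)/2) + 1/2]."""
    return (k / 2.0) / (1.0 + (np.pi * k * (u - theta) / 2.0) ** 2)
```

Both profiles integrate to 1 over the real line and peak at $k/2$ at the threshold, so increasing $k$ concentrates the same gradient mass into a narrower band.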
A fundamental tension arises: a sharp $h_k$ (large $k$) approximates the step well but yields vanishing gradients outside a narrow band around the threshold; conversely, a flat $h_k$ (small $k$) provides a poor approximation of the step but maintains more nonzero gradients. The gradient-vanishing degree quantifies the fraction of total gradient mass within a typical deviation $\delta$ of the threshold. If membrane potentials stray beyond the surrogate’s effective width, nearly all gradients vanish, stalling optimization (Wang et al., 27 Dec 2025).
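The trade-off can be made tangible with a small NumPy experiment. Assuming, purely for illustration, Gaussian-distributed membrane potentials around the threshold, the fraction of neurons retaining a nonzero rectangular-surrogate gradient shrinks rapidly as the sharpness $k$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0
# Illustrative assumption: membrane potentials fluctuate around the threshold.
u = rng.normal(loc=theta, scale=0.5, size=100_000)

def alive_fraction(k):
    """Fraction of neurons inside a rectangular surrogate's support |u - theta| < 1/k."""
    return np.mean(np.abs(u - theta) < 1.0 / k)

# Sharper surrogates leave fewer neurons with any gradient signal at all.
fractions = {k: alive_fraction(k) for k in (1.0, 4.0, 16.0)}
```

Under this toy distribution, roughly 95% of neurons receive gradient at $k=1$ but only about 10% at $k=16$, which is exactly the vanishing regime ASSG is designed to avoid.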
ASSG addresses this by dynamically adapting the sharpness parameter in space, time, and across layers, tracking the evolution of the local pre-activation distribution or higher-order learning signals. In standard deep networks, related adaptive surrogate approaches decompose sharpness-aware minimization (SAM) gradients to selectively compute and reuse second-order components, implementing adaptivity in gradient sampling and mixing (Deng et al., 4 Oct 2025).
2. Algorithmic Construction in Spiking Neural Networks
ASSG for SNNs efficiently controls the trade-off between gradient support width and fidelity by matching the surrogate shape to the empirical distribution of membrane potential deviations $|u_i - \vartheta|$ for each neuron $i$. The procedure involves the following adaptive update at each attack or training iteration:
- Statistics Collection: Maintain an exponential moving average (EMA) of the absolute deviations, of the form
$$\bar d_i \leftarrow \beta\,\bar d_i + (1-\beta)\,\lvert u_i - \vartheta \rvert.$$
Optionally, estimate a relaxation term based on the deviation from the threshold $\vartheta$.
- Adaptive Sharpness Calculation: Set $k_i$ so that the surrogate's effective support tracks the observed deviations (e.g. $k_i \propto \varepsilon / \bar d_i$), where $\varepsilon$ is a user-chosen upper bound on the expected gradient-vanishing degree.
- Surrogate Profile Parameterization: Evaluate the chosen surrogate shape (rectangular, exponential, or arctangent) with the per-neuron sharpness $k_i$ on the backward pass.
This approach keeps the gradient-vanishing degree at or below $\varepsilon$, maximizing sharpness without inducing vanishing gradients for the in-distribution range of membrane potential deviations (Wang et al., 27 Dec 2025).
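A minimal sketch of this adaptive loop, with an illustrative width rule $1/k_i = \bar d_i / \varepsilon$ standing in for the paper's exact calculation (class name, initialization, and defaults are hypothetical):

```python
import numpy as np

class ASSGRect:
    """Sketch of an adaptive rectangular surrogate (illustrative, not the paper's exact rule).

    Tracks a per-neuron EMA of |u - theta| and widens or narrows the surrogate so its
    support covers the in-distribution deviations, given a tolerated vanishing degree eps.
    """

    def __init__(self, n, theta=1.0, beta=0.9, eps=0.1):
        self.theta, self.beta, self.eps = theta, beta, eps
        self.d_ema = np.full(n, 0.5)  # EMA of absolute deviations; arbitrary init

    def step(self, u):
        # Statistics collection: EMA of the absolute deviation from threshold.
        self.d_ema = self.beta * self.d_ema + (1 - self.beta) * np.abs(u - self.theta)
        # Adaptive sharpness: half-width 1/k grows with the tracked deviation,
        # scaled by 1/eps so a stricter vanishing bound gives a wider support.
        k = self.eps / np.maximum(self.d_ema, 1e-8)
        # Surrogate profile: rectangular kernel of height k/2 on |u - theta| < 1/k.
        return np.where(np.abs(u - self.theta) < 1.0 / k, k / 2.0, 0.0)
```

Neurons whose potentials historically sit close to threshold get a sharper (taller, narrower) surrogate; neurons that wander far get a wider one, so none drop out of the gradient flow.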
3. Integration into SNN Adversarial Attack and Training
ASSG is integrated into adversarial attack generation using the Stable Adaptive Projected Gradient Descent (SA-PGD) scheme, which incorporates:
- Adaptive surrogate sharpness as described above,
- Momentum updates for the adversarial step, combining norm-normalized gradients with a tracked variance,
- Per-step clipping and projection to maintain attack budget,
- Automatic adjustment of update step sizes, leveraging the stabilized gradients produced by ASSG.
This framework enables robust and efficient maximization of adversarial loss, exposing vulnerabilities that standard fixed-surrogate attacks miss. In standard SNN training, variants of ASSG (e.g., MPD-AGL) use layer- and timestep-adaptive kernel widths proportional to the standard deviation of the empirical membrane potential distribution, maintaining a consistent overlap with gradient-support regions and preventing both vanishing and exploding gradients (Jiang et al., 17 May 2025).
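The attack loop can be sketched as follows; this is a generic momentum-PGD skeleton under an $\ell_\infty$ budget, not the exact SA-PGD algorithm, and `grad_fn` is assumed to return the ASSG-based loss gradient:

```python
import numpy as np

def sa_pgd_sketch(x, grad_fn, eps=8 / 255, alpha=2 / 255, steps=10, mu=0.9):
    """Momentum PGD sketch: normalized gradient ascent with momentum,
    per-step projection onto the l_inf ball of radius eps around x."""
    x_adv = x.copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x_adv)                             # surrogate-based loss gradient (assumed)
        m = mu * m + g / (np.mean(np.abs(g)) + 1e-12)  # momentum on normalized gradients
        x_adv = x_adv + alpha * np.sign(m)             # ascent step on the adversarial loss
        x_adv = np.clip(x_adv, x - eps, x + eps)       # project onto the attack budget
        x_adv = np.clip(x_adv, 0.0, 1.0)               # keep a valid input range
    return x_adv
```

The stabilized ASSG gradients are what make the normalization and momentum terms meaningful here: with a fixed sharp surrogate, `grad_fn` would frequently return near-zero gradients and the loop would stall.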
4. Comparative Methodological Properties
The table below summarizes key differences between fixed and adaptive surrogate gradients in SNNs:
| Scheme | Surrogate Parameter | Adaptivity Target | Main Benefit |
|---|---|---|---|
| Fixed SG | $k$, constant | None | Simplicity, but susceptible to gradient vanishing |
| Adaptive SG (ASSG) | $k_i$, per neuron/layer/timestep | EMA/MPD statistics | Maintains gradient support and fidelity, mitigates vanishing |
ASSG generalizes across neuron models (LIF-2, IF, PSN), network depths, and attack or training regimes, outperforming fixed surrogates in both attack success rate and effective optimization (Wang et al., 27 Dec 2025, Jiang et al., 17 May 2025).
5. ASSG in Sharpness-Aware Minimization and Efficient Deep Learning
In the context of Sharpness-Aware Minimization (SAM), ASSG enables computationally efficient surrogates for the expensive double-pass gradient calculations required by vanilla SAM. The SAM update is formally decomposed into the usual SGD gradient and a "Projection of the Second-order component onto the First-order gradient" (PSF):
$$g_{\mathrm{SAM}} \approx g_{\mathrm{SGD}} + g_{\mathrm{PSF}}, \qquad g_{\mathrm{SGD}} = \nabla L(w), \quad g_{\mathrm{PSF}} \approx \rho\,\nabla^{2} L(w)\,\frac{\nabla L(w)}{\lVert \nabla L(w) \rVert}.$$
ASSG in ARSAM adaptively samples and reuses the PSF, modulating the frequency of true PSF computations via an autoregressive update on their relative magnitude and temporal smoothness. When not resampling, the previously cached PSF is mixed with the current SGD gradient, substantially reducing runtime without compromising generalization performance on standard benchmarks (Deng et al., 4 Oct 2025).
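A toy sketch of PSF reuse, with a fixed refresh period standing in for ARSAM's autoregressive schedule (all names are illustrative):

```python
import numpy as np

def psf_reuse_step(w, grad, rho=0.05, cache={"psf": None, "t": 0}, period=5):
    """One update direction with PSF reuse. `grad(w)` returns the loss gradient at w.
    The mutable-default cache carries state across calls, purely for this sketch."""
    g = grad(w)
    if cache["psf"] is None or cache["t"] % period == 0:
        # Full SAM step: perturb along the normalized gradient, re-evaluate the
        # gradient there, and cache the second-order component PSF = g_SAM - g_SGD.
        g_sam = grad(w + rho * g / (np.linalg.norm(g) + 1e-12))
        cache["psf"] = g_sam - g
    cache["t"] += 1
    return g + cache["psf"]  # reuse the cached PSF between refreshes
```

Between refreshes this costs one gradient evaluation instead of SAM's two, which is the source of the runtime savings; ARSAM additionally adapts `period`-like behavior on the fly rather than fixing it.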
6. Empirical Evaluation and Impact
ASSG-based methods have demonstrated the following empirical outcomes:
- In SNN adversarial evaluation: On CIFAR-10 under adversarial training, APGD+ASSG achieves a substantially higher attack success rate (ASR) than standard APGD+STBP, and SA-PGD+ASSG reaches the highest ASR; the same ordering holds on neuromorphic CIFAR10-DVS, where SA-PGD+ASSG attains a markedly higher ASR than STBP. In training, ASSG increases the fraction of gradient-available neurons relative to fixed surrogates, improves classification accuracy at low latency on CIFAR-10, and yields energy savings (Wang et al., 27 Dec 2025, Jiang et al., 17 May 2025).
- In sharpness-aware deep learning: ARSAM equipped with ASSG delivers test accuracy within $0.1$ pt of full SAM on CIFAR-100 while performing only a fraction of the expensive double-pass SAM computations and achieving correspondingly higher throughput. Adaptive reuse of the PSF enhances speedup and stability with negligible accuracy degradation (Deng et al., 4 Oct 2025).
7. Theoretical and Practical Implications
ASSG establishes that adaptive surrogate parameterization—conditioned on local feature statistics or second-order geometry—is essential to avoid overestimating robustness or underutilizing optimization paths in modern neural architectures. In SNNs, ASSG demonstrates that previously measured robustness against gradient-based attacks was largely influenced by the choice of surrogate, rather than any intrinsic property of spiking computation. A plausible implication is that robust SNN training, as well as certified robustness techniques, should integrate adaptive surrogates to ensure valid worst-case evaluations (Wang et al., 27 Dec 2025).
Future research avenues include integrating ASSG (and related MPD-AGL approaches) directly into adversarial training protocols, exploring their synergy with certified robustness and randomized smoothing, and developing low-overhead hardware approximations for neuromorphic deployment. In sharpness-aware minimization, autoregressive and selectively-updated surrogate strategies (as in ARSAM) provide a template for scalable, efficient training at scale without sacrificing the flat-minimum preference of modern optimization objectives (Deng et al., 4 Oct 2025).