
Adaptive Sharpness Surrogate Gradient (ASSG)

Updated 3 January 2026
  • Adaptive Sharpness Surrogate Gradient (ASSG) is a technique that dynamically modulates the surrogate gradient's sharpness to better approximate true gradients in models with nondifferentiable activations.
  • It adjusts the sharpness parameter using local input statistics and second-order information, ensuring effective gradient support and mitigating gradient vanishing in spiking neural networks.
  • Empirical results demonstrate that ASSG boosts adversarial attack success and training efficiency in both spiking neural networks and sharpness-aware minimization schemes.

Adaptive Sharpness Surrogate Gradient (ASSG) is a family of techniques that dynamically modulate the "sharpness" of surrogate gradients to faithfully approximate gradients for learning and adversarial optimization in models involving nondifferentiable operations or objectives. ASSG has been introduced and systematically studied in the context of spiking neural networks (SNNs) to overcome gradient vanishing associated with piecewise constant or discontinuous spike activations, as well as within sharpness-aware minimization schemes in standard deep networks to improve computational efficiency and accuracy. The core principle is the continuous adaptation of surrogate gradient parameters according to local input statistics, network dynamics, or second-order geometry, thus maintaining backpropagation fidelity and efficient optimization.

1. Mathematical Formulation and Motivation

In SNNs, neuron activation is typically driven by the Heaviside step function $H(u) = \mathbf{1}_{u \ge 0}$, which is discontinuous at the threshold and whose derivative is zero almost everywhere. Direct differentiation therefore yields the zero function almost everywhere, precluding gradient-based training or analysis. Surrogate gradient (SG) methods address this by substituting $\partial H/\partial u$ with a smooth function $g(u)$. Standard choices for $g(u)$ include rectangular, exponential, or arctangent functions, parameterized by a "sharpness" coefficient $\alpha$ governing the effective support:

$$g(u;\alpha) = \frac{\alpha}{2\left[1 + ((\pi/2)\,\alpha u)^2\right]}$$

A fundamental tension arises: a sharp $g(u)$ approximates the step well but yields vanishing gradients outside a narrow band around the threshold; conversely, a flat $g(u)$ provides a poor approximation but maintains more nonzero gradients. The gradient-vanishing degree $G(x) = \int_{-|x|}^{|x|} g(t)\,dt$ quantifies the fraction of total gradient mass within a typical deviation $x$. If membrane potentials stray beyond the surrogate's effective width, nearly all gradients vanish, stalling optimization (Wang et al., 27 Dec 2025).
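As a concrete illustration (not taken from the paper), the arctangent surrogate above integrates in closed form, so the gradient-vanishing degree $G(x)$ can be evaluated directly; the sketch below contrasts a sharp and a flat surrogate at the same deviation:

```python
import math

def g(u, alpha):
    """Arctangent-family surrogate gradient with sharpness alpha."""
    return alpha / (2.0 * (1.0 + ((math.pi / 2.0) * alpha * u) ** 2))

def vanishing_degree(x, alpha):
    """G(x): fraction of total gradient mass inside [-|x|, |x|].
    Uses the closed-form antiderivative (1/pi) * atan((pi/2) * alpha * u)."""
    return (2.0 / math.pi) * math.atan((math.pi / 2.0) * alpha * abs(x))

# A sharp surrogate exhausts nearly all its mass inside |u| <= 1, so gradients
# at larger deviations have vanished; a flat surrogate spreads the mass out.
print(vanishing_degree(1.0, alpha=10.0))  # close to 1 (high vanishing degree)
print(vanishing_degree(1.0, alpha=0.1))   # small (gradient support still wide)
```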

ASSG addresses this by dynamically adapting the sharpness parameter $\alpha$ in space, time, and across layers, tracking the evolution of the local pre-activation distribution or higher-order learning signals. In standard deep networks, related adaptive surrogate approaches decompose sharpness-aware minimization (SAM) gradients to selectively compute and reuse second-order components, implementing adaptivity in gradient sampling and mixing (Deng et al., 4 Oct 2025).

2. Algorithmic Construction in Spiking Neural Networks

ASSG for SNNs efficiently controls the trade-off between gradient support width and fidelity by matching the surrogate shape to the empirical distribution of membrane potential deviations $u_{i,t}^l = V_i^l(t+1) - V_{\mathrm{th}}$ for each neuron $(i,t,l)$. The procedure involves the following adaptive update at each attack or training iteration:

  1. Statistics Collection: Maintain an exponential moving average (EMA) of the absolute deviations:

$$M_{i,t}^l(k) = \beta_1 M_{i,t}^l(k-1) + (1-\beta_1)\,|u_{i,t}^l(x_k)|$$

Optionally, estimate a relaxation term based on deviation from $M$:

$$D_{i,t}^l(k) = \beta_2 D_{i,t}^l(k-1) + (1-\beta_2)\,\big|\,|u_{i,t}^l(x_k)| - M_{i,t}^l(k)\,\big|$$

  2. Adaptive Sharpness Calculation:

$$\alpha_{i,t}^l(k) = \frac{2}{\pi\,[M_{i,t}^l(k) + \gamma D_{i,t}^l(k)]}\tan\!\left(\frac{\pi A}{2}\right)$$

where $A$ is a user-chosen upper bound on the expected gradient-vanishing degree.

  3. Surrogate Profile Parameterization:

$$g(u;\alpha_{i,t}^l) = \frac{\alpha_{i,t}^l}{2\left[1 + ((\pi/2)\,\alpha_{i,t}^l u)^2\right]}$$

This approach guarantees $\mathbb{E}[G(x)] \leq A$, maximizing sharpness without inducing vanishing gradients for the in-distribution range of membrane potential deviations (Wang et al., 27 Dec 2025).
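The three-step update above can be sketched for a single neuron and timestep as follows; the hyperparameter values ($\beta_1$, $\beta_2$, $\gamma$, $A$) are illustrative defaults, not the paper's settings:

```python
import math

def assg_update(u_abs, M, D, beta1=0.9, beta2=0.9, gamma=1.0, A=0.95):
    """One ASSG iteration: EMA statistics (step 1), adaptive sharpness (step 2)."""
    M = beta1 * M + (1.0 - beta1) * u_abs             # EMA of |u| deviations
    D = beta2 * D + (1.0 - beta2) * abs(u_abs - M)    # relaxation term
    alpha = (2.0 / (math.pi * (M + gamma * D))) * math.tan(math.pi * A / 2.0)
    return M, D, alpha

def surrogate(u, alpha):
    """Step 3: arctangent surrogate profile at the adapted sharpness."""
    return alpha / (2.0 * (1.0 + ((math.pi / 2.0) * alpha * u) ** 2))

# Toy trajectory of membrane-potential deviations |u| across iterations.
M, D = 0.5, 0.0
for u_abs in [0.4, 0.6, 0.5, 0.7]:
    M, D, alpha = assg_update(u_abs, M, D)

# By construction, the vanishing degree at the tracked scale M + gamma*D equals A.
G = (2.0 / math.pi) * math.atan((math.pi / 2.0) * alpha * (M + D))
print(round(G, 4))  # -> 0.95
```

The final check makes the guarantee concrete: solving the sharpness equation for $\alpha$ pins $G(M + \gamma D)$ exactly at the chosen bound $A$.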

3. Integration into SNN Adversarial Attack and Training

ASSG is integrated into adversarial attack generation using the Stable Adaptive Projected Gradient Descent (SA-PGD) scheme, which incorporates:

  • Adaptive surrogate sharpness as described above,
  • Momentum updates for the adversarial step, combining $\ell_1$-normalized gradients with a tracked variance,
  • Per-step $L_\infty$ clipping and projection to maintain the attack budget,
  • Automatic adjustment of update step sizes, leveraging the stabilized gradients produced by ASSG.

This framework enables robust and efficient maximization of adversarial loss, exposing vulnerabilities that standard fixed-surrogate attacks miss. In standard SNN training, variants of ASSG (e.g., MPD-AGL) use layer- and timestep-adaptive kernel widths $\kappa^l[t]$ proportional to the standard deviation of the empirical membrane potential distribution, maintaining a consistent overlap with gradient-support regions and preventing both vanishing and exploding gradients (Jiang et al., 17 May 2025).
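A schematic of the SA-PGD inner step, paraphrasing the bullets above (not the authors' reference implementation, and omitting the automatic step-size adjustment) under an $L_\infty$ budget:

```python
import numpy as np

def sa_pgd_step(x_adv, x_clean, grad, momentum, eps, step, mu=0.9):
    """One schematic SA-PGD step: momentum over l1-normalized gradients,
    then an L-infinity clip/projection back into the attack budget.
    `grad` is assumed to be the loss gradient obtained through ASSG."""
    momentum = mu * momentum + grad / (np.sum(np.abs(grad)) + 1e-12)
    x_adv = x_adv + step * np.sign(momentum)              # ascent on the loss
    x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)  # stay within budget
    return np.clip(x_adv, 0.0, 1.0), momentum             # valid input range

# Usage on a toy input with stand-in gradients (a real attack would
# backpropagate through the SNN with the adaptive surrogate).
rng = np.random.default_rng(0)
x = rng.random((4, 4))
x_adv, m = x.copy(), np.zeros_like(x)
for _ in range(10):
    grad = rng.standard_normal(x.shape)
    x_adv, m = sa_pgd_step(x_adv, x, grad, m, eps=8/255, step=2/255)
```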

4. Comparative Methodological Properties

The table below summarizes key differences between fixed and adaptive surrogate gradients in SNNs:

| Scheme | Surrogate parameter | Adaptivity target | Main benefit |
|---|---|---|---|
| Fixed SG | $\alpha$, $\kappa$ constant | None | Simplicity, but susceptible to gradient vanishing |
| Adaptive SG (ASSG) | $\alpha_{i,t}^l$, $\kappa^l[t]$ | EMA/MPD statistics | Maintains gradient support and fidelity; mitigates vanishing |

ASSG generalizes across neuron models (LIF-2, IF, PSN), network depths, and attack or training regimes, outperforming fixed surrogates in both attack success rate and effective optimization (Wang et al., 27 Dec 2025, Jiang et al., 17 May 2025).

5. ASSG in Sharpness-Aware Minimization and Efficient Deep Learning

In the context of Sharpness-Aware Minimization (SAM), ASSG enables computationally efficient surrogates for the expensive double-pass gradient calculations required by vanilla SAM. The SAM update is formally decomposed into the usual SGD gradient and a "Projection of the Second-order component onto the First-order gradient" (PSF):

$$g_{\mathrm{SAM}} \approx \nabla L(w) + \rho\,\frac{\nabla^2 L(w)\,\nabla L(w)}{\|\nabla L(w)\|}$$

ASSG in ARSAM adaptively samples and reuses the PSF, modulating the frequency of true PSF computations by an autoregressive update on their relative magnitude and temporal smoothness. When not resampling, the previous PSF is mixed with the current SGD gradient, reducing runtime by up to 40% without compromising generalization performance on standard benchmarks (Deng et al., 4 Oct 2025).
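The decomposition and caching idea can be demonstrated on a toy quadratic loss, where the Hessian is known exactly; the periodic resampling schedule below is a simplification standing in for ARSAM's adaptive autoregressive rule:

```python
import numpy as np

rho, lr = 0.05, 0.1
H = np.diag([4.0, 1.0])           # Hessian of the toy loss 0.5 * w^T H w
w = np.array([1.0, -2.0])

def grad(w):
    return H @ w                  # first-order (SGD) gradient

def psf(g):
    """PSF term rho * (H g) / ||g||; in a real network this needs an
    extra forward/backward pass, which is what ARSAM amortizes."""
    return rho * (H @ g) / (np.linalg.norm(g) + 1e-12)

psf_cache = np.zeros_like(w)
for k in range(10):
    g = grad(w)
    if k % 3 == 0:                # recompute PSF only occasionally
        psf_cache = psf(g)        # expensive step
    w = w - lr * (g + psf_cache)  # SAM-style step mixing cached PSF with fresh g
```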

6. Empirical Evaluation and Impact

ASSG-based methods have demonstrated the following empirical outcomes:

  • In SNN adversarial evaluation: on CIFAR-10 under adversarial training, standard APGD+STBP attains a 75.38% attack success rate (ASR); APGD+ASSG achieves 84.06%, and SA-PGD+ASSG reaches 88.44%. On neuromorphic CIFAR10-DVS, SA-PGD+ASSG attains 49.10% ASR vs. 36.10% for STBP. ASSG increases the fraction of gradient-available neurons (~16% vs. ~9% for fixed surrogates), enhances classification at low latency (96.18% CIFAR-10 accuracy at $T=2$), and achieves energy savings (0.55 mJ) (Wang et al., 27 Dec 2025, Jiang et al., 17 May 2025).
  • In sharpness-aware deep learning: ARSAM equipped with ASSG delivers test accuracy within 0.1 pt of full SAM on CIFAR-100 while performing only ~30% of the expensive SAM computations and achieving 136% throughput. Adaptive reuse of the PSF improves speed and stability with negligible accuracy degradation (Deng et al., 4 Oct 2025).

7. Theoretical and Practical Implications

ASSG establishes that adaptive surrogate parameterization—conditioned on local feature statistics or second-order geometry—is essential to avoid overestimating robustness or underutilizing optimization paths in modern neural architectures. In SNNs, ASSG demonstrates that previously measured robustness against gradient-based attacks was largely influenced by the choice of surrogate, rather than any intrinsic property of spiking computation. A plausible implication is that robust SNN training, as well as certified robustness techniques, should integrate adaptive surrogates to ensure valid worst-case evaluations (Wang et al., 27 Dec 2025).

Future research avenues include integrating ASSG (and related MPD-AGL approaches) directly into adversarial training protocols, exploring their synergy with certified robustness and randomized smoothing, and developing low-overhead hardware approximations for neuromorphic deployment. In sharpness-aware minimization, autoregressive and selectively-updated surrogate strategies (as in ARSAM) provide a template for scalable, efficient training at scale without sacrificing the flat-minimum preference of modern optimization objectives (Deng et al., 4 Oct 2025).
