Numerical Fragility in Transformers: A Layer-wise Theory for Explaining, Forecasting, and Mitigating Instability

Published 17 Oct 2025 in cs.LG, cs.NA, and math.NA | (2510.21770v1)

Abstract: Transformers trained in low precision can suffer forward-error amplification. We give a first-order, module-wise theory that predicts when and where errors grow. For self-attention we derive a per-layer bound that factorizes into three interpretable diagnostics: a score-scale ratio $\kappa_{\rm score}$, a rowwise softmax sensitivity $\kappa_{\rm softmax}$, and value conditioning $\kappa(V)$. We prove a residual relaxation inequality showing that residual blocks attenuate depth-wise accumulation, and we introduce a precision- and width-aware LayerNorm indicator $\rho_{\rm LN}$ with a matching first-order bound in the $\epsilon$-dominated regime. These pieces yield a unified forward-stability bound whose right-hand side is directly estimable during training. On Tiny-ViT/CIFAR-10 we evaluate the bound and components. (1) The combined predictor $\kappa_{\rm softmax}\,(1+\kappa_{\rm score})\,\kappa(V)\,\|W_O\|_2+\kappa_{\rm eff}+C_{\rm LN}$ tracks FP32$\leftrightarrow$LP mismatches across seeds, widths, and precisions; scaling by $\epsilon_{\rm mach}$ collapses mixed-precision points. (2) The time-series maximum of $\kappa_{\rm softmax}$ acts as an early-warning signal, leading error spikes by 16-24 steps (corr. 0.65-0.82; permutation $p\!\approx\!10^{-3}$; Precision@K 0.89-1.00). (3) Guided by $\rho_{\rm LN}$, a small LayerNorm-$\epsilon$ tweak targeting $\rho_\star$ gives consistent stabilization (mean tail-loss $\downarrow\ \approx0.010$ at $\rho_\star\!=\!0.6$, cap$=10^{-2}$) with negligible overhead. Overall, our theory supplies actionable, unitless diagnostics that (i) explain when self-attention is fragile, (ii) forecast instability, and (iii) motivate a minimally invasive mitigation.
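
To make the shape of the combined predictor in item (1) concrete, here is a minimal NumPy sketch for a single attention head. The abstract does not define the individual diagnostics, so the formulas below are illustrative assumptions rather than the paper's definitions: `kappa_score` is taken as a norm ratio for the scaled product $QK^\top/\sqrt{d}$, `kappa_softmax` as the largest rowwise relative condition number of softmax, and `kappa_V` as the 2-norm condition number of $V$. The terms `kappa_eff` and `c_ln` are passed in as plain numbers because their definitions are not given here. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax_rows(S):
    """Numerically stable softmax applied to each row of S."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def attention_diagnostics(Q, K, V, W_O):
    """Return assumed proxies for the three attention diagnostics and ||W_O||_2.

    These are stand-in definitions (assumptions), not the paper's formulas.
    """
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)   # attention scores
    P = softmax_rows(S)        # attention probabilities

    # Assumed score-scale ratio: ||Q||_2 ||K||_2 / sqrt(d) relative to ||S||_2.
    kappa_score = (np.linalg.norm(Q, 2) * np.linalg.norm(K, 2) / np.sqrt(d)
                   / np.linalg.norm(S, 2))

    # Assumed rowwise softmax sensitivity: relative condition number
    # ||J||_2 ||s||_2 / ||p||_2 with Jacobian J = diag(p) - p p^T, maximized over rows.
    row_kappas = []
    for s, p in zip(S, P):
        J = np.diag(p) - np.outer(p, p)
        row_kappas.append(np.linalg.norm(J, 2) * np.linalg.norm(s) / np.linalg.norm(p))
    kappa_softmax = max(row_kappas)

    kappa_V = np.linalg.cond(V)        # 2-norm condition number (via SVD)
    w_o_norm = np.linalg.norm(W_O, 2)  # spectral norm of the output projection
    return kappa_score, kappa_softmax, kappa_V, w_o_norm

def combined_predictor(Q, K, V, W_O, kappa_eff=0.0, c_ln=0.0):
    """kappa_softmax * (1 + kappa_score) * kappa(V) * ||W_O||_2 + kappa_eff + C_LN."""
    k_score, k_soft, k_V, w_norm = attention_diagnostics(Q, K, V, W_O)
    return k_soft * (1.0 + k_score) * k_V * w_norm + kappa_eff + c_ln

# Toy usage with random inputs (shapes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
W_O = rng.standard_normal((d, d))

pred = combined_predictor(Q, K, V, W_O)
eps_fp16 = np.finfo(np.float16).eps  # machine epsilon of the low-precision format
print("unitless predictor:", pred)
print("scaled by eps_mach (fp16):", eps_fp16 * pred)
```

Multiplying by the machine epsilon of the low-precision format, as in the last line, mirrors the abstract's remark that scaling by $\epsilon_{\rm mach}$ puts different precisions on a common scale; whether these proxy definitions reproduce that collapse is not something this sketch establishes.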