Ghosts of Softmax: Complex Singularities That Limit Safe Step Sizes in Cross-Entropy
This presentation explores a fundamental geometric constraint governing neural network training with cross-entropy loss. The paper reveals how complex singularities in the softmax partition function, termed 'ghosts of softmax', create branch points that dictate safe step sizes through the Taylor convergence radius, independent of real-line curvature. The authors introduce a tractable bound based on directional logit derivatives that predicts instabilities missed by traditional smoothness-based analysis, and demonstrate a robust controller that survives extreme learning rate perturbations while achieving competitive accuracy without hand-tuned schedules.
Script
Neural networks collapse during training not from poor curvature, but from invisible boundaries drawn by complex singularities—mathematical ghosts that haunt the softmax function and dictate where optimization can safely step.
The authors show that local surrogate models break down not where curvature explodes, but where complex analysis dictates: at the radius determined by zeros of the softmax partition function in the complex plane, creating branch points that wall off safe parameter space.
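As a worked illustration of where such a radius comes from, consider the two-class case. The sketch below assumes the logits vary linearly along the step direction, z_k(t) = z_k + t a_k; that linearization and the symbols z_k, a_k, and t_m are assumptions made here for illustration, not notation taken from the script.

```latex
\[
\begin{aligned}
Z(t) &= e^{z_1 + a_1 t} + e^{z_2 + a_2 t} = 0
  \iff e^{(z_1 - z_2) + (a_1 - a_2)\,t} = -1 \\
t_m &= \frac{(z_2 - z_1) + i\pi(2m+1)}{a_1 - a_2}, \qquad m \in \mathbb{Z} \\
|t_0| &= \frac{\sqrt{(z_1 - z_2)^2 + \pi^2}}{|a_1 - a_2|} \;\ge\; \frac{\pi}{|a_1 - a_2|}
\end{aligned}
\]
```

On the real line Z(t) is a sum of positive terms and never vanishes, so these zeros are invisible to real-valued curvature. Yet log Z(t) has a branch point at each t_m, so its Taylor series around t = 0 converges only for |t| < |t_0|.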
So how do we locate these invisible boundaries without exploring the entire complex plane?
The paper derives a tractable lower bound requiring only the spread of directional logit derivatives, computable with a single forward-mode derivative pass. This yields a normalized step size coordinate r that collapses instability thresholds across architectures and temperatures by a factor of 6.
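The computation described above can be sketched in JAX. This is a minimal illustration, not the authors' implementation: the toy linear model `logits_fn`, the helper names `radius_bound` and `normalized_step`, the specific form rho = pi / delta, and the definition r = step / rho are all assumptions. The form pi / delta follows from linearizing the logits along the step direction, but the script does not state the formula explicitly.

```python
import jax
import jax.numpy as jnp


def logits_fn(params, x):
    # Hypothetical toy model: one linear layer producing class logits.
    return x @ params["W"] + params["b"]


def radius_bound(params, direction, x):
    # Single forward-mode pass: a = d/dt logits(params + t * direction) at t = 0.
    _, a = jax.jvp(lambda p: logits_fn(p, x), (params,), (direction,))
    # Spread of the directional logit derivatives for each example.
    delta = jnp.max(a, axis=-1) - jnp.min(a, axis=-1)
    # Assumed bound rho = pi / delta, taking the worst case over the batch.
    # A zero spread gives an infinite radius (no constraint from this bound).
    return jnp.pi / jnp.max(delta)


def normalized_step(step_size, rho):
    # Assumed normalized coordinate: r < 1 keeps the step inside the bound.
    return step_size / rho


# Usage with made-up numbers: 2 input features, 3 classes, batch of 1.
params = {"W": jnp.zeros((2, 3)), "b": jnp.zeros(3)}
direction = {"W": jnp.array([[0.0, 1.0, 3.0], [0.0, 0.0, 0.0]]),
             "b": jnp.zeros(3)}
x = jnp.array([[1.0, 0.0]])

rho = radius_bound(params, direction, x)
print("rho:", rho, "r at step 0.5:", normalized_step(0.5, rho))
```

Here `direction` plays the role of the proposed parameter update, so the step size is measured along that update. Forward mode fits because only one directional derivative of the logits is needed, rather than a full Jacobian.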
Controlled experiments support the bound: a controller based on rho sub a withstands adversarial learning rate spikes that break training with Adam and with gradient clipping. Realistic ResNet-18 training achieves strong accuracy without schedules, and every observed instability aligns with a violation of the geometric bound.
This constraint operates independently from Hessian-based logic: late in training, confident predictions drive delta sub a upward, shrinking rho sub a even as curvature flatlines, explaining sudden late-stage fragility in large models and motivating radius-aware activation and normalization designs.
The ghosts of softmax reveal that safe optimization lives within circles drawn by complex analysis, not just valleys carved by curvature.