Link between gradient bottleneck and spectral-saturation optimization dynamics

Ascertain whether the gradient bottleneck effect caused by compressing V-dimensional logit gradients through a rank-D language modeling head (i.e., backpropagating through a low-rank linear projection that discards most of the logit-gradient components) explains or contributes to the unstable optimization dynamics associated with representation degeneration and spectral saturation hypothesized for small language models by Godey et al. (2024). Specifically, determine if and how the identified gradient compression mechanism sheds light on those instability phenomena and the associated performance drops.

Background

Prior work has documented a representation degeneration phenomenon in neural language models, where output embeddings collapse into a narrow cone, and has linked this phenomenon to the softmax bottleneck; it has further been hypothesized that this bottleneck can lead to unstable training dynamics when the LM head reaches a regime of spectral saturation. This paper provides a complementary optimization perspective, showing that backpropagating through a low-rank LM head severely compresses the V-dimensional logit gradients, discarding most of the training signal.
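The compression mechanism at issue can be made concrete with a minimal NumPy sketch (not the paper's code; the head W and the sizes V and D are illustrative assumptions). The cross-entropy gradient with respect to the logits is V-dimensional, but only its projection through the rank-D head, i.e. its component in the D-dimensional column space of W, is backpropagated to the model; the orthogonal component is discarded.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 8192, 64  # vocabulary size >> hidden dimension: the low-rank bottleneck

# Hypothetical LM head: logits = W @ h, with W of shape (V, D) and rank D
W = rng.standard_normal((V, D)) / np.sqrt(D)
h = rng.standard_normal(D)
target = 123  # arbitrary gold-token index

# Forward pass: softmax over the V logits
logits = W @ h
p = np.exp(logits - logits.max())
p /= p.sum()

# V-dimensional logit gradient of cross-entropy: dL/dlogits = p - one_hot(target)
g = p.copy()
g[target] -= 1.0

# Backprop through the head: only the D-dimensional projection W.T @ g
# reaches the rest of the model
grad_h = W.T @ g

# That projection can only carry the component of g lying in the D-dimensional
# column space of W; everything orthogonal to it is lost.
Q, _ = np.linalg.qr(W)           # orthonormal basis of col(W), shape (V, D)
g_kept = Q @ (Q.T @ g)           # component of g visible through the head
retained = np.linalg.norm(g_kept) ** 2 / np.linalg.norm(g) ** 2

print(f"backpropagated gradient dimension: {grad_h.shape[0]} (vs {V} logit components)")
print(f"fraction of logit-gradient energy retained: {retained:.4f}")
```

With a random Gaussian head as above, the retained fraction is small (on the order of D/V), which is the sense in which most of the logit-gradient signal is discarded; a trained head would differ, so this sketch only illustrates the geometry of the bottleneck, not the paper's measured quantities.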

The authors explicitly state it is unclear whether their gradient-bottleneck analysis explains the previously hypothesized unstable dynamics. Establishing or refuting a causal relationship between gradient compression and spectral-saturation-induced instabilities remains an open question.

References

"They hypothesize that the softmax bottleneck can lead to optimization issues through unstable dynamics when the LM head reaches a form of spectral saturation. It is unclear whether our findings could shed light on these optimization dynamics, but we argue that even when the representations do not degenerate, the gradient bottleneck effect is still limiting the amount of information that can be backpropagated to the model."

Lost in Backpropagation: The LM Head is a Gradient Bottleneck  (2603.10145 - Godey et al., 10 Mar 2026) in Section 6, Related Works and Discussion — Representation Degeneration and Gradient Flow