Link between gradient bottleneck and spectral-saturation optimization dynamics
Determine whether the gradient bottleneck effect explains or contributes to the unstable optimization dynamics that Godey et al. (2024) hypothesize for small language models in connection with representation degeneration and spectral saturation. The gradient bottleneck arises when V-dimensional logit gradients are compressed through a rank-D language modeling head: backpropagating through this low-rank linear projection discards most of the logit-gradient components. Specifically, determine whether and how this gradient compression mechanism sheds light on those instability phenomena and the associated performance drops.
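The compression mechanism described above can be illustrated with a small numerical sketch. It is not the authors' code: the head matrix `W`, the hidden state `h`, the sizes `V` and `D`, and the function name `gradient_bottleneck_demo` are all assumptions chosen for illustration, and the head is random Gaussian rather than trained. The sketch only shows that the gradient reaching the hidden state depends on the part of the logit gradient lying in the column space of `W`; it says nothing about spectral saturation itself.

```python
import numpy as np


def gradient_bottleneck_demo(V=1000, D=16, seed=0):
    """Backpropagate a cross-entropy logit gradient through a V x D head.

    All names and sizes are hypothetical; W is a random stand-in for a
    trained language modeling head.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((V, D)) / np.sqrt(D)  # LM head, rank at most D
    h = rng.standard_normal(D)                    # final hidden state
    target = int(rng.integers(V))                 # index of the true token

    # Numerically stable softmax over the V logits.
    logits = W @ h
    logits = logits - logits.max()
    p = np.exp(logits)
    p = p / p.sum()

    # Gradient of the cross-entropy loss w.r.t. the logits (lives in R^V).
    g = p.copy()
    g[target] -= 1.0

    # Gradient w.r.t. the hidden state (lives in R^D): the compression step.
    grad_h = W.T @ g

    # Split g into the part inside the column space of W and the rest.
    Q, _ = np.linalg.qr(W)          # reduced QR: Q is V x D, orthonormal columns
    g_kept = Q @ (Q.T @ g)          # component that influences grad_h
    g_lost = g - g_kept             # component annihilated by W.T

    kept_fraction = float(g_kept @ g_kept / (g @ g))
    return W, g, grad_h, g_kept, g_lost, kept_fraction


if __name__ == "__main__":
    *_, kept_fraction = gradient_bottleneck_demo()
    print(f"fraction of squared logit-gradient norm kept: {kept_fraction:.4f}")
```

Because `W.T` maps R^V to R^D, its null space has dimension at least V - D, so any part of the logit gradient in that subspace cannot reach the rest of the model regardless of whether the representations have degenerated.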
References
Godey et al. (2024) hypothesize that the softmax bottleneck can lead to optimization issues through unstable dynamics when the LM head reaches a form of spectral saturation. It is unclear whether our findings could shed light on these optimization dynamics, but we argue that even when the representations do not degenerate, the gradient bottleneck effect still limits the amount of information that can be backpropagated to the model.