Does the smallest GPT-2 model exhibit saturation during training?
Determine whether the smallest GPT-2 model suffers from performance saturation during pretraining, i.e., a late-stage degradation followed by a plateau in evaluation performance or in-domain loss, given that strong last-layer anisotropy is observed for this model.
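Since the question hinges on a measurable quantity, the following is a minimal sketch of how last-layer anisotropy could be estimated for the smallest GPT-2 model, using the common definition of anisotropy as the average pairwise cosine similarity of hidden representations. The Hugging Face `transformers` checkpoint name, the sample text, and the single-sentence batch are illustrative assumptions, not the paper's exact measurement protocol.

```python
# Minimal sketch: estimate last-layer anisotropy of GPT-2 small as the
# average pairwise cosine similarity of its final hidden states.
# Assumes the Hugging Face `transformers` library; a proper estimate
# would average over many sequences from a corpus.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # smallest GPT-2 checkpoint
model = GPT2Model.from_pretrained("gpt2").eval()

text = "Language models can saturate late in pretraining."  # illustrative input
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state.squeeze(0)  # (seq_len, d_model)

# Normalize each token representation, then average all off-diagonal cosines.
normed = torch.nn.functional.normalize(hidden, dim=-1)
cos = normed @ normed.T                                     # (seq_len, seq_len)
n = cos.size(0)
off_diag = cos.masked_select(~torch.eye(n, dtype=torch.bool))
print(f"mean pairwise cosine (anisotropy estimate): {off_diag.mean().item():.3f}")
```

A mean cosine close to 1 would indicate the strong last-layer anisotropy reported for this model; answering the saturation question itself would additionally require comparing evaluation or in-domain loss across pretraining checkpoints, which this snippet does not attempt.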
References
Although we observe strong last-layer anisotropy for the smallest GPT-2 model, we cannot tell with certainty whether it suffered from saturation.
— Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck (Godey et al., 2024, arXiv:2404.07647), Limitations section