Does the smallest GPT-2 model exhibit saturation during training?

Determine whether the smallest GPT-2 model suffers from performance saturation during pretraining, i.e., a late-stage degradation followed by a plateau in evaluation performance or in-domain loss, given that strong last-layer anisotropy is observed for this model.

Background

The paper studies performance saturation in small LLMs, linking it to the softmax bottleneck and representation degeneration. Using the Pythia suite with extensive intermediate checkpoints, the authors observe that smaller models display last-layer anisotropy and a corresponding performance drop later in training.

For GPT-2, while strong last-layer anisotropy is noted for the smallest model, the authors lack sufficient training dynamics evidence (e.g., released intermediate checkpoints) to confirm whether it underwent the same saturation behavior observed in Pythia models.
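The anisotropy referred to above is commonly quantified as the mean pairwise cosine similarity of a model's last-layer representations. Below is a minimal sketch of that measure on synthetic data (not actual GPT-2 hidden states); the vector dimensions and the shared-offset construction are illustrative assumptions, not values from the paper.

```python
import numpy as np

def anisotropy(reps: np.ndarray) -> float:
    """Mean pairwise cosine similarity across a set of representations.

    Values near 0 indicate isotropy (directions spread uniformly);
    values near 1 indicate strong anisotropy (a shared dominant direction).
    """
    # L2-normalize each representation vector.
    normed = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sims = normed @ normed.T  # cosine similarity matrix
    n = sims.shape[0]
    # Average over off-diagonal entries only (exclude self-similarity).
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

rng = np.random.default_rng(0)
# Isotropic case: zero-mean Gaussian vectors (synthetic stand-in).
iso = rng.normal(size=(500, 64))
# Anisotropic case: the same vectors shifted along a shared direction,
# mimicking the degenerate last-layer geometry described above.
aniso = iso + 5.0 * np.ones(64)
print(f"isotropic:   {anisotropy(iso):.3f}")
print(f"anisotropic: {anisotropy(aniso):.3f}")
```

Tracking this statistic across intermediate checkpoints is what reveals the saturation dynamics in Pythia; the absence of such checkpoints for GPT-2 is precisely the evidence gap noted above.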

References

"Although we observe strong last-layer anisotropy for the smallest GPT-2 model, we cannot tell with certainty whether it suffered from saturation."