Counting embedding parameters in scaling-law performance extrapolation
Determine whether and how including embedding parameters in total parameter counts should affect downstream performance extrapolation in language model scaling-law analyses (e.g., Kaplan et al., 2020; Hoffmann et al., 2022), particularly when comparing models that differ mainly in LM head rank but share the same Transformer backbone.
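As a concrete illustration of the ambiguity, the sketch below tallies parameters for a single Transformer backbone paired with LM heads of different rank under three counting conventions. All dimensions, the vocabulary size, and the helper functions are hypothetical, chosen only so the totals land near the 2B scale discussed below.

```python
def backbone_params(n_layers, d_model, d_ff=None):
    """Rough count of Transformer block parameters (attention + MLP),
    ignoring layer norms, biases, and positional parameters."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    per_layer = 4 * d_model * d_model   # Q, K, V, and output projections
    per_layer += 2 * d_model * d_ff     # MLP up/down projections
    return n_layers * per_layer


def lm_head_params(d_model, vocab, rank=None):
    """A full-rank head is a d_model x vocab matrix; a rank-r head factors
    it into d_model x r and r x vocab."""
    return d_model * vocab if rank is None else rank * (d_model + vocab)


# Hypothetical ~2B-scale configuration (not the actual models in question).
n_layers, d_model, vocab = 26, 2048, 256_000
backbone = backbone_params(n_layers, d_model)
embedding = d_model * vocab  # input token-embedding table

for rank in (None, 512, 128):
    head = lm_head_params(d_model, vocab, rank)
    total = backbone + embedding + head   # count everything
    no_embed = backbone + head            # exclude the input embedding only
    backbone_only = backbone              # Kaplan-style non-embedding count
    label = "full-rank" if rank is None else f"rank-{rank}"
    print(f"{label:>9}: total={total/1e9:.2f}B  "
          f"excl. embedding={no_embed/1e9:.2f}B  "
          f"backbone only={backbone_only/1e9:.2f}B")
```

Under the backbone-only convention these hypothetical checkpoints are identical in size; under the other two conventions they differ by up to roughly half a billion parameters, which is exactly the ambiguity the question above targets.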
Context
A potential confound in this experiment is the slight difference in total parameter count across models (ranging from 1.8B to 2.0B parameters). However, it is unclear whether embedding parameters should even be included in the counts used for downstream performance extrapolation~\citep{kaplan_scaling,chinchilla_scaling}, and the reported gaps are far larger than such small size differences would predict.
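To gauge how much a 1.8B-versus-2.0B discrepancy could plausibly matter, the sketch below plugs both sizes into the Chinchilla-style parametric loss from Hoffmann et al. (2022). The coefficients are the approximate published fit and the token budget is an arbitrary assumption; this is a rough sanity check, not a claim about these particular models.

```python
# Back-of-the-envelope check using a Chinchilla-style parametric loss,
# L(N, D) = E + A / N**alpha + B / D**beta. The coefficients below are the
# approximate fit reported by Hoffmann et al. (2022); the token budget is an
# arbitrary assumption and its term cancels when both models see the same data.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def predicted_loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

tokens = 100e9  # assumed shared training budget
gap = predicted_loss(1.8e9, tokens) - predicted_loss(2.0e9, tokens)
print(f"Predicted loss gap between 1.8B and 2.0B models: {gap:.4f} nats")
# ~0.01 nats: a far smaller effect than the gaps the comparison is probing.
```

If that fit is even roughly applicable, size discrepancies of this magnitude account for loss differences on the order of a hundredth of a nat, consistent with the argument above that the observed gaps cannot be explained by parameter count alone.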