Counting embedding parameters in scaling-law performance extrapolation

Determine whether and how including embedding parameters in total parameter counts should affect downstream performance extrapolation in language model scaling-law analyses (e.g., Kaplan et al., 2020; Hoffmann et al., 2022), particularly when comparing models that differ mainly in LM head rank but share the same Transformer backbone.

Background

When comparing 2B-parameter models with identical Transformer backbones but different effective LM head ranks, the total parameter count varies slightly because the LM head, an embedding-related component, changes size with its rank. The authors highlight that this small discrepancy could be a confound, but note uncertainty about whether embedding parameters should be counted at all for downstream performance extrapolation in scaling-law contexts.

Clarifying whether embedding parameters belong in the counts used for such extrapolations would help interpret results and control for confounds in studies of training efficiency and performance across different LM head configurations.
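To see why the excerpt argues that the observed gaps exceed what the size discrepancy would predict, one can plug the two counts into a Chinchilla-style power law in parameter count, $L(N) = E + A/N^{\alpha}$. The coefficients below are illustrative placeholders, not the fitted values from Hoffmann et al.; the sketch only shows that a 1.8B-vs-2.0B difference moves the predicted loss by a small amount.

```python
# Sensitivity of a Chinchilla-style scaling law to a small difference in
# parameter count, at fixed data. Coefficients E, A, alpha are illustrative
# assumptions, not the fitted values from the scaling-law papers.

def predicted_loss(n_params: float, E: float = 1.7,
                   A: float = 400.0, alpha: float = 0.34) -> float:
    return E + A / n_params ** alpha

gap = predicted_loss(1.8e9) - predicted_loss(2.0e9)
print(f"predicted loss gap between 1.8B and 2.0B params: {gap:.4f}")
```

Under these placeholder coefficients the predicted gap is on the order of a hundredth of a nat, so any substantially larger observed difference would not be explained by the parameter-count discrepancy alone, whichever counting convention is used.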

References

A potential confounding factor for this experiment is the slight difference in total parameter count across models (ranging from 1.8B to 2.0B parameters). However, it is not clear whether counting embedding parameters is relevant for downstream performance extrapolation~\citep{kaplan_scaling,chinchilla_scaling}, and the reported gaps are much more significant than such size discrepancies would predict.

Lost in Backpropagation: The LM Head is a Gradient Bottleneck  (2603.10145 - Godey et al., 10 Mar 2026) in Subsection 4.1, Consequences on LLM Training