Identify the most effective and efficient embedding scaling strategy across regimes

Determine which embedding scaling strategy achieves superior effectiveness and efficiency across different scaling regimes of large language models: structural expansion via Per-Layer Embedding (PLE), which allocates independent embedding parameters to each layer, or vocabulary expansion via N-gram Embedding, which uses hashed n-gram lookup tables.

Background

Multiple embedding scaling strategies are surveyed: Per-Layer Embedding (PLE) distributes additional embedding capacity across layers by giving each layer its own embedding table, while N-gram Embedding expands input representations through large, hashed n-gram lookup tables. Although the paper presents empirical comparisons in specific settings, the authors note that it remains unclear which approach is preferable in general.
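To make the two mechanisms concrete, the following is a minimal NumPy sketch of both lookups. All names, table sizes, and the polynomial hash are illustrative assumptions, not the paper's implementation: PLE is modeled as one independent token table per layer, and N-gram Embedding as a hash of each n-gram of token ids into a large bucket table whose vector is added at the position that closes the n-gram.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab, n_layers, n_buckets = 8, 100, 4, 1 << 16  # toy sizes

# Per-Layer Embedding (PLE): an independent token-embedding table per
# layer, so each layer can add its own learned lookup to the hidden state.
ple_tables = [rng.normal(size=(vocab, dim)) for _ in range(n_layers)]

def ple_lookup(token_ids, layer):
    """Look up the given layer's private embedding for each token."""
    return ple_tables[layer][np.asarray(token_ids)]

# N-gram Embedding: hash each n-gram of token ids into a shared bucket
# table (collisions are tolerated) and attach the bucket's vector to the
# token position where the n-gram ends.
ngram_table = rng.normal(size=(n_buckets, dim))

def ngram_embed(token_ids, n=2):
    out = np.zeros((len(token_ids), dim))
    for i in range(len(token_ids) - n + 1):
        h = 0
        for t in token_ids[i:i + n]:
            h = (h * 1000003 + t) % n_buckets  # toy polynomial hash
        out[i + n - 1] += ngram_table[h]
    return out

tokens = [5, 17, 5, 42]
print(ple_lookup(tokens, layer=2).shape)  # (4, 8)
print(ngram_embed(tokens, n=2).shape)     # (4, 8)
```

The sketch highlights the trade-off the question targets: PLE multiplies parameter count by the number of layers, whereas N-gram Embedding concentrates parameters in one very large (but sparsely accessed) hashed table.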

Resolving this question would help model designers decide when to prefer structural embedding expansion over vocabulary-centric expansion, depending on model scale, sparsity, and deployment constraints.

References

Third, while some methods for scaling embeddings have been proposed, it is still unclear which scaling strategy is more effective and efficient under different regimes.

Scaling Embeddings Outperforms Scaling Experts in Language Models (2601.21204 - Liu et al., 29 Jan 2026), Section 1 Introduction