- The paper presents a dVAE-based method that constructs variable-length semantic IDs to allocate shorter codes for frequent items and longer codes for rare ones.
- The approach leverages Gumbel-Softmax relaxation and a truncated geometric prior to achieve efficient, scalable, and stable training across large datasets.
- Experiments show improved recommendation recall and coverage, with variable-length codes optimizing token efficiency under fixed budgets.
Variable-Length Semantic IDs for Recommender Systems
Motivation and Context
Generative modeling is increasingly integral to the architecture of modern recommender systems, where item catalogs reach extreme cardinalities, often at the multimillion or billion scale. Traditional vector-based item IDs pose significant challenges for generative models, both computationally (prohibitively expensive softmax operations) and in integration with LLMs, where they create a substantial vocabulary gap between item identifiers and natural-language tokens. Fixed-length semantic IDs partially address these issues by mapping items to discrete low-cardinality token sequences, improving efficiency for both generative retrieval and LLM integration. However, assigning codes of the same length to all items ignores the frequency distribution of real-world catalogs and contravenes Zipf's law of abbreviation, the linguistic principle that more frequent words tend to be shorter.
The emergent communication literature, which studies agent-based discrete communication games, has demonstrated the advantages of variable-length coding schemes (shorter codes for frequent objects, longer codes for rare ones), but such mechanisms have not been adopted systematically in recommendation settings.
Methodology
The paper proposes a principled approach to variable-length semantic ID construction using a discrete variational autoencoder (dVAE) with Gumbel-Softmax reparameterization. This circumvents the instability endemic to the REINFORCE-based training typical of emergent communication, and removes the fixed-length constraint of previous recommendation models. Each item embedding $x \in \mathbb{R}^d$ is represented as a token sequence $z = (z_1, \dots, z_L)$ of variable length $L \le T$ over a shared vocabulary $V$. The generative process factorizes as $p(x, z, L) = p(L)\,p(z \mid L)\,p(x \mid z_{1:L})$, with a truncated geometric prior governing $p(L)$ and a uniform symbol prior $p_V(z_t)$. The encoder autoregressively samples message tokens and per-step termination probabilities, avoiding dependence on an explicit EOS token.
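The sketch below illustrates one way such an autoregressive encoder can be implemented in PyTorch. It is a minimal sketch, not the paper's code: the module structure and all names (`VarLenEncoder`, `token_head`, `stop_head`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VarLenEncoder(nn.Module):
    """Autoregressive encoder q(z, L | x): at each step it emits a relaxed
    token over the shared vocabulary V and a termination probability, so no
    explicit EOS symbol is needed. Illustrative sketch, not the paper's code."""

    def __init__(self, d_item: int, d_hidden: int, vocab_size: int, max_len: int):
        super().__init__()
        self.max_len = max_len
        self.init_h = nn.Linear(d_item, d_hidden)          # condition state on the item embedding
        self.token_emb = nn.Linear(vocab_size, d_hidden)   # embed relaxed (soft) tokens
        self.cell = nn.GRUCell(d_hidden, d_hidden)
        self.token_head = nn.Linear(d_hidden, vocab_size)  # logits over the shared vocabulary
        self.stop_head = nn.Linear(d_hidden, 1)            # per-step termination probability

    def forward(self, x: torch.Tensor, tau: float = 1.0):
        h = torch.tanh(self.init_h(x))                                 # (B, d_hidden)
        inp = torch.zeros(x.size(0), h.size(1), device=x.device)
        tokens, stop_probs = [], []
        for _ in range(self.max_len):
            h = self.cell(inp, h)
            logits = self.token_head(h)
            # Gumbel-Softmax: a differentiable sample over vocabulary symbols.
            z_t = F.gumbel_softmax(logits, tau=tau, hard=False)
            tokens.append(z_t)
            stop_probs.append(torch.sigmoid(self.stop_head(h)))
            inp = self.token_emb(z_t)                                  # feed the soft token back in
        # Returns (B, T, |V|) relaxed tokens and (B, T) halting probabilities;
        # a distribution over lengths L is derived downstream from stop_probs.
        return torch.stack(tokens, dim=1), torch.cat(stop_probs, dim=1)
```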
Optimization leverages the ELBO, with reconstruction, vocabulary, and length regularization terms. The length prior induces a direct message length penalty and ensures frequent items receive shorter codes without explicit popularity conditioning. Gumbel-Softmax relaxation enables efficient, differentiable optimization, with soft residual quantization and KL annealing for stability.
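A matching sketch of the objective, under the same assumptions as the encoder above. The exact KL decomposition and weighting in the paper may differ; in particular, the length term below keeps only the per-token penalty induced by a geometric prior with stop probability `gamma`, dropping constants and the entropy of q(L), and the relaxed samples stand in for the per-position posterior.

```python
import math
import torch
import torch.nn.functional as F

def elbo_loss(x, x_rec, tokens, stop_probs, gamma: float = 0.5, beta: float = 1.0):
    """Negative ELBO sketch: reconstruction + vocabulary KL + length penalty.
    `gamma` is the geometric prior's stop probability (a larger value is a
    stronger length penalty); `beta` scales the KL terms and would be annealed
    during training for stability. Illustrative, not the paper's exact loss."""
    B, _, V = tokens.shape

    # Reconstruction term: decoder output vs. the original item embedding.
    rec = F.mse_loss(x_rec, x)

    # alive[:, t] = probability the message has not terminated before position t.
    cont = torch.cumprod(1.0 - stop_probs, dim=1)                      # (B, T)
    alive = torch.cat([torch.ones(B, 1, device=x.device), cont[:, :-1]], dim=1)

    # Vocabulary KL: relaxed token distribution vs. the uniform prior over V,
    # counted only for positions that are still "alive".
    log_q = torch.log(tokens.clamp_min(1e-8))
    kl_vocab = ((tokens * (log_q + math.log(V))).sum(-1) * alive).sum(1).mean()

    # Length penalty from the geometric prior p(L = t) ∝ (1 - gamma)^(t-1) * gamma:
    # -E_q[log p(L)] grows linearly in the expected length E_q[L] = sum_t alive_t.
    expected_len = alive.sum(1)
    kl_len = (expected_len * -math.log(1.0 - gamma)).mean()

    return rec + beta * (kl_vocab + kl_len)
```

Note how the length penalty needs no popularity signal: shortening a frequent item's code saves the penalty on many training examples at once, which is what pushes short codes toward the head of the catalog.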
Empirical Evaluation
Experiments utilize large-scale datasets (Yambda, VK-LSVD, Amazon Toys & Games), contrasting the proposed dVAE variable-length model with R-KMeans (fixed-length) and REINFORCE (variable-length) baselines. Metrics include reconstruction loss, token perplexity, code length distribution, recommendation recall/coverage, and scalability.
Efficiency and Quality: Variable-length dVAE achieves reconstruction quality comparable to fixed-length methods while using fewer tokens on average. Increasing the length penalty λ yields progressively shorter codes, trading conciseness against informativeness. Frequent items are consistently assigned shorter codes and rare or cold items longer ones, a pattern that emerges naturally from the ELBO formulation rather than from explicit popularity conditioning. Catalog- and data-weighted length statistics confirm that short codes dominate among popular items.
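For intuition about how a single scalar controls conciseness, here is a small computation of the expected code length under a truncated geometric prior as its stop probability varies; the exact mapping between this parameter and the paper's penalty λ is an assumption.

```python
def truncated_geometric_expected_length(gamma: float, T: int) -> float:
    """Expected L under p(L = t) ∝ (1 - gamma)^(t - 1) * gamma, truncated at T.
    A larger gamma (stronger length penalty) shifts mass toward shorter codes."""
    weights = [(1 - gamma) ** (t - 1) * gamma for t in range(1, T + 1)]
    Z = sum(weights)
    return sum(t * w for t, w in zip(range(1, T + 1), weights)) / Z

for gamma in (0.1, 0.3, 0.5):
    print(f"gamma={gamma}: E[L] ≈ {truncated_geometric_expected_length(gamma, T=20):.2f}")
```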
Downstream Impact: Sequential recommendation experiments demonstrate favorable Recall@100 and Coverage@100 for variable-length dVAE models, outperforming fixed-length baselines under a fixed token budget. The shorter codes allow more user-item events to fit within constrained budgets, enhancing both ranking quality and diversity of recommendations.
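A back-of-the-envelope illustration of the token-budget argument, with hypothetical numbers not taken from the paper:

```python
budget = 2048                  # hypothetical token budget for one user history
fixed_len = 8                  # hypothetical fixed semantic ID length
avg_var_len = 4                # hypothetical average variable-length ID length
print(budget // fixed_len)     # -> 256 events fit with fixed-length IDs
print(budget // avg_var_len)   # -> 512 events fit with variable-length IDs
```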
Training Stability: dVAE shows superior stability and effectiveness compared to REINFORCE approaches in large-scale regimes. REINFORCE training typically suffers from codebook collapse and degrading reconstruction quality, even when LSTM encoders and advanced regularization are employed.
Scalability: Increasing maximum allowed code length (T) and vocabulary size results in improved reconstruction quality, with variable-length dVAE maintaining stability and effective utilization of discrete capacity as these parameters scale.
Detailed diagnostics show that reconstruction quality improves with each added token (progressive reconstruction), and that token perplexity rises across the early code positions, indicating growing vocabulary utilization for finer semantic distinctions. With long codes (e.g., maxlen=20), the first 10–15 positions use close to the full vocabulary, while later positions collapse to a small set of symbols:
Figure 1: Position-wise token perplexity for varlen dVAE with maxlen=20 on Yambda, showing maximum diversity at intermediate code positions and rapid collapse for long suffixes.
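One plausible way to compute such a position-wise perplexity diagnostic from assigned integer codes; the aggregation details are an assumption, not the paper's procedure.

```python
import numpy as np

def positionwise_perplexity(codes: np.ndarray, vocab_size: int) -> np.ndarray:
    """Per-position token perplexity exp(H_t) over a set of assigned codes.

    `codes` is an (N, T) integer array padded with -1 beyond each item's code
    length. Values near `vocab_size` mean the position uses the full vocabulary;
    values near 1 indicate collapse to a few symbols.
    """
    _, T = codes.shape
    ppl = np.full(T, np.nan)
    for t in range(T):
        col = codes[:, t]
        col = col[col >= 0]                      # skip items shorter than t + 1
        if col.size == 0:
            continue                             # no item reaches this position
        p = np.bincount(col, minlength=vocab_size) / col.size
        entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
        ppl[t] = np.exp(entropy)
    return ppl
```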
Theoretical and Practical Implications
The introduction of variable-length semantic IDs aligns discrete representations of items in recommendation with efficient coding principles observed in natural language. By bridging emergent communication and recommendation system methodologies, the framework provides a scalable, stable, and theoretically grounded solution for item representation. It enables resource allocation based on information-theoretic considerations (code length adapts to item frequency), facilitating efficient integration with generative models and LLMs, and enhancing sequential recommendation.
Practically, variable-length semantic IDs serve as a foundational building block for next-generation generative retrieval pipelines. The method's robustness and scalability support application in industrial-scale recommender systems, including music and short-video platforms, and enable more nuanced conversational interfaces with LLMs. Theoretically, this approach advances the understanding of representation learning in large discrete spaces, connecting efficient communication theory and variational inference.
Future Directions
Potential developments include dynamic adjustment of length penalty during training to optimize system-wide efficiency/quality, adaptation to multimodal item representations, and integration with unified cross-modal models for joint search and recommendation. Further investigation into compositionality and transferability of learned semantic vocabularies may facilitate broader use of generative retrieval and conversational recommendation frameworks.
Conclusion
Variable-length semantic IDs provide an efficient coding scheme for recommender systems, allocating representational capacity according to item frequency and enabling favorable efficiency–quality trade-offs. The proposed dVAE-based methodology ensures stable, scalable training and bridges theoretical gaps between emergent communication and recommendation. These advances set the stage for widespread adoption of variable-length semantic representations in generative recommender architectures (2602.16375).