
Better Generalization with Semantic IDs: A Case Study in Ranking for Recommendations

Published 13 Jun 2023 in cs.IR and cs.LG | (2306.08121v2)

Abstract: Randomly-hashed item ids are used ubiquitously in recommendation models. However, the representations learned from random hashing prevent generalization across similar items, causing problems for learning unseen and long-tail items, especially when the item corpus is large, power-law distributed, and evolving dynamically. In this paper, we propose using content-derived features as a replacement for random ids. We show that simply replacing ID features with content-based embeddings can cause a drop in quality due to reduced memorization capability. To strike a good balance between memorization and generalization, we propose using Semantic IDs -- a compact discrete item representation learned from frozen content embeddings using RQ-VAE that captures the hierarchy of concepts in items -- as a replacement for random item ids. As with content embeddings, the compactness of Semantic IDs makes them difficult to adapt directly in recommendation models. We propose novel methods for adapting Semantic IDs in industry-scale ranking models by hashing sub-pieces of the Semantic-ID sequences. In particular, we find that the SentencePiece model commonly used in LLM tokenization outperforms manually crafted pieces such as N-grams. To this end, we evaluate our approaches on a real-world ranking model for YouTube recommendations. Our experiments demonstrate that Semantic IDs can replace the direct use of video IDs, improving generalization on new and long-tail item slices without sacrificing overall model quality.


Summary

  • The paper presents Semantic IDs generated via RQ-VAE to replace random IDs, improving generalization in recommendation ranking.
  • It compares n-gram and SPM-based adaptations, with SPM notably enhancing cold-start and overall CTR performance.
  • The approach preserves semantic hierarchies in item content, reducing computational overhead and improving user personalization.


Introduction

The paper "Better Generalization with Semantic IDs: A Case Study in Ranking for Recommendations" (2306.08121) addresses the prevalent issue in recommendation systems where randomly-hashed item IDs are used, which limits the ability to generalize across similar items, especially in large, evolving item corpora. To tackle this, the authors propose the use of content-derived Semantic IDs to replace random IDs, aiming to strike a balance between memorization and generalization while maintaining model quality.

Methodology

Semantic ID Generation with RQ-VAE

The cornerstone of the proposed methodology is the generation of Semantic IDs using RQ-VAE, a Residual-Quantized Variational AutoEncoder. This process encodes frozen content embeddings into discrete, hierarchically structured IDs that preserve semantic relationships among items. The RQ-VAE applies a multi-level quantization process that converts dense item representations into compact, discrete codes, capturing the hierarchical nature of item concepts.

Figure 1: Illustration of RQ-VAE: the input vector is encoded into a latent representation, which is quantized into the Semantic ID (1, 4, 6, 2), representing hierarchical concepts.
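The quantization step above can be sketched in a few lines. This is a minimal illustration of residual quantization at inference time, not the paper's implementation: the codebooks are random stand-ins for trained RQ-VAE codebooks, and the encoder is omitted (the input is treated as the latent directly).

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Quantize a latent vector into a Semantic ID.

    x: (d,) latent derived from a frozen content embedding.
    codebooks: list of L arrays, each (K, d) -- one codebook per level.
    Returns a tuple of L codeword indices (the Semantic ID).
    """
    residual = x.astype(np.float64)
    semantic_id = []
    for codebook in codebooks:
        # Pick the codeword closest to the current residual.
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        semantic_id.append(idx)
        # Subtract the chosen codeword; the next level quantizes what remains,
        # so earlier levels capture coarser concepts than later ones.
        residual = residual - codebook[idx]
    return tuple(semantic_id)

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]  # 4 levels, 256 codes each
x = rng.normal(size=64)
sid = residual_quantize(x, codebooks)  # a 4-tuple of codeword indices
```

Because each level quantizes only the residual left by the previous one, two items with similar content embeddings tend to agree on their leading codewords, which is what gives Semantic IDs their hierarchical structure.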

Adapting Semantic IDs in Ranking Models

The adaptation of Semantic IDs into ranking models is performed through two main strategies: n-gram-based and SentencePiece Model (SPM)-based adaptations. The former groups semantic codes into fixed-length n-grams, while the latter uses variable-length subword units dynamically learned from the item distribution. This flexibility allows SPM to more effectively manage embedding table entries, balancing memorization needs and generalization abilities.
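One plausible reading of the n-gram adaptation can be sketched as follows: consecutive sub-pieces of the Semantic ID are hashed into rows of a shared embedding table. The bigram choice, the hash construction, and the table size here are illustrative assumptions, not the paper's exact scheme.

```python
import hashlib

def ngram_pieces(semantic_id, n=2):
    """Split an L-level Semantic ID into consecutive n-grams."""
    return [tuple(semantic_id[i:i + n]) for i in range(len(semantic_id) - n + 1)]

def piece_to_slot(piece, level, table_size):
    """Hash one (level, piece) pair into a row of a shared embedding table.

    Including the starting level in the key keeps identical code pairs
    occurring at different depths from colliding by construction.
    """
    key = f"{level}:{piece}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % table_size

sid = (1, 4, 6, 2)
slots = [piece_to_slot(p, i, table_size=2**20)
         for i, p in enumerate(ngram_pieces(sid, n=2))]
# Three bigrams -> three embedding-table lookups for this item.
```

An SPM-based adaptation would differ only in how the pieces are produced: instead of fixed-length n-grams, a SentencePiece model learned over the corpus of Semantic-ID sequences emits variable-length pieces, assigning dedicated pieces to frequent code patterns while rare patterns fall back to shorter fragments.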

Experiments and Results

The proposed approach was tested in a real-world YouTube video recommendation scenario, comparing the performance of Semantic IDs against traditional random hashing and direct content embeddings. The experiments demonstrated that Semantic IDs, particularly when adapted using SPM, significantly improved cold-start performance and overall recommendation quality by enhancing generalization capabilities without sacrificing memorization.

Performance Metrics

Several key performance metrics were used to evaluate the models:

  • CTR AUC: AUC of the click-through-rate (CTR) prediction task, measured over the full evaluation set.
  • CTR/1D AUC: the same AUC restricted to items less than one day old, measuring generalization to new, never-before-seen items introduced daily.
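The sliced metric can be made concrete with a small sketch. The data here is entirely hypothetical, and the rank-based AUC below is a minimal tie-free implementation, assuming the one-day slice is selected by item age:

```python
def auc(labels, scores):
    """Rank-based AUC: probability a random positive outranks a random negative."""
    pairs = sorted(zip(scores, labels))
    rank_sum = sum(r for r, (_, y) in enumerate(pairs, start=1) if y == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical eval rows: (click label, model score, item age in days).
rows = [(1, 0.90, 0.5), (0, 0.95, 0.5), (1, 0.70, 3.0),
        (0, 0.60, 3.0), (0, 0.10, 0.2), (1, 0.80, 10.0)]

overall = auc([y for y, _, _ in rows], [s for _, s, _ in rows])   # CTR AUC
fresh = [(y, s) for y, s, age in rows if age < 1.0]               # CTR/1D slice
fresh_auc = auc([y for y, _ in fresh], [s for _, s in fresh])     # CTR/1D AUC
```

A model can score well overall while ranking fresh items poorly, which is exactly the gap the CTR/1D AUC slice is designed to expose.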

These metrics highlighted the superior performance of SPM-based Semantic IDs in improving model generalization, especially evident in the CTR/1D AUC results.

Figure 2: Overall CTR AUC showing improvements with Semantic IDs.

Figure 3: CTR/1D AUC indicating enhanced generalization to cold-start items.

Semantic Hierarchies

Semantic IDs also demonstrated the ability to capture meaningful hierarchical structures in item categories, such as sports or food-vlogging videos, thus providing better contextual recommendations.

Figure 4: A sub-trie capturing the hierarchical structure of sports videos with Semantic IDs.
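The prefix structure behind such a sub-trie can be illustrated with a short sketch. The items and their Semantic IDs below are hypothetical; the point is only that items sharing leading codewords cluster together:

```python
from collections import defaultdict

def group_by_prefix(items, depth):
    """Group items whose Semantic IDs share the first `depth` codewords."""
    groups = defaultdict(list)
    for title, sid in items:
        groups[sid[:depth]].append(title)
    return dict(groups)

# Hypothetical items: shared leading codes stand for a coarse topic.
items = [
    ("nba highlights", (3, 7, 1, 9)),
    ("soccer goals",   (3, 7, 4, 2)),
    ("tennis rally",   (3, 2, 8, 5)),
    ("pasta recipe",   (6, 1, 0, 4)),
]

by_topic = group_by_prefix(items, depth=2)
# Prefix (3, 7) groups the two team-sport videos; the food video lands elsewhere.
```

Extending `depth` level by level yields the trie: coarse categories near the root, finer distinctions toward the leaves, mirroring the coarse-to-fine residual quantization that produced the IDs.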

Discussion and Implications

The introduction of Semantic IDs addresses several limitations of traditional ID-based recommendation systems, including item sparsity and memorization constraints. By leveraging structured semantic representations, the approach not only enhances cold-start performance but also reduces computational overhead compared to dense content embeddings.

The implications for large-scale industrial recommendation systems are notable. Adopting Semantic IDs could lead to improved user personalization and engagement by facilitating dynamic adaptation to shifting content distributions. Moreover, the ability to retain semantic hierarchies opens avenues for more nuanced content discovery and recommendation strategies.

Conclusion

The study successfully demonstrates that Semantic IDs offer a viable solution for improving generalization in recommendation models. By integrating these compact, semantically-aware representations, systems can better manage evolving item corpora and enhance user experience through more accurate and contextually relevant recommendations. Future work could explore optimizing the training of RQ-VAE models and expanding the application of Semantic IDs across other domains beyond video recommendations.
