FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets

Published 25 Sep 2025 in cs.IR | (2509.20904v2)

Abstract: Semantic identifiers (SIDs) have gained increasing attention in generative retrieval (GR) due to their meaningful semantic discriminability. However, current research on SIDs faces three main challenges: (1) the absence of large-scale public datasets with multimodal features, (2) limited investigation into optimization strategies for SID generation, which typically rely on costly GR training for evaluation, and (3) slow online convergence in industrial deployment. To address these challenges, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieR in Generative rEtrieval with industrial datasets. Specifically, FORGE is equipped with a dataset comprising 14 billion user interactions and multimodal features of 250 million items sampled from Taobao, one of the biggest e-commerce platforms in China. Leveraging this dataset, FORGE explores several optimizations to enhance the SID construction and validates their effectiveness via offline experiments across different settings and tasks. Further online analysis conducted on the "Guess You Like" section of Taobao's homepage shows a 0.35% increase in transaction count, highlighting the practical impact of our method. Regarding the expensive SID validation accompanied by the full training of GRs, we propose two novel metrics of SID that correlate positively with recommendation performance, enabling convenient evaluations without any GR training. For real-world applications, FORGE introduces an offline pretraining schema that reduces online convergence by half. The code and data are available at https://github.com/selous123/al_sid.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a novel SID generation framework using multimodal data and pretrained encoders to enable efficient generative retrieval.
It employs advanced optimization techniques including KNN and random-based post-processing to reduce SID collisions and ensure fair distribution.
The evaluation on a massive Taobao dataset demonstrates significant improvements in retrieval hitrate and a reduction in online convergence time.

FORGE: Forming Semantic Identifiers in Industrial Datasets

Introduction

The paper "FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets" (2509.20904) introduces a comprehensive framework aimed at enhancing generative retrieval systems in industrial applications, specifically through the construction and optimization of semantic identifiers (SIDs). SIDs are pivotal in generative retrieval due to their role in encoding meaningful semantic information, essential for accurate item recommendation and retrieval in vast datasets. This work presents several innovations, including a large-scale industrial dataset, SID generation strategies, and efficient online deployment techniques, all designed to address existing limitations in the field.

Figure 1: Overview of FORGE. (a) The generation process of SIDs. Left panel: the training of the text/image encoder.

SID Generation and Optimization

The FORGE framework tackles the SID formation process by leveraging multimodal data, including text and images, to derive rich semantic features. The paper proposes the use of pretrained multimodal LLMs and introduces a novel encoder-quantization approach for item tokenization into semantic identifiers. This method ensures that SIDs are both computationally efficient and semantically meaningful, offering robust representation capabilities across an exponentially large item space.

For optimization, FORGE employs advanced algorithms to mitigate ID collisions, ensuring fair distribution and high utilization of codebooks. These strategies, including KNN-based and random-based post-processing methods, enhance the granularity and fairness of SID assignments, crucial for maintaining consistent retrieval performance.

Figure 2: The effectiveness of embedding hitrate and Gini coefficient for SID evaluation. Left: Positive correlation between embedding hitrate and retrieval performance. Right: Lower Gini coefficient (fairer SID usage) leads to higher HR@500.

Generative Retrieval Model

The generative retrieval component of FORGE integrates the SIDs with LLM architectures such as Qwen and T5, allowing for efficient sequence prediction of recommended items. This approach facilitates the dynamic adaptation of beam search strategies during inference, optimizing computational resource usage while ensuring high-quality generation of retrieval candidates. The paper reports substantial gains in hitrate metrics across multiple configurations, underscoring the efficacy of the proposed SID generation and optimization techniques.

Dataset and Evaluation

FORGE introduces a groundbreaking dataset sourced from Taobao with 14 billion user interactions and multimodal features for 250 million items, vastly surpassing existing benchmarks in scale and diversity. This dataset serves as a valuable tool for offline experiments and online evaluations, providing insights into SID design choices and their impact on generative retrieval systems. Furthermore, FORGE pioneers the use of direct evaluation metrics, such as embedding hitrate and Gini coefficient, offering researchers cost-effective means to assess SID quality without extensive GR training.

Implications and Future Directions

The practical implications of FORGE are profound, with demonstrated improvements in transaction counts and retrieval metrics in real-world applications on Taobao's platform. The proposed offline pretraining schema offers a pragmatic solution for reducing online convergence time by half, highlighting the framework's adaptability and efficiency in industrial deployments.

Future research in this domain could further explore the integration of advanced machine learning techniques for SID generation, including deep reinforcement learning and hierarchical attention mechanisms. Additionally, expanding the framework to accommodate evolving e-commerce scenarios and user behavior patterns will enrich the applicability and robustness of generative retrieval systems.

Conclusion

The FORGE framework marks a significant advancement in the field of generative retrieval by addressing key challenges in SID formation and optimization within industrial datasets. Through its comprehensive benchmark and innovative methodologies, FORGE provides a robust foundation for future developments in semantic identifier construction, with implications extending across both theoretical research and practical applications in large-scale recommendation systems. The release of the dataset and evaluation tools fosters collaborative efforts, encouraging further contributions to this dynamic area of study.

Markdown Report Issue