
OneSearch: A Preliminary Exploration of the Unified End-to-End Generative Framework for E-commerce Search

Published 3 Sep 2025 in cs.IR (arXiv:2509.03236v1)

Abstract: Traditional e-commerce search systems employ multi-stage cascading architectures (MCA) that progressively filter items through recall, pre-ranking, and ranking stages. While effective at balancing computational efficiency with business conversion, these systems suffer from fragmented computation and optimization objective collisions across stages, which ultimately limit their performance ceiling. To address these limitations, we propose OneSearch, the first industrially deployed end-to-end generative framework for e-commerce search. This framework introduces three key innovations: (1) a Keyword-enhanced Hierarchical Quantization Encoding (KHQE) module that preserves both hierarchical semantics and distinctive item attributes while maintaining strong query-item relevance constraints; (2) a multi-view user behavior sequence injection strategy that constructs behavior-driven user IDs and incorporates both explicit short-term and implicit long-term sequences to model user preferences comprehensively; and (3) a Preference-Aware Reward System (PARS) featuring multi-stage supervised fine-tuning and adaptive reward-weighted ranking to capture fine-grained user preferences. Extensive offline evaluations on large-scale industry datasets demonstrate OneSearch's superior performance for high-quality recall and ranking. Rigorous online A/B tests confirm its ability to enhance relevance at the same exposure position, achieving statistically significant improvements: +1.67% item CTR, +2.40% buyers, and +3.22% order volume. Furthermore, OneSearch reduces operational expenditure by 75.40% and improves Model FLOPs Utilization from 3.26% to 27.32%. The system has been deployed across multiple search scenarios in Kuaishou, serving millions of users and generating tens of millions of PVs daily.

Summary

  • The paper presents an innovative unified generative framework that replaces multi-stage architectures with a single end-to-end model for e-commerce search.
  • It details a novel methodology combining keyword-enhanced hierarchical quantization encoding and multi-view behavior sequence injection to effectively capture user intent.
  • Empirical results demonstrate significant improvements in recall, CTR, and resource efficiency, underscoring the framework's industrial viability and potential to streamline search systems.

Introduction and Motivation

The paper introduces OneSearch, an industrial-scale, end-to-end generative retrieval framework for e-commerce search, designed to address the inherent limitations of traditional multi-stage cascading architectures (MCA). In conventional e-commerce search systems, the MCA paradigm segments retrieval into recall, pre-ranking, and ranking stages, each optimized for different objectives and computational constraints. This fragmentation leads to suboptimal global performance due to objective collisions, inefficient resource utilization, and limited ability to model user intent holistically.

OneSearch proposes a unified generative approach that directly maps user queries and behavioral context to item candidates, eliminating the need for multi-stage filtering and enabling joint optimization of relevance and personalization. The framework is deployed at scale on the Kuaishou platform, serving millions of users and demonstrating significant improvements in both offline and online metrics (Figure 1).

Figure 1: (a) The proposed End-to-End generative retrieval framework (OneSearch), (b) the traditional multi-stage cascading architecture in E-commerce search.

System Architecture and Key Innovations

OneSearch is architected around four principal components:

  1. Keyword-Enhanced Hierarchical Quantization Encoding (KHQE): This module encodes items and queries into semantic IDs (SIDs) using a hierarchical quantization schema, augmented with core keyword extraction to preserve essential attributes and suppress irrelevant noise. The encoding pipeline combines RQ-Kmeans for hierarchical clustering and OPQ for fine-grained residual quantization, ensuring high codebook utilization and independent coding rates.
  2. Multi-view Behavior Sequence Injection: User modeling is achieved by integrating explicit short-term and implicit long-term behavioral sequences. User IDs are constructed from weighted aggregations of recent and historical clicked items, and both short and long behavior sequences are injected into the model via prompt engineering and embedding aggregation, respectively. This multi-view approach enables comprehensive personalization.
  3. Unified Encoder-Decoder Generative Model: The system employs a transformer-based encoder-decoder architecture (e.g., BART, mT5, or Qwen3) to jointly model user, query, and behavioral context, generating item SIDs as output. The model is trained with a combination of supervised fine-tuning and preference-aware reinforcement learning.
  4. Preference-Aware Reward System (PARS): A multi-stage supervised fine-tuning process aligns semantic and collaborative representations, followed by an adaptive reward system that leverages hierarchical user behavior signals and list-wise preference optimization. The reward model is trained on real user interactions, incorporating CTR, CVR, and relevance signals (Figure 2).

Figure 2: The OneSearch framework: (1) KHQE for semantic encoding, (2) multi-view behavior sequence injection, (3) unified encoder-decoder generative retrieval, (4) preference-aware reward system.

Hierarchical Quantization Encoding and Tokenization

The KHQE module addresses the challenge of representing items with long, noisy, and weakly ordered textual descriptions. By extracting core keywords using NER and domain-specific heuristics, the encoding process emphasizes essential attributes (e.g., brand, category) and suppresses irrelevant tokens. The hierarchical quantization pipeline operates as follows:

  • RQ-Kmeans: Hierarchically clusters item embeddings, maximizing codebook utilization and independent coding rates.
  • OPQ: Quantizes residual embeddings to capture fine-grained, item-specific features.
  • Core Keyword Enhancement: Core keywords are embedded and averaged with item representations, further improving the discriminative power of SIDs.

Empirical results demonstrate that this approach yields higher recall and ranking performance than standard RQ-VAE or balanced k-means tokenization, with significant improvements in codebook utilization and independent coding rates (Figure 3).
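The residual-quantization step at the heart of this pipeline can be illustrated with a minimal sketch in plain NumPy. A toy k-means stands in for the production RQ-Kmeans clusterer, and the embedding dimensions, codebook sizes, keyword embeddings, and the averaging step are all illustrative assumptions; the final OPQ stage over the last residual is omitted for brevity:

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Toy k-means (stand-in for a production clusterer)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        codes = ((x[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (codes == j).any():
                centers[j] = x[codes == j].mean(0)
    return codes, centers

def rq_kmeans_sids(embs, levels=3, codebook=4):
    """Residual quantization: each level clusters what the previous
    level failed to explain, yielding one code per level (the SID)."""
    residual, sids = embs.copy(), []
    for _ in range(levels):
        codes, centers = kmeans(residual, codebook)
        sids.append(codes)
        residual = residual - centers[codes]   # pass residual down a level
    return np.stack(sids, axis=1)              # shape (n_items, levels)

rng = np.random.default_rng(1)
item_embs = rng.normal(size=(64, 8))
keyword_embs = rng.normal(size=(64, 8))        # hypothetical core-keyword embeddings
enhanced = (item_embs + keyword_embs) / 2      # core-keyword enhancement by averaging
sids = rq_kmeans_sids(enhanced)
print(sids.shape)                              # one hierarchical code tuple per item
```

Each row of `sids` is a coarse-to-fine path through the codebooks, which is what lets a generative decoder emit items token by token.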

Figure 3: Different hierarchical quantization encodings of items, illustrating the impact of KHQE and OPQ on SID assignment.

Multi-view User Behavior Modeling

OneSearch's user modeling strategy integrates three perspectives:

  • Behavior Sequence-Constructed User IDs: User IDs are computed as weighted sums of SIDs from recent and long-term clicked items, providing a semantically meaningful and behaviorally grounded identifier.
  • Explicit Short Behavior Sequences: Recent queries and clicked items are explicitly included in the model prompt, enabling the model to capture short-term intent shifts.
  • Implicit Long Behavior Sequences: Long-term behavioral patterns are aggregated via centroid embeddings at multiple quantization levels, efficiently encoding user profiles without excessive prompt length.

Ablation studies confirm that sequence-constructed user IDs and explicit/implicit behavior sequence injection yield substantial gains in both recall and ranking metrics, outperforming random or hashed user ID baselines.
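A behavior-constructed user ID of this kind can be sketched as follows. The recency-decay weighting, the decay value, and the nearest-centroid lookup per level are assumptions for illustration; the paper's exact aggregation may differ:

```python
import numpy as np

def behavior_user_id(clicked_embs, centroids_per_level, decay=0.8):
    """Recency-weighted average of clicked-item embeddings, quantized
    against each level's codebook to give a hierarchical user ID."""
    n = len(clicked_embs)
    w = decay ** np.arange(n)[::-1]                 # newest click weighs most
    profile = (w[:, None] * clicked_embs).sum(0) / w.sum()
    uid, residual = [], profile
    for centroids in centroids_per_level:           # one codebook per level
        c = int(((centroids - residual) ** 2).sum(1).argmin())
        uid.append(c)
        residual = residual - centroids[c]          # refine at the next level
    return tuple(uid)

rng = np.random.default_rng(0)
clicks = rng.normal(size=(5, 8))                    # five recent clicked items
codebooks = [rng.normal(size=(4, 8)) for _ in range(3)]
print(behavior_user_id(clicks, codebooks))          # a 3-level hierarchical ID
```

Because the ID lives in the same codebook space as item SIDs, similar shoppers land on nearby codes, which is what makes it more informative than a random or hashed identifier.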

Unified Generative Retrieval and Training Paradigm

The encoder-decoder model ingests the full user context and outputs item SIDs via constrained or unconstrained beam search. Training proceeds in three supervised fine-tuning stages:

  1. Semantic Content Alignment: Aligns SIDs with textual descriptions and category information.
  2. Co-occurrence Synchronization: Models collaborative relationships between queries and items.
  3. User Personalization Modeling: Incorporates user IDs and behavior sequences for personalized generation.

A sliding window data augmentation strategy is applied to short behavior sequences, enhancing the model's ability to generalize to users with limited history.
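One plausible reading of this augmentation, sketched below, turns every prefix of a behavior sequence into a (context, next-item) training pair; the prefix-based construction and the `min_len` hyperparameter are assumptions, not details from the paper:

```python
def sliding_window_examples(seq, min_len=1):
    """Each prefix of the behavior sequence becomes a training example
    predicting the next element, so users with short histories are
    represented at every prefix length."""
    return [(seq[:end], seq[end]) for end in range(min_len, len(seq))]

# A toy sequence mixing a query token with clicked items (names are illustrative).
pairs = sliding_window_examples(["q:shoes", "itemA", "itemB", "itemC"])
for ctx, target in pairs:
    print(ctx, "->", target)
```

A four-element sequence thus yields three supervised examples instead of one, which is the generalization benefit the text describes.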

Preference-Aware Reward System and Hybrid Ranking

The reward system is designed to optimize both relevance and conversion objectives:

  • Adaptive Reward Signal: User interactions are categorized into six levels, with adaptive weights derived from calibrated CTR and CVR metrics. The reward model is a three-tower architecture predicting CTR, CVR, and CTCVR, with an additional relevance score.

List-wise DPO training is used to align the generative model's output with the reward model's ranking, followed by further fine-tuning on pure user interaction data to overcome the limitations of reward model distillation.
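For background, the standard pairwise DPO objective that list-wise variants extend is the following (this is the general formulation from the DPO literature, with preferred output $y_w$, dispreferred output $y_l$, and reference policy $\pi_{\mathrm{ref}}$; it is not the paper's exact list-wise loss):

```latex
\mathcal{L}_{\mathrm{DPO}} =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

A list-wise variant replaces the single preferred/dispreferred pair with an ordering over a candidate list, here supplied by the reward model's ranking.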

This hybrid approach enables OneSearch to balance relevance and personalization jointly, surpassing the performance ceiling of traditional MCA systems, whose stages optimize these objectives separately.
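The adaptive weighting described above can be sketched roughly as follows. The six level names, the weight values, and the 0.4/0.3/0.3 blend coefficients are illustrative assumptions; only the level-weighted structure and the CTCVR = CTR × CVR identity come from the text:

```python
# Hypothetical six interaction levels, ordered by strength of preference signal;
# the names and weight values are placeholders, not the paper's calibration.
LEVEL_WEIGHTS = {
    "exposure": 0.1, "click": 0.3, "long_view": 0.5,
    "cart": 0.7, "order": 0.9, "repeat_order": 1.0,
}

def adaptive_reward(ctr, cvr, relevance, level):
    """Blend predicted CTR, CTCVR (= CTR * CVR), and relevance, then scale
    by the adaptive weight of the observed interaction level."""
    ctcvr = ctr * cvr
    base = 0.4 * ctr + 0.3 * ctcvr + 0.3 * relevance
    return LEVEL_WEIGHTS[level] * base

# Stronger interactions yield strictly larger rewards for the same predictions.
print(adaptive_reward(0.10, 0.05, 0.8, "click"))
print(adaptive_reward(0.10, 0.05, 0.8, "order"))
```

Scaling by interaction level lets a single scalar reward encode that an order is stronger evidence of preference than a click, without separate per-stage objectives.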

Experimental Results

Offline Evaluation

OneSearch is evaluated on a large-scale industry dataset from Kuaishou's mall search platform. Key findings include:

  • Recall and Ranking: OneSearch achieves higher recall (HR@350) and comparable or superior ranking (MRR@350) relative to the online MCA baseline.

  • Ablation Studies: KHQE, OPQ, and multi-view behavior sequence injection each contribute significant performance gains. The system is robust to item pool changes, maintaining high codebook utilization and independent coding rates over time (Figure 4).

Figure 4: ICR and SID ratio indicators of RQ-Kmeans over time, demonstrating stability under dynamic item pool conditions.

Online A/B Testing

Deployed on the Kuaishou platform, OneSearch demonstrates:

  • CTR and Conversion Gains: Statistically significant improvements of +1.67% item CTR, +2.40% buyers, and +3.22% order volume.

  • Resource Efficiency: Model FLOPs Utilization increases from 3.26% (MCA) to 27.32% (OneSearch), and operational expenditure is reduced by 75.40% (Figure 5).

Figure 5: Comparisons of MFU and OPEX for the online MCA system and OneSearch, highlighting substantial resource efficiency improvements.

  • Industry and Query Coverage: Gains are observed across 28 of the top 30 industries and for queries of all popularity levels, including long-tail queries (Figure 6).

Figure 6: Online CTR relative gains for the top 30 industries, showing broad applicability of OneSearch.

  • Manual Evaluation: Human assessment shows increases in page good rate, item quality, and query-item relevance, confirming improvements in user experience.

Implications and Future Directions

OneSearch demonstrates that unified, end-to-end generative retrieval can replace complex, fragmented MCA pipelines in industrial e-commerce search, yielding improvements in both user engagement and system efficiency. The framework's modular design—combining advanced quantization, multi-view user modeling, and preference-aware optimization—enables robust adaptation to dynamic item pools and evolving user behavior.

Key implications:

  • Practical: OneSearch reduces operational complexity, improves hardware utilization, and enhances user experience at scale.
  • Theoretical: The results challenge the necessity of multi-stage architectures for large-scale retrieval, suggesting that joint modeling of relevance and personalization is feasible and beneficial.

Future work should focus on real-time tokenization for streaming data, further reinforcement learning for preference alignment, and integration of multi-modal item features (e.g., images, video) to enhance semantic understanding and reasoning.

Conclusion

OneSearch establishes a new paradigm for e-commerce search by unifying retrieval and ranking in a single generative model, leveraging hierarchical quantization, multi-view user modeling, and preference-aware optimization. Extensive offline and online evaluations confirm its superiority over traditional MCA systems in both effectiveness and efficiency. The deployment at scale on Kuaishou demonstrates its industrial viability and sets a benchmark for future research in generative retrieval for search and recommendation.


Knowledge Gaps

Unresolved gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper. Each point is phrased to be directly actionable for future research.

  • Data transparency and reproducibility: Provide full dataset specifications (query and item counts, time windows, language mix, category distribution, prevalence of long-tail queries/items, and cold-start proportions), train/validation/test splits, and sampling strategies used across SFT stages and reward-model training.
  • Detailed evaluation protocol: Report comprehensive offline metrics (e.g., NDCG, MAP, recall/precision at multiple cutoffs, calibration metrics) beyond Recall@10/MRR@10, along with per-query breakdowns (short vs. long queries, head vs. tail queries, attribute-heavy vs. minimal queries) to evidence relevance and personalization.
  • Statistical rigor of A/B tests: Document sample sizes, traffic allocation per scenario, confidence intervals, and exact statistical tests (including p-values) for all online metrics to substantiate “statistically significant” improvements.
  • Latency and serving SLOs: Quantify end-to-end latency (mean, P95, P99), throughput, and memory footprint under production loads for constrained vs. unconstrained beam search, and compare to MCA across different search entry points (homepage, mall, detail page).
  • SID-to-item resolution and collision handling: Specify how generated SIDs map to concrete items when multiple items share hierarchical SIDs, the role of OPQ codes in disambiguation, fallback strategies for invalid/unavailable/out-of-stock items, and tie-breaking rules in list construction.
  • Vocabulary/codebook dynamics under catalog churn: Describe incremental or online codebook update strategies, SID stability under frequent item additions/removals (e.g., large shopping festivals), and mechanisms to avoid catastrophic remapping or drift.
  • Missing attribute taxonomy and NER reliability: The paper references “18 structured attributes” but does not enumerate them. Provide the exact attribute list, NER model details, accuracy/coverage per attribute, domain-specific errors (e.g., multi-attribute conflicts), and the impact of NER failures on KHQE effectiveness.
  • Keyword extraction robustness: Evaluate Qwen-VL’s keyword extraction error rates, domain shift (e.g., noisy seller content, multi-lingual listings), adversarial/spam susceptibility, and performance under incomplete or image-only item content.
  • Query understanding limits: Assess handling of negation (e.g., “not brand X”), compositional constraints (color + size + material), typos/code-switching/morphological variants, and rare attribute values; specify any explicit constraint-checking beyond SID matching.
  • Cold-start performance: Include targeted experiments for cold-start users/items/queries, quantify gains vs. MCA, and analyze the impact of default behavior sequences (popularity-based) on personalization quality and popularity bias.
  • Bias and debiasing in reward signals: CTR/CVR-based rewards are exposure-biased. Introduce counterfactual estimators (e.g., IPS/DR), position bias correction, and temporal drift handling; compare calibrated vs. debiased rewards and measure fairness impacts (e.g., across sellers/categories).
  • Diversity and serendipity: Analyze whether end-to-end generation reduces catalog diversity or increases homogeneity; introduce and report diversity metrics (e.g., intra-list diversity, coverage) and trade-offs with CTR/CVR.
  • Safety and business constraints: Clarify how OneSearch enforces business rules (e.g., seller fairness, regulatory constraints, prohibited items), avoids unsafe generations, and integrates inventory/price changes and deduplication in real time.
  • Multi-modal utilization gaps: Beyond keyword extraction via Qwen-VL and OCR, the generative model does not appear to ingest image/video features end-to-end. Explore multi-modal encoders for item and query representations and quantify gains.
  • Ablation and sensitivity analyses: Present ablations isolating the impact of KHQE (keyword enhancement, RQ layers, L3-balanced k-means), OPQ code size choices, short vs. long sequence injections, and reward model weights (including the amplified relevance term; 10× λ4), with sensitivity curves.
  • Constrained vs. unconstrained decoding: Provide empirical comparisons on accuracy, relevance, invalid SID rate, diversity, and latency trade-offs across beam widths and constraints; define fallback mechanisms for unconstrained decoding.
  • SFT stage interdependencies: Detail training schedules, hyperparameters, sample sizes per stage, curriculum effects, and convergence diagnostics; quantify how each SFT stage contributes to final relevance/personalization and whether skipping stages degrades performance.
  • Stability and continual learning: Investigate catastrophic forgetting and stability when updating models/reward signals, and propose/measure continual learning strategies under shifting user behavior and catalog seasonality.
  • Reward model architecture and training details: Specify the three-tower model architecture, features, hyperparameters, negative sampling strategy, and the exact computation of the offline relevance score SRel; compare using the MCA ranking model as a proxy vs. bespoke reward training.
  • List-wise DPO training details: Define δ, α, reference model selection, negative set construction, and training stability; report the impact of list-wise DPO vs. pairwise alternatives on ranking quality and robustness.
  • Generalization across domains and languages: Evaluate transferability to other marketplaces, languages, and verticals with different attribute structures; analyze adaptation costs and performance under cross-domain deployment.
  • Privacy and compliance: Discuss storage/retention of long behavior sequences, user-ID construction from behavior, opt-out mechanisms, and compliance with privacy regulations; quantify privacy-preserving alternatives (e.g., on-device computation, federated learning).
  • Failure case analysis: Provide qualitative and quantitative analyses of typical failure modes (e.g., attribute mismatch, brand mismatches, overfitting to recent clicks, poor tail-query handling) to guide targeted mitigation strategies.
  • MFU/OPEX claims: Define Model FLOPs Utilization explicitly (measurement method, baselines, hardware specs) and decompose OPEX savings (compute, storage, network, engineering overhead) to enable independent validation and replication.
  • Benchmarking vs. strong baselines: Compare OneSearch end-to-end performance against state-of-the-art GR baselines tailored to search (e.g., GRAM, GenR-PO) and strong MCA variants, using unified protocols and identical traffic slices.
  • Operational integration details: Document caching strategies, user-state freshness, consistency across microservices, failure recovery, and monitoring/alerting for production deployment to inform reliability engineering practices.


Authors (21)
