Explainable LLM-driven Multi-dimensional Distillation for E-Commerce Relevance Learning

Published 20 Nov 2024 in cs.IR, cs.AI, and cs.CL | arXiv:2411.13045v2

Abstract: Effective query-item relevance modeling is pivotal for enhancing user experience and safeguarding user satisfaction in e-commerce search systems. Recently, benefiting from their vast inherent knowledge, LLM-based approaches have demonstrated strong performance and long-tail generalization ability compared with previous specialized neural relevance-learning methods. Though promising, current LLM-based methods encounter the following inadequacies in practice: First, their massive parameter counts and computational demands make them difficult to deploy online. Second, distilling LLMs into online models is a feasible direction, but LLM relevance modeling is a black box, and its rich intrinsic knowledge is difficult to extract and apply online. To improve the interpretability of LLMs and boost the performance of online relevance models via LLMs, we propose an Explainable LLM-driven Multi-dimensional Distillation framework for e-commerce relevance learning, which comprises two core components: (1) An Explainable LLM for relevance modeling (ELLM-rele), which decomposes relevance learning into intermediate steps and models it as Chain-of-Thought (CoT) reasoning, thereby enhancing both the interpretability and the performance of the LLM. (2) A Multi-dimensional Knowledge Distillation (MKD) architecture that transfers the knowledge of ELLM-rele to currently deployable interaction-based and representation-based student models, from both the relevance-score-distribution and CoT-reasoning aspects. By distilling both probabilistic and CoT reasoning knowledge, MKD improves the semantic interaction and long-tail generalization abilities of student models. Extensive offline evaluations and online experiments on the Taobao search-ads scenario demonstrate that the proposed framework significantly enhances e-commerce relevance learning performance and user experience.
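The two distillation channels the abstract describes (relevance-score distribution and CoT reasoning) can be illustrated with a toy objective. Everything below, including the function names, the temperature-scaled KL term, and the tag-supervision form of the CoT term, is an illustrative assumption rather than the paper's actual implementation.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def multi_dim_distill_loss(teacher_logits, student_logits,
                           tag_log_probs, gold_tags, T=2.0, lam=0.5):
    """Toy two-term objective: score-distribution KL + CoT tag NLL.

    teacher_logits / student_logits: logits over (Good, Bad).
    tag_log_probs: per-token log-probabilities the student assigns to
    each aspect tag; gold_tags: indices of LLM-derived CoT aspect tags.
    """
    # (1) Score-distribution distillation over the binary Good/Bad axis.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    score_loss = kl_divergence(p_teacher, p_student)
    # (2) CoT-reasoning distillation as token-level tag supervision.
    cot_loss = -sum(lp[t] for lp, t in zip(tag_log_probs, gold_tags)) / len(gold_tags)
    return score_loss + lam * cot_loss
```

When teacher and student agree exactly, the first term vanishes and only the CoT tag term remains, which makes the relative weight `lam` easy to reason about.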

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of unresolved issues and concrete research questions that emerge from the paper and could guide future work:

  • Reproducibility is limited: the dataset is proprietary (Taobao ad search logs), no public benchmark or code is provided, and Appendix A/B/C (aspect schema, long-tail and case analyses) are not available; can the approach be validated on open datasets and shared toolchains?
  • Aspect schema definition is underspecified: the precise set of attributes (e.g., category, brand, gender, etc.), their extraction rules, coverage, and ambiguity resolution are not detailed; how should the aspect taxonomy be designed, maintained, and adapted to evolving business needs?
  • CoT annotation quality is not evaluated beyond matching final labels: there is no measurement of correctness, faithfulness, consistency, or inter-annotator agreement of the generated reasoning steps; how accurate and reliable are the CoT rationales, and do they truly reflect evidence rather than post-hoc justifications?
  • CoT selection procedure may induce bias: among 10 self-consistency samples, the paper selects the one whose final judgment matches the human label, but does not address cases where none match, nor analyze the impact of selection bias on training; what is the failure rate and how should contradictory CoTs be handled?
  • Noise sensitivity of CoT-based distillation is not studied: the effect of erroneous or hallucinated CoTs on student models (CRF tagging and attention regulation) is unknown; how robust is MKD to label noise in CoT tags, and can confidence-weighting or denoising improve stability?
  • Score distribution distillation is under-specified: the mapping from teacher token probabilities to a student “score distribution” via KL divergence is ambiguous (teacher provides a scalar after normalization over Good/Bad; student outputs a score or logits); how should the distributions be formed and calibrated, and does temperature scaling consistently help?
  • Calibration and uncertainty are not evaluated: the teacher and student scores’ calibration, confidence thresholds, and their use in filtering pseudo-labels are not reported; can uncertainty-aware distillation improve performance and reduce error propagation?
  • Pseudo-labeling at scale lacks quality controls: 30M unlabeled pairs are used without describing confidence thresholds, sampling strategies, domain balancing, or noise mitigation; what selection strategies maximize gains while minimizing error amplification?
  • Limited offline metrics: the paper reports ROC-AUC and Neg PR-AUC for binary classification but does not evaluate ranking metrics (e.g., NDCG@k, MRR) that better reflect search quality; how does ELLM-MKD affect ranking performance across candidate lists?
  • Online A/B test lacks statistical detail: the reported CTR (+0.17%) and Goodrate (+0.89% overall, +1.96% long-tail) improvements do not include confidence intervals, p-values, or variance; are the gains statistically robust and consistent across cohorts?
  • Generalization across domains, languages, and scripts is untested: data are Chinese e-commerce titles/queries; can the approach generalize to English or multilingual marketplaces, and to domains with different attribute structures?
  • Faithfulness vs. plausibility of explanations is unvalidated: beyond interpretability claims, there is no user study or extrinsic evaluation demonstrating that CoT outputs aid annotators, debugging, or trust; how useful and trustworthy are explanations to real stakeholders?
  • Reliance on unstructured item titles only: the method ignores structured catalog attributes (e.g., category taxonomy, brand metadata) and multimodal signals (images, specs); can structured/multimodal inputs strengthen CoT reasoning and distillation?
  • Adaptivity to evolving taxonomies and new attributes is unclear: CoT and distillation depend on a fixed aspect schema; what mechanisms allow incremental updates, onboarding new attributes, and handling dynamic catalogs without full retraining?
  • Integration into multi-stage retrieval pipelines is not explored: the impact of MKD on candidate generation vs. re-ranking stages, and potential cascading effects, are not analyzed; where in the pipeline does CoT-derived interaction knowledge yield maximal ROI?
  • Representation-based student training requires cross-attention during training: attention regulation introduces token-level cross-interactions at train time, but inference remains dual-encoder; how does this added training complexity scale, and is there a simpler proxy that retains gains?
  • Hyperparameter sensitivity and optimization stability are not studied: the weights (λ1, λ2, λ3), KL temperature, tagging scheme, and pooling choices could materially affect outcomes; what are the optimal settings and their robustness across models and datasets?
  • Distillation to smaller student sizes is not tested: all students are ~101M parameters; can MKD reliably benefit much smaller models (e.g., <50M) needed for strict latency budgets or edge deployment?
  • Handling noisy, short, or adversarial queries is unaddressed: misspellings, slang, code-mixing, and adversarial inputs common in e-commerce search are not evaluated; how robust are ELLM-rele and MKD under such perturbations?
  • Long-tail analysis is referenced but not detailed: the types of long-tail (rare queries, novel products, new brands), their distributions, and per-category performance breakdowns are missing; which long-tail segments benefit most and why?
  • Multi-class judgments beyond Good/Bad are not leveraged: although CoT can express fine-grained reasons (e.g., “Brand mismatch”), the final distillation reduces to binary tokens; can multi-class or structured outputs be exploited to teach richer decision boundaries?
  • Data and prompt design are only partially described: the exact system prompt S, few-shot examples E, and parsing rules for CoT tags are not included; how do prompt variants and example selection influence annotation quality and downstream gains?
  • Teacher model selection and capacity scaling are not analyzed: CoT annotations are generated with Qwen2-72B and Llama3-70B and distilled into a 7B teacher; what is the trade-off between teacher size, annotation quality, cost, and student performance?
  • Privacy and sensitive attribute use are not considered: CoT examples include gender; there is no discussion of fairness/bias, protected attributes, or compliance; how to ensure equitable relevance judgments and avoid discriminatory reasoning?
  • Lifecycle and maintenance strategy is unspecified: how often should teachers be updated, CoTs regenerated, and students re-distilled to track catalog drift and seasonality? What are the operational costs and best practices?
  • Efficiency measurements for LLM inference are missing: Table 3 leaves concrete LLM training/inference times blank (shown as dashes), even though the increased CoT output length is noted; what are the actual latencies and throughput, and how do they affect offline workflows?
  • Negative sampling and training objectives for students are under-detailed: e-commerce relevance often benefits from pairwise/triplet losses and hard negatives; how do different objectives and sampling strategies interact with MKD?
  • Interaction between score and CoT distillation is only ablated at high level: the joint optimization dynamics, potential conflicts, and curriculum strategies (e.g., staging score vs. CoT distillation) are not explored; can coordinated scheduling improve convergence and gains?
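The CoT-selection-bias gap above (picking, among self-consistency samples, the rationale whose verdict matches the human label, with unmatched cases unaddressed) can be made concrete. The function names and the first-match tie-breaking policy below are illustrative assumptions, not the paper's procedure; the point is that surfacing a `None` makes the failure rate measurable.

```python
def select_cot(samples, human_label):
    """Pick the first sampled (cot_text, verdict) pair whose verdict
    matches the human label; return None when no sample agrees, so the
    caller can measure the failure rate instead of silently dropping
    the query-item pair."""
    for cot_text, verdict in samples:
        if verdict == human_label:
            return cot_text
    return None

def failure_rate(batches):
    """Fraction of (samples, human_label) pairs with no matching CoT."""
    misses = sum(1 for samples, label in batches
                 if select_cot(samples, label) is None)
    return misses / len(batches)
```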
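The pseudo-labeling and calibration gaps above both point at confidence-aware filtering of teacher outputs. A minimal sketch, assuming the teacher emits a normalized probability of "Good" per pair; the threshold value and the symmetric keep-rule are hypothetical choices, not from the paper.

```python
def filter_pseudo_labels(pairs, threshold=0.9):
    """Keep only (query, item, teacher_prob_good) triples whose teacher
    confidence is decisive in either direction, attaching a hard
    Good/Bad label; ambiguous mid-range pairs are dropped to limit
    error amplification during large-scale distillation."""
    kept = []
    for query, item, p_good in pairs:
        if p_good >= threshold:
            kept.append((query, item, "Good"))
        elif p_good <= 1 - threshold:
            kept.append((query, item, "Bad"))
    return kept
```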
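The offline-metrics gap above asks for ranking measures beyond ROC-AUC / Neg PR-AUC. For reference, NDCG@k and MRR over graded-relevance lists can be computed as follows (standard textbook definitions, not code from the paper):

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain of a relevance list truncated at k."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """DCG@k normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def mrr(list_of_rels):
    """Mean reciprocal rank of the first relevant (rel > 0) item."""
    total = 0.0
    for rels in list_of_rels:
        for i, r in enumerate(rels):
            if r > 0:
                total += 1.0 / (i + 1)
                break
    return total / len(list_of_rels)
```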
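The A/B-testing gap above notes that the reported CTR and Goodrate lifts lack significance statistics. A standard check for a CTR difference is the pooled two-proportion z-test, sketched below; the click counts in any usage are illustrative, and this normal approximation is one common choice, not the evaluation the paper ran.

```python
import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a CTR difference between control (a) and
    treatment (b); returns (z, p_value) under the pooled-variance
    normal approximation."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal tail, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```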
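The hyperparameter-sensitivity gap above can at least be probed systematically. A minimal grid-sweep harness over the loss weights and KL temperature; the parameter names (`lam1`, `lam2`, `lam3`, `T`) mirror the λ1, λ2, λ3 and temperature mentioned in the list, but the evaluation callback and grid values are placeholders, not the paper's setup.

```python
from itertools import product

def sweep(evaluate, lam1_grid, lam2_grid, lam3_grid, T_grid):
    """Exhaustively evaluate each (lam1, lam2, lam3, T) combination with
    a caller-supplied `evaluate` callback returning a scalar metric
    (higher is better); returns the best config and its score."""
    best_cfg, best_score = None, float("-inf")
    for lam1, lam2, lam3, T in product(lam1_grid, lam2_grid, lam3_grid, T_grid):
        score = evaluate(lam1=lam1, lam2=lam2, lam3=lam3, T=T)
        if score > best_score:
            best_cfg, best_score = (lam1, lam2, lam3, T), score
    return best_cfg, best_score
```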
