Mine and Refine: Optimizing Graded Relevance in E-commerce Search Retrieval

Published 19 Feb 2026 in cs.IR and cs.LG | (2602.17654v1)

Abstract: We propose a two-stage "Mine and Refine" contrastive training framework for semantic text embeddings to enhance multi-category e-commerce search retrieval. Large-scale e-commerce search demands embeddings that generalize to long-tail, noisy queries while adhering to scalable supervision compatible with product and policy constraints. A practical challenge is that relevance is often graded: users accept substitutes or complements beyond exact matches, and production systems benefit from clear separation of similarity scores across these relevance strata for stable hybrid blending and thresholding. To obtain scalable, policy-consistent supervision, we fine-tune a lightweight LLM on human annotations under a three-level relevance guideline and further reduce residual noise via engagement-driven auditing. In Stage 1, we train a multilingual Siamese two-tower retriever with a label-aware supervised contrastive objective that shapes a robust global semantic space. In Stage 2, we mine hard samples via ANN and re-annotate them with the policy-aligned LLM, and introduce a multi-class extension of circle loss that explicitly sharpens similarity boundaries between relevance levels, to further refine and enrich the embedding space. Robustness is additionally improved through additive spelling augmentation and synthetic query generation. Extensive offline evaluations and production A/B tests show that our framework improves retrieval relevance and delivers statistically significant gains in engagement and business impact.


Summary

The paper presents a two-stage framework that combines supervised contrastive loss with a multi-class circle loss to optimize graded relevance in e-commerce search.
It leverages LLM-based re-annotation, hard negative mining, and spelling augmentation to enhance semantic embedding separability and robustness.
The approach achieves significant offline improvements in metrics like NDCG and online gains in add-to-cart and conversion rates.

Authoritative Summary of "Mine and Refine: Optimizing Graded Relevance in E-commerce Search Retrieval" (2602.17654)

Motivation and Problem Setting

The paper addresses graded, policy-driven relevance in embedding-based retrieval (EBR) systems for e-commerce search. E-commerce platforms operate multi-category catalogs, face noisy and long-tail queries, and require robust, scalable semantic search. Exact matches, substitutes, and complements must be properly represented, and retrieval systems must enforce clear boundaries in similarity scores for stable hybrid blending and thresholding. Traditional binary relevance training pipelines are inadequate for capturing these nuanced requirements, and naive hard negative mining can exacerbate label noise and compromise generalization, especially in long-tail scenarios.

Mine and Refine Framework

The proposed solution is a two-stage contrastive training pipeline termed "Mine and Refine." The labeling workflow begins with human annotation following a three-level relevance guideline (relevant, moderately relevant, irrelevant). It is then scaled by fine-tuning a lightweight LLM (gpt-4o-mini), which achieves $87.6\%$ 3-class accuracy and $98.8\%$ within-1 accuracy. An engagement-guided audit, leveraging more capable LLMs and expert re-annotation, reduces label errors by $5.74\%$, ensuring policy alignment.
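The two labeling-quality metrics quoted above can be illustrated with a minimal sketch. It assumes grades are encoded as integers (2 = relevant, 1 = moderately relevant, 0 = irrelevant) and that "within-1" means the predicted grade differs from the human grade by at most one level. The function names and the toy data are hypothetical, not taken from the paper.

```python
def three_class_accuracy(gold, pred):
    """Fraction of pairs whose predicted grade equals the human grade."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def within_one_accuracy(gold, pred):
    """Fraction of pairs whose predicted grade is off by at most one level."""
    return sum(abs(g - p) <= 1 for g, p in zip(gold, pred)) / len(gold)


# Toy example (made-up labels): one adjacent-level error and one
# two-level error among six query-item pairs.
gold = [2, 2, 1, 0, 0, 1]
pred = [2, 1, 1, 0, 2, 1]
exact = three_class_accuracy(gold, pred)    # 4 of 6 grades match exactly
near = within_one_accuracy(gold, pred)      # 5 of 6 are within one level
```

A high within-1 score alongside a lower exact score would indicate that most labeling errors fall between adjacent grades rather than confusing relevant with irrelevant.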

Stage 1 utilizes a Siamese multilingual two-tower encoder, initialized from a pretrained 0.1B parameter backbone and trained with supervised contrastive loss (SupCon). This stage forms a globally robust semantic space.
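As a rough illustration of the Stage 1 objective, the sketch below computes a supervised contrastive (SupCon-style) loss over a batch of query and item embeddings. It is a generic form of the loss, not the paper's label-aware variant: which items count as positives for each query is passed in as `pos_mask`, and how the three relevance grades map onto that mask is left as an assumption. Embeddings are assumed to be L2-normalized, and the temperature `tau` is a hypothetical hyperparameter.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def supcon_loss(queries, items, pos_mask, tau=0.07):
    """Average over queries of -mean(log-softmax score of each positive item).

    queries:  list of query embeddings (assumed L2-normalized)
    items:    list of item embeddings (assumed L2-normalized)
    pos_mask: pos_mask[i][j] is True if item j is a positive for query i
    """
    losses = []
    for q, mask in zip(queries, pos_mask):
        logits = [dot(q, it) / tau for it in items]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        pos_log_probs = [l - log_z for l, is_pos in zip(logits, mask) if is_pos]
        if pos_log_probs:  # queries without positives contribute nothing
            losses.append(-sum(pos_log_probs) / len(pos_log_probs))
    return sum(losses) / len(losses) if losses else 0.0


# Two orthogonal toy embeddings: the loss is smaller when each query's
# positive is the item it is aligned with.
vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned = supcon_loss(vecs, vecs, [[True, False], [False, True]], tau=1.0)
swapped = supcon_loss(vecs, vecs, [[False, True], [True, False]], tau=1.0)
```

Because every positive is scored against all items in the batch, the loss shapes the whole similarity distribution at once, which matches the role of Stage 1 as a global shaping step.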

Stage 2 conducts offline mining of hard cases using ANN retrieval, re-annotates confusable pairs with the policy-aligned LLM, and trains with a novel multi-class extension of circle loss. This loss explicitly sharpens boundaries between relevance strata, addressing similarity score separability for downstream serving and thresholding.

Figure 1: Mine and Refine: Stage 1 uses SupCon, Stage 2 uses circle loss with self-pacing and definite convergence, enforcing strong separation among relevance levels.

Model Design and Data Augmentation

The architecture is based on a Siamese encoder with dual projection heads, facilitating efficient real-time query encoding. Item representations are constructed by concatenating item names with taxonomy paths, enhancing semantic discrimination. For items lacking positive query-item pairs, synthetic queries are generated from item features and labeled with the fine-tuned LLM. Robustness to misspellings is achieved by additive spelling augmentation, which uses NeuSpell's probabilistic noise injection and retains both clean and noisy variants. This augmentation is shown to improve both recall and precision on misspelled queries, outperforming regularization-based and in-place substitution methods.
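The item-text construction and the "additive" flavor of spelling augmentation can be sketched as follows. Several details are assumptions made for illustration: the separator, the choice to keep the first `levels` taxonomy nodes, the `noise_prob` parameter, and the `inject_typo` helper, which is a toy stand-in (adjacent character swap) and not NeuSpell's noise model.

```python
import random


def build_item_text(name, taxonomy_path, levels=2, sep=" | "):
    """Concatenate the item name with the first `levels` taxonomy nodes."""
    return sep.join([name] + list(taxonomy_path[:levels]))


def inject_typo(query, rng):
    """Hypothetical noise model: swap two adjacent characters."""
    if len(query) < 2:
        return query
    i = rng.randrange(len(query) - 1)
    chars = list(query)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def additive_spelling_augment(pairs, noise_prob=0.3, seed=0):
    """Keep every clean (query, item_text, grade) triple and, with
    probability `noise_prob`, add a misspelled copy with the same grade."""
    rng = random.Random(seed)
    out = []
    for query, item_text, grade in pairs:
        out.append((query, item_text, grade))  # the clean pair is never dropped
        if rng.random() < noise_prob:
            noisy = inject_typo(query, rng)
            if noisy != query:
                out.append((noisy, item_text, grade))
    return out


item_text = build_item_text("Organic Whole Milk", ["Grocery", "Dairy", "Milk"])
augmented = additive_spelling_augment([("whole milk", item_text, 2)], noise_prob=1.0)
```

The contrast with in-place substitution is that the clean query always stays in the training set, so the model never loses the correctly spelled signal.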

Contrastive Objective and Circle Loss Extension

The supervised contrastive loss is extended for three-level relevance, leveraging label-aware batch composition. Circle loss, originally introduced for binary metric learning, is generalized to multi-class relevance. Adaptive weighting and precise margin boundaries between classes ($\Delta_{2,p},\,\Delta_{1,p},\,\Delta_{1,n},\,\Delta_{0,n}$) enforce intra-class compactness and inter-class separability.
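For reference, the original binary circle loss for one anchor is $\log\big[1 + \sum_j \exp(\gamma\,\alpha_n^j (s_n^j - \Delta_n)) \sum_i \exp(-\gamma\,\alpha_p^i (s_p^i - \Delta_p))\big]$ with $\alpha_p^i = [O_p - s_p^i]_+$ and $\alpha_n^j = [s_n^j - O_n]_+$. The sketch below implements that form and then shows one possible reading of a three-level extension: a sum of two binary terms, grade 2 against grade 1 (margins $\Delta_{2,p}$, $\Delta_{1,n}$) and grade 1 against grade 0 (margins $\Delta_{1,p}$, $\Delta_{0,n}$). This decomposition, the `slack` parameter that sets the optima $O_p$ and $O_n$, and the margin values are all assumptions for illustration; the paper's exact formulation is not reproduced here.

```python
import math


def _logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def _softplus(z):
    # log(1 + exp(z)), computed without overflow for large |z|
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def circle_loss(pos_sims, neg_sims, delta_p, delta_n, slack=0.2, gamma=32.0):
    """Binary circle loss for one anchor with explicit margins.

    `slack` (hypothetical) places the optima at O_p = delta_p + slack and
    O_n = delta_n - slack; the original paper's single margin m
    corresponds to slack = 2m.
    """
    if not pos_sims or not neg_sims:
        return 0.0
    o_p, o_n = delta_p + slack, delta_n - slack
    pos_terms = [-gamma * max(o_p - s, 0.0) * (s - delta_p) for s in pos_sims]
    neg_terms = [gamma * max(s - o_n, 0.0) * (s - delta_n) for s in neg_sims]
    return _softplus(_logsumexp(pos_terms) + _logsumexp(neg_terms))


def graded_circle_loss(sims, grades, margins, slack=0.2, gamma=32.0):
    """Assumed extension: adjacent relevance levels are separated pairwise."""
    by_grade = {0: [], 1: [], 2: []}
    for s, g in zip(sims, grades):
        by_grade[g].append(s)
    upper = circle_loss(by_grade[2], by_grade[1], margins["2p"], margins["1n"], slack, gamma)
    lower = circle_loss(by_grade[1], by_grade[0], margins["1p"], margins["0n"], slack, gamma)
    return upper + lower


# Illustrative margin values only.
MARGINS = {"2p": 0.8, "1n": 0.6, "1p": 0.4, "0n": 0.2}
separated = graded_circle_loss([0.9, 0.5, 0.1], [2, 1, 0], MARGINS)
confused = graded_circle_loss([0.1, 0.5, 0.9], [2, 1, 0], MARGINS)
```

In this sketch the loss is much smaller when similarities are ordered by grade than when grade 2 and grade 0 are reversed. The adaptive weights $\alpha$ scale each pair's contribution by how far it sits from its optimum, so badly placed pairs dominate the gradient.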

Figure 2: Violin plots demonstrate sharper similarity score distribution separability after circle loss refinement.

Offline mining is tuned to extract both hard negatives (label 0 ranking high among ANN results) and hard positives (labels 1/2 ranking low), with careful retention of original negatives to prevent catastrophic forgetting. This pipeline achieves robust geometric calibration of the semantic space, critical for downstream blending and stable ranking.
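A minimal sketch of the mining rule described above, for a single query. It assumes the ANN index has already returned `ranked_items` in descending similarity order and that `grades` holds the LLM-assigned labels. The cutoffs `top_k` and `tail_start`, the `keep_ratio` used to retain original negatives, and both function names are hypothetical.

```python
import random


def mine_hard_pairs(ranked_items, grades, top_k=10, tail_start=50):
    """Return (hard_negatives, hard_positives) for one query.

    hard negatives: grade-0 items that rank inside the top_k ANN results
    hard positives: grade-1/2 items that rank at position tail_start or later
    Items without a grade are ignored.
    """
    hard_negatives = [i for i in ranked_items[:top_k] if grades.get(i) == 0]
    hard_positives = [i for i in ranked_items[tail_start:] if grades.get(i, 0) >= 1]
    return hard_negatives, hard_positives


def mix_with_original_negatives(mined_negatives, original_negatives,
                                keep_ratio=0.5, seed=0):
    """Keep a random share of the Stage 1 negatives next to the mined ones,
    so the refined model still sees easy negatives."""
    rng = random.Random(seed)
    n_keep = int(len(original_negatives) * keep_ratio)
    return list(mined_negatives) + rng.sample(original_negatives, n_keep)


ranked = ["a", "b", "c", "d", "e", "f"]
grades = {"a": 2, "b": 0, "c": 1, "d": 0, "e": 2, "f": 1}
hard_neg, hard_pos = mine_hard_pairs(ranked, grades, top_k=3, tail_start=4)
# hard_neg == ["b"], hard_pos == ["e", "f"]
```

Re-annotating the mined pairs before training matters here: an item that ranks high but was labeled 0 by a noisy source may in fact be a substitute, which is the label-noise risk the paper attributes to naive mining.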

Empirical Evaluation and Results

Offline experiments use a "Golden Eval Set" comprising 155M query-item pairs, complemented by side-by-side evaluations on 12K queries in production settings. NDCG@10, Recall@K, and Precision@K metrics are reported. Compared to a strong pretrained encoder baseline ($>0.6$B parameters), the Siamese SupCon + Circle Loss model achieves up to $10.39\%$ NDCG@10 improvement and consistent recall/precision gains. Side-by-side evaluations yield $2.32\%$ absolute improvement in NDCG@10 over hybrid baselines.
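The reported metrics can be computed from the graded labels of a ranked result list, as in the sketch below. It uses the common exponential gain $2^{g} - 1$ with a $\log_2$ position discount, and it treats grades 1 and 2 as "relevant" for precision and recall. Both choices are assumptions; the paper's exact gain function and relevance cutoff are not stated here.

```python
import math


def dcg_at_k(grades, k=10):
    """Discounted cumulative gain over the first k ranked grades."""
    return sum((2 ** g - 1) / math.log2(rank + 2)
               for rank, g in enumerate(grades[:k]))


def ndcg_at_k(grades, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


def precision_at_k(grades, k, min_grade=1):
    """Share of the top-k results whose grade is at least min_grade."""
    top = grades[:k]
    return sum(g >= min_grade for g in top) / len(top) if top else 0.0


def recall_at_k(grades, k, total_relevant, min_grade=1):
    """Share of all relevant items for the query that appear in the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(g >= min_grade for g in grades[:k]) / total_relevant


# Grades of retrieved items, listed in ranked order for one query.
perfect = ndcg_at_k([2, 1, 0])    # ordering already ideal
reversed_ = ndcg_at_k([0, 1, 2])  # best item ranked last
```

Note that this NDCG normalizes against the retrieved list only; normalizing against all labeled items for the query is an equally common convention and gives lower scores when relevant items are missing from the results.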

Online A/B testing over one month, with a $50\%$ traffic split, demonstrates statistically significant lifts: add-to-cart rate $+2.5\%$, conversion rate $+1.1\%$, and gross order value $+0.9\%$ ($p<0.05$). These gains come from swapping only the retriever component, with downstream business logic remaining stable.
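The summary does not say which statistical test produced $p<0.05$. As one common option for rate metrics such as conversion, the sketch below runs a two-sided, pooled two-proportion z-test. The counts in the example are invented for illustration and are not the paper's data.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates.

    Returns (z, p_value, relative_lift) of arm B over arm A.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    relative_lift = (p_b - p_a) / p_a
    return z, p_value, relative_lift


# Hypothetical counts: control converts 1000 of 10000 sessions,
# treatment converts 1100 of 10000.
z, p_value, lift = two_proportion_z_test(1000, 10000, 1100, 10000)
significant = p_value < 0.05
```

Value metrics such as gross order value are not proportions and would need a different test (for example a t-test or a bootstrap over users).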

Figure 3: Violin plots for query-level average margins quantify increased separability of relevant vs. irrelevant item pairs after refinement.
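One plausible way to compute a query-level average margin like the one plotted in Figure 3 is sketched below. The definition used here (mean similarity of grade 1/2 items minus mean similarity of grade 0 items, computed per query and then averaged) is an assumption; the paper may define the margin differently.

```python
def query_margin(sims, grades):
    """Mean similarity of relevant items minus mean similarity of
    irrelevant items for one query; None if either group is empty."""
    relevant = [s for s, g in zip(sims, grades) if g >= 1]
    irrelevant = [s for s, g in zip(sims, grades) if g == 0]
    if not relevant or not irrelevant:
        return None
    return sum(relevant) / len(relevant) - sum(irrelevant) / len(irrelevant)


def average_margin(per_query):
    """Average the defined per-query margins over (sims, grades) tuples."""
    margins = [m for m in (query_margin(s, g) for s, g in per_query)
               if m is not None]
    return sum(margins) / len(margins) if margins else None


# One query with similarities 0.9, 0.5, 0.1 for grades 2, 1, 0:
# relevant mean is 0.7, irrelevant mean is 0.1.
margin = query_margin([0.9, 0.5, 0.1], [2, 1, 0])
```

A larger margin leaves more room to place a serving threshold between relevant and irrelevant items, which is the separability property the refinement stage targets.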

Ablation Studies

The paper conducts extensive ablations on encoder architecture, taxonomy enrichment, synthetic query augmentation, spelling variation strategies, and mining thresholds. Key findings:

Siamese encoders consistently outperform asymmetric dual encoders.
Two-level taxonomy enrichment strikes the best balance between improved category discrimination and added semantic noise.
Selective low-ratio synthetic query injection for items without positives, combined with catalog enrichment, boosts both retrieval metrics and similarity margin metrics.
Additive spelling augmentation yields robust improvements; regularization and full substitution degrade performance.
Circle loss is more robust to hard negatives than triplet loss, supporting aggressive mining without training instability.

Implications and Future Directions

The "Mine and Refine" pipeline advances graded relevance calibration in EBR systems under real-world policy constraints. Explicit multi-class margin enforcement and robust offline mining yield models that can reliably support hybrid blending, threshold-based serving, and rankers in large-scale, production e-commerce environments. The empirical gains in online metrics underscore practical business impact.

The methodology is extensible to other domains requiring nuanced retrieval and semantic score calibration (e.g., marketplace, recommendation). Future directions include automating further stages of hard positive augmentation, integrating multimodal signals, and leveraging dynamic LLM-powered relevance annotation with continual feedback.

Conclusion

This paper introduces a structured, two-stage optimization framework for graded relevance in semantic search. The integration of LLM-driven scalable labeling, robust supervised contrastive training, and multi-class circle loss refinement enables stable, high-performing retrieval embeddings. The approach exhibits measurable improvements in both relevance metrics and business objectives, setting a new practical standard for policy-aligned semantic retrieval in e-commerce.
