- The paper presents a two-stage semi-supervised synthetic data generation pipeline (SSRA) that enhances short video search relevance using refined query annotations.
- It employs a four-level relevance control and iterative synthesis process with a score model to address query redundancy and enhance domain-specific data diversity.
- Empirical results show notable improvements, including a 1.73% nDCG@10 gain in retrieval and a 2.80% increase in average precision in pair classification, supporting its practical deployment.
Semi-Supervised Synthetic Data Generation with Fine-Grained Relevance Control for Short Video Search Relevance Modeling
Introduction
The paper focuses on enhancing embedding models for short video search by introducing a novel semi-supervised synthetic data generation method with fine-grained relevance control. Large language models (LLMs) have demonstrated potential for improving the diversity and quality of training data. However, existing prompt-based synthesis methods struggle to match domain-specific data distributions, especially where fine-grained relevance diversity is crucial but overlooked. This research introduces a structured, two-stage Semi-Supervised Relevance-Aware (SSRA) pipeline designed to generate Chinese short video datasets with four-level relevance annotations, improving the semantic richness of training data and downstream model performance.
Semi-Supervised Relevance-Aware Pipeline
The SSRA pipeline is divided into two stages:
- Stage 1: Enhancing Query Diversity
- Stage 2: Refining Relevance Alignment
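The two stages can be sketched as a single synthesis round: a query model first diversifies seed queries, then a score model filters the generated query-document pairs so that only those matching a target relevance level are kept. The function and class names below (`query_model`, `score_model`, `ssra_round`) are hypothetical stand-ins for the paper's trained components, shown only to illustrate the control flow.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    query: str
    doc: str
    level: int  # four-level relevance label, e.g. 0 (irrelevant) to 3 (exact)

def query_model(seed_query: str) -> list[str]:
    # Stage 1 (stand-in): an LLM-based query model would propose
    # diverse rewrites of a real seed query here.
    return [seed_query + " tutorial", seed_query + " review"]

def score_model(query: str, doc: str) -> int:
    # Stage 2 (stand-in): a trained score model would assign a
    # four-level relevance label to each candidate pair.
    return 3 if query.split()[0] in doc else 0

def ssra_round(seed_queries: list[str], docs: list[str],
               target_level: int) -> list[Pair]:
    """One synthesis round: diversify queries, then keep only pairs
    whose scored relevance matches the requested level."""
    kept = []
    for seed in seed_queries:
        for q in query_model(seed):
            for d in docs:
                lvl = score_model(q, d)
                if lvl == target_level:
                    kept.append(Pair(q, d, lvl))
    return kept
```

Iterating this round with different `target_level` values is one plausible way to obtain the fine-grained relevance control the pipeline describes; the paper's actual models and iteration schedule are more involved.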
Methodology
Data Construction: Combining query-driven retrieval and click-based sampling strategies, query-item pairs are aggregated from real user interactions on a major Chinese short video platform. These items, inherently multi-modal, are transformed into textual representations using LLMs to form query-document pairs.
Relevance Annotation: The dataset is annotated for relevance on a four-point scale. Dual annotation with adjudication ensures consistency and quality in capturing nuanced semantic relevance.
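A dual-annotation-with-adjudication protocol typically keeps a label when both annotators agree and escalates disagreements to a third party. The sketch below illustrates that logic; the `adjudicate` function and its tie-breaking fallback are assumptions for illustration, not the paper's exact protocol.

```python
from typing import Callable, Optional

def adjudicate(label_a: int, label_b: int,
               adjudicator: Optional[Callable[[int, int], int]] = None) -> int:
    """Resolve a dual-annotated four-level relevance label (0-3).

    If the two annotators agree, keep their label; otherwise defer
    to an adjudicator. The conservative min() fallback is a
    hypothetical choice, not taken from the paper.
    """
    if label_a == label_b:
        return label_a
    if adjudicator is not None:
        return adjudicator(label_a, label_b)
    return min(label_a, label_b)
```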
SSRA Framework Implementation:
- The query model and score model are trained with detailed architectures, and progressive refinement emphasizes exact relevance alignment while preserving real-world query distributions.
- Combining diverse training data with relevance-controlled synthetic queries significantly boosts empirical performance across several metrics.
Experimentation and Analysis
Experiments were conducted on two newly devised benchmarks: a retrieval test set and a pairwise classification test set, both annotated with four-level relevance labels. Noteworthy findings include:
- The Qwen3-Embedding model, when trained with SSRA-generated data, outperformed all baselines, with a 1.73% nDCG@10 improvement in retrieval and a 2.80% average precision gain in pair classification.
- SSRA showed strong improvements over prompt-based synthesis and vanilla supervised fine-tuning baselines in generating domain-relevant synthetic data.
Ablation studies showed that both the diversity and the precision enhancements introduced by SSRA are pivotal for model generalization, further evidenced by a reduction in query redundancy.
Practical Implications and Online Deployment
The SSRA-trained embedding model was deployed on a large online platform and evaluated through A/B tests in Douyin's dual-column scenario, where it increased click-through rate (CTR) by 1.45%. This demonstrates substantial gains in personalization from fine-grained relevance supervision.
Conclusion
This research contributes a domain-specific, relevance-sensitive dataset and an end-to-end synthesis pipeline, empirically demonstrating the value of fine-grained semantic relevance for embedding model performance. The SSRA framework is a promising approach to aligning synthetic data with specific domain distributions, ensuring efficacy in real-world applications. Future work may explore task-specific loss function adaptations to further tune embedding models toward distinct relevance diversity objectives.