
Semi-Supervised Synthetic Data Generation with Fine-Grained Relevance Control for Short Video Search Relevance Modeling

Published 20 Sep 2025 in cs.CL | (2509.16717v1)

Abstract: Synthetic data is widely adopted in embedding models to ensure diversity in training data distributions across dimensions such as difficulty, length, and language. However, existing prompt-based synthesis methods struggle to capture domain-specific data distributions, particularly in data-scarce domains, and often overlook fine-grained relevance diversity. In this paper, we present a Chinese short video dataset with 4-level relevance annotations, filling a critical resource void. Further, we propose a semi-supervised synthetic data pipeline where two collaboratively trained models generate domain-adaptive short video data with controllable relevance labels. Our method enhances relevance-level diversity by synthesizing samples for underrepresented intermediate relevance labels, resulting in a more balanced and semantically rich training dataset. Extensive offline experiments show that the embedding model trained on our synthesized data outperforms those using data generated based on prompting or vanilla supervised fine-tuning (SFT). Moreover, we demonstrate that incorporating more diverse fine-grained relevance levels in training data enhances the model's sensitivity to subtle semantic distinctions, highlighting the value of fine-grained relevance supervision in embedding learning. In the search-enhanced recommendation pipeline of Douyin's dual-column scenario, through online A/B testing, the proposed model increased click-through rate (CTR) by 1.45%, raised the Strong Relevance Ratio (SRR) by 4.9%, and improved the Image User Penetration Rate (IUPR) by 0.1054%.

Summary

  • The paper presents a two-stage semi-supervised synthetic data generation pipeline (SSRA) that enhances short video search relevance using refined query annotations.
  • It employs a four-level relevance control and iterative synthesis process with a score model to address query redundancy and enhance domain-specific data diversity.
  • Empirical results demonstrate notable improvements, including a 1.73% nDCG@10 gain and a 2.80% increase in average precision, validating its practical deployment.

Introduction

The paper aims to improve embedding models for short video search by introducing a semi-supervised synthetic data generation method with fine-grained relevance control. LLM-generated synthetic data is widely used to diversify training data, but existing prompt-based synthesis methods struggle to capture domain-specific data distributions, particularly in data-scarce domains, and often overlook fine-grained relevance diversity. The paper makes two contributions: a Chinese short video dataset with four-level relevance annotations, and a two-stage Semi-Supervised Relevance-Aware (SSRA) pipeline that generates domain-adaptive query-document data with controllable relevance labels, yielding a more balanced and semantically rich training set.

Semi-Supervised Relevance-Aware Pipeline

The SSRA pipeline is divided into two stages:

  1. Stage 1: Enhancing Query Diversity
    • This stage tackles the issue of reduced query diversity due to repeated high-frequency queries linked to various documents.
    • A score-based re-annotation strategy associates each document with multiple queries of varying relevance levels, with the levels assigned by a tuned score model.

      Figure 1: The overview of the proposed two-stage Semi-Supervised Relevance-Aware data synthesis (SSRA) pipeline.

  2. Stage 2: Refining Relevance Alignment
    • This stage refines the precision of relevance-conditioned query generation.
    • An iterative synthesis process generates queries conditioned on specific relevance labels; a score model then filters them so that the retained queries match the target relevance level.

      Figure 2: Illustration of our score-based re-annotation strategy in Stage 1.
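The two stages can be sketched as a pair of small functions. This is a minimal illustration, not the authors' implementation: the integer encoding of the four relevance levels, the names `reannotate` and `synthesize`, the stopping rule, and the toy word-overlap scorer and toy generator are all assumptions introduced here to make the control flow runnable. In the paper both the generator and the scorer are trained models.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Assumption: the four relevance levels are encoded as integers 0..3.
ScoreFn = Callable[[str, str], int]            # (query, doc) -> level
GenerateFn = Callable[[str, int, int], List[str]]  # (doc, level, round) -> queries


def reannotate(doc_to_candidates: Dict[str, List[str]],
               score_fn: ScoreFn) -> Dict[str, Dict[int, List[str]]]:
    """Stage 1 sketch: let a score model label every candidate query of a
    document, so one document ends up with queries at several levels."""
    labeled = {}
    for doc, queries in doc_to_candidates.items():
        by_level = defaultdict(list)
        for query in queries:
            by_level[score_fn(query, doc)].append(query)
        labeled[doc] = dict(by_level)
    return labeled


def synthesize(doc: str, target_level: int, generate_fn: GenerateFn,
               score_fn: ScoreFn, n_wanted: int = 2,
               max_rounds: int = 5) -> List[str]:
    """Stage 2 sketch: generate queries for a target level and keep only
    those the score model assigns to that same level."""
    kept: List[str] = []
    for round_idx in range(max_rounds):
        for query in generate_fn(doc, target_level, round_idx):
            if query not in kept and score_fn(query, doc) == target_level:
                kept.append(query)
        if len(kept) >= n_wanted:
            break
    return kept[:n_wanted]


# Toy stand-ins (hypothetical): word overlap as "relevance", capped at 3.
def toy_score(query: str, doc: str) -> int:
    return min(len(set(query.split()) & set(doc.split())), 3)


# Deliberately noisy generator: one candidate overshoots the target level.
def toy_generate(doc: str, level: int, round_idx: int) -> List[str]:
    words = doc.split()
    return [" ".join(words[:k] + ["vlog", str(round_idx)])
            for k in (level + 1, level)]


doc = "street food tour in chengdu"
stage1 = reannotate({doc: ["chengdu street food tour", "chengdu travel",
                           "cat videos"]}, toy_score)
stage2 = synthesize(doc, 2, toy_generate, toy_score)
print(stage1[doc])
print(stage2)
```

The filter in `synthesize` is the point of the sketch: a generated query survives only when the scorer agrees with the label it was conditioned on, which is how intermediate relevance levels can be populated without accepting mislabeled samples.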

Methodology

Data Construction: Utilizing a combination of query-driven retrieval and click-based sampling strategies, query-item pairs are aggregated from real user interactions on a major Chinese short video platform. These items, inherently multi-modal, are transformed into textual representations using LLMs to facilitate query-document pair formation.

Relevance Annotation: The dataset is meticulously annotated for relevance on a 4-point scale. Dual-annotation with adjudication ensures consistency and quality in capturing nuanced semantic relevance.
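The summary does not describe how adjudication is carried out. One common arrangement, assumed here purely for illustration, is that two annotators label each pair independently and a third party resolves only the disagreements. The function name `adjudicate_labels` and the `adjudicator` callback are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def adjudicate_labels(pairs: Sequence[Tuple[int, int]],
                      adjudicator: Callable[[int, int], int]
                      ) -> Tuple[List[int], float]:
    """Merge two annotators' 4-level labels into one label per item.

    Matching labels are accepted as-is; mismatches are passed to the
    adjudicator. Also returns the raw agreement rate between annotators.
    """
    final: List[int] = []
    disagreements = 0
    for label_a, label_b in pairs:
        if label_a == label_b:
            final.append(label_a)
        else:
            disagreements += 1
            final.append(adjudicator(label_a, label_b))
    agreement = 1 - disagreements / len(pairs) if pairs else 1.0
    return final, agreement


# Hypothetical adjudicator that always picks the more conservative label.
labels, agreement = adjudicate_labels(
    [(3, 3), (2, 1), (0, 0), (1, 3)], adjudicator=min)
print(labels, agreement)
```

Tracking the agreement rate alongside the merged labels gives a quick signal of how ambiguous the intermediate relevance levels are for human annotators.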

SSRA Framework Implementation:

  • The query model and the score model are trained collaboratively, with progressive refinement that tightens relevance alignment while preserving the real-world query distribution.
  • Combining diverse training data with relevance-controlled synthetic queries improves both the retrieval and the pair classification metrics reported in the experiments.

Experimentation and Analysis

Experiments were conducted on two newly devised benchmarks: a retrieval test set and a pairwise classification test set, both annotated with four-level relevance labels. Noteworthy findings include:

  • The Qwen3-Embedding model, when trained with SSRA-generated data, surpassed others with a 1.73% nDCG@10 improvement in retrieval tasks and a 2.80% average precision gain in pair classification.
  • SSRA-generated data also yielded clear improvements over data produced by prompt-based synthesis and by vanilla supervised fine-tuning (SFT) baselines.
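For readers unfamiliar with the retrieval metric, nDCG@10 over four-level labels can be computed as below. This is a generic sketch, not the paper's evaluation code: it assumes linear gain (the label itself) rather than the exponential variant, and it computes the ideal ranking from the labels of the returned list only, which is exact only when every relevant document appears in that list.

```python
import math
from typing import Sequence


def dcg_at_k(labels: Sequence[int], k: int = 10) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(labels[:k], start=1))


def ndcg_at_k(ranked_labels: Sequence[int], k: int = 10) -> float:
    """nDCG@k for one query; ranked_labels are graded labels (0..3)
    in the order the model returned the documents."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0


print(ndcg_at_k([3, 2, 1, 0]))  # already in ideal order
print(ndcg_at_k([0, 3]))        # best document ranked second
```

Because the gain is graded, swapping a level-3 document with a level-2 document lowers the score less than swapping it with a level-0 document, which is why the metric is sensitive to the fine-grained distinctions the paper targets.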

Ablation studies highlighted that both diversity and precision enhancements through SSRA are pivotal for improved model generalization, further validated by a reduction in query redundancy.

Practical Implications and Online Deployment

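The SSRA-trained embedding model was deployed in the search-enhanced recommendation pipeline of Douyin's dual-column scenario and evaluated through online A/B testing. It increased click-through rate (CTR) by 1.45%, raised the Strong Relevance Ratio (SRR) by 4.9%, and improved the Image User Penetration Rate (IUPR) by 0.1054%. These results suggest that the offline gains from fine-grained relevance supervision carry over to production traffic.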

Conclusion

This research contributes a domain-specific dataset with four-level relevance annotations and a semi-supervised synthesis pipeline, and it shows empirically that fine-grained relevance supervision improves embedding model performance. The SSRA framework is a promising approach for aligning synthetic data with a specific domain's distribution, with gains observed both offline and in online deployment. Future work may adapt task-specific loss functions to make better use of multi-level relevance labels when fine-tuning embedding models.
