- The paper proposes a differentially private fine-tuning approach to generate synthetic queries that boost retrieval performance while protecting user privacy.
- It employs DP-Adafactor to fine-tune an LLM on a conditional query generation task, then trains dual encoder models on the synthetic queries using an in-batch softmax loss.
- Empirical results on MSMARCO and BEIR benchmarks show significant improvements in NDCG@10 and Recall@10 compared to direct DP-training.
Synthetic Query Generation for Privacy-Preserving Deep Retrieval Systems using Differentially Private LLMs
The paper "Synthetic Query Generation for Privacy-Preserving Deep Retrieval Systems using Differentially Private LLMs" presents a structured methodology for training deep retrieval systems with differential privacy guarantees. The motivation stems from the fact that deep retrieval systems, widely used in online services, can inadvertently compromise user privacy because input queries may contain personal information. To address this, the paper introduces an approach that uses differentially private (DP) LLMs to generate synthetic queries, preserving privacy while maintaining retrieval quality.
Introduction and Problem Statement
Deep retrieval systems map user queries to relevant responses or recommendations by learning semantic representations. Typically, these systems are trained with contrastive-style loss functions that are not per-example decomposable, which complicates direct DP training because standard methods such as DP-SGD require per-example gradients. Since neural retrieval models can memorize and expose sensitive user data, privacy mechanisms are especially important in this setting. This paper circumvents the limitations of direct DP training by using DP LMs to produce representative synthetic queries, moving the privacy mechanism upstream of the dual encoder retrieval model's training.
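To make the per-example-gradient requirement concrete, here is a minimal sketch of a generic DP-SGD-style update (not the paper's exact algorithm): each example's gradient is clipped individually before noise is added. The function name and hyperparameters are illustrative. The docstring notes why in-batch contrastive losses break the assumption this update relies on.

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, clip_norm=1.0, noise_mult=1.0, lr=0.1):
    """One DP-SGD-style step: clip each example's gradient, add Gaussian noise.

    This only works when the loss decomposes per example, so that
    per_example_grads[i] depends on example i alone. In-batch contrastive
    losses violate this assumption: each example's loss (and hence its
    gradient) depends on every other example in the batch.
    """
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    clipped = np.stack(clipped)
    # Gaussian noise scaled to the clipping norm masks any single example.
    noise = np.random.normal(0.0, noise_mult * clip_norm, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)
    return w - lr * noisy_mean
```

With `noise_mult=0` the step reduces to plain clipped-gradient SGD, which is a convenient way to sanity-check the clipping logic.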
Methodology
Differentially Private Fine-Tuning
The proposed method begins with DP fine-tuning of an LLM on a conditional query generation task: given an item's document, the model learns to generate a semantically coherent query for it. Optimization uses a privatized optimizer such as DP-Adafactor, so the privacy guarantee for user queries is established during the LM's training itself.
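A useful contrast with the contrastive-loss problem above: the conditional generation objective *is* per-example decomposable, which is what makes DP fine-tuning applicable. The sketch below shows one plausible way to format (document, query) pairs for such fine-tuning; the prompt template and field names are assumptions, not the paper's exact format.

```python
def make_example(document, query):
    """Format one (document, query) pair for conditional query generation.

    The prompt template here is illustrative. The key property is that the
    resulting LM loss decomposes per example, so a privatized optimizer
    (e.g., DP-Adafactor) can clip and noise per-example gradients directly.
    """
    return {"inputs": f"Generate a query: {document}", "targets": query}

def make_dataset(doc_query_pairs):
    """Build the conditional-generation fine-tuning set from raw pairs."""
    return [make_example(d, q) for d, q in doc_query_pairs]
```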
Synthetic Query Generation
Once the LM is fine-tuned, synthetic queries are generated by prompting it with document contexts. Although synthetic, these queries serve as practical substitutes for the original user data in subsequent training stages. Because differential privacy is closed under post-processing, retrieval systems can then be trained conventionally on this data, free of the constraints DP methods impose.
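The generation stage might look like the following sketch, where `generate` stands in for the DP fine-tuned LM's sampling call (a hypothetical signature, not a real library API). Because the LM was trained with DP, everything derived from its outputs is post-processing and consumes no additional privacy budget.

```python
def synthesize_queries(documents, generate, n_per_doc=3):
    """Produce (synthetic query, document) pairs for retrieval training.

    `generate` is a stand-in for the DP fine-tuned LM's sampling function;
    the prompt template mirrors the one used during fine-tuning. Sampling
    several queries per document adds diversity to the synthetic set.
    """
    pairs = []
    for doc in documents:
        for _ in range(n_per_doc):
            query = generate(f"Generate a query: {doc}")
            pairs.append((query, doc))
    return pairs
```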
Dual Encoder Training
Finally, a dual encoder model is trained conventionally on the synthetic data using an in-batch softmax loss. Query-level privacy is inherited from the DP-generated synthetic data, while retrieval performance benefits because the synthetic queries preserve much of the original dataset's diversity.
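A minimal NumPy sketch of the in-batch softmax loss: each query is scored against every document in the batch, with its paired document as the positive and the rest as negatives. This also makes the earlier point visible in code: the loss for row `i` depends on the entire batch, which is exactly what blocks per-example DP gradient clipping.

```python
import numpy as np

def in_batch_softmax_loss(query_emb, doc_emb):
    """In-batch softmax loss for a dual encoder.

    Row i of each matrix is the embedding of query i and of its paired
    document; the other documents in the batch act as negatives. Note
    that each example's loss depends on the whole batch, so this loss is
    not per-example decomposable.
    """
    logits = query_emb @ doc_emb.T                    # [B, B] similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                 # positives on diagonal
```

With all-zero embeddings the scores are uniform and the loss equals log(batch size); with well-separated paired embeddings it approaches zero.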
Implementation and Evaluation
Experimental Setup
The methodology was evaluated on publicly accessible datasets, namely MSMARCO and datasets from the BEIR benchmark suite. Models of varying sizes from the T5 family were DP-trained to generate synthetic queries under different privacy guarantees (ϵ settings from 3 to 16). The results affirm that training retrieval models on synthetic data outperforms direct DP-training on the original data: models trained on synthetic queries significantly improved NDCG@10 and Recall@10 compared with directly DP-trained models.
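For reference, the two reported metrics can be computed as below. This uses the linear-gain variant of DCG (definitions vary; some evaluations use an exponential gain of 2^rel − 1), so treat it as an illustrative sketch rather than the paper's exact evaluation code.

```python
import numpy as np

def ndcg_at_k(ranked_rels, k=10):
    """NDCG@k for one query, using linear gain rel_i / log2(i + 1).

    `ranked_rels` holds graded relevance labels in the order the system
    ranked the documents.
    """
    rels = np.asarray(ranked_rels, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rels.size + 2))
    dcg = (rels * discounts).sum()
    ideal = np.sort(np.asarray(ranked_rels, dtype=float))[::-1][:k]
    idcg = (ideal * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_rels, total_relevant, k=10):
    """Fraction of all relevant documents retrieved in the top k."""
    hits = sum(1 for r in ranked_rels[:k] if r > 0)
    return hits / total_relevant if total_relevant else 0.0
```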
Zero-shot Generalization
Models trained on synthetic data also displayed strong zero-shot generalization. When evaluated on datasets from diverse knowledge domains, they nearly matched, and sometimes surpassed, models trained on non-private data.
Practical Implications and Challenges
This approach establishes a new paradigm for training privacy-preserving ML models by integrating synthetic data generation into the privacy framework. It demonstrates how LMs can enable retrieval systems that safeguard personal data while maintaining model performance. Challenges remain, however, such as the computational cost of training large LMs under DP constraints and ensuring the fidelity of the synthetic data. Moreover, the private training data must have minimal overlap with the public pretraining data of the LM for the privacy guarantees to hold.
Conclusion
The paper illustrates that synthetic data from DP LMs is a viable alternative to direct DP-training for enhancing privacy in deep retrieval systems. It shows that synthetic data can ensure privacy while remaining beneficial for model utility, potentially guiding future work toward LM-based, privacy-centric solutions in information retrieval. Further exploration of parameter-efficient DP fine-tuning and continued improvement of LM architectures could strengthen these findings, paving the way for broader adoption of privacy-preserving machine learning in practice.