A Unified Framework for Quantifying Privacy Risk in Synthetic Data

Published 18 Nov 2022 in cs.CR | (2211.10459v1)

Abstract: Synthetic data is often presented as a method for sharing sensitive information in a privacy-preserving manner by reproducing the global statistical properties of the original data without disclosing sensitive information about any individual. In practice, as with other anonymization methods, privacy risks cannot be entirely eliminated. The residual privacy risks need instead to be ex-post assessed. We present Anonymeter, a statistical framework to jointly quantify different types of privacy risks in synthetic tabular datasets. We equip this framework with attack-based evaluations for the singling out, linkability, and inference risks, the three key indicators of factual anonymization according to the European General Data Protection Regulation (GDPR). To the best of our knowledge, we are the first to introduce a coherent and legally aligned evaluation of these three privacy risks for synthetic data, and to design privacy attacks which model directly the singling out and linkability risks. We demonstrate the effectiveness of our methods by conducting an extensive set of experiments that measure the privacy risks of data with deliberately inserted privacy leakages, and of synthetic data generated with and without differential privacy. Our results highlight that the three privacy risks reported by our framework scale linearly with the amount of privacy leakage in the data. Furthermore, we observe that synthetic data exhibits the lowest vulnerability against linkability, indicating one-to-one relationships between real and synthetic data records are not preserved. Finally, we demonstrate quantitatively that Anonymeter outperforms existing synthetic data privacy evaluation frameworks both in terms of detecting privacy leaks, as well as computation speed. To contribute to a privacy-conscious usage of synthetic data, we open source Anonymeter at https://github.com/statice/anonymeter.




Summary

The paper introduces Anonymeter, a statistical framework that quantifies three privacy risks in synthetic tabular data: singling out, linkability, and inference.
It follows a three-phase methodology (attack, evaluation, and risk quantification) that turns attack success rates into risk estimates.
Results indicate that the reported risks scale linearly with the amount of privacy leakage in the data, and that synthetic data generated with differential privacy shows reduced risks.


Introduction

The paper "A Unified Framework for Quantifying Privacy Risk in Synthetic Data" (2211.10459) introduces Anonymeter, a statistical framework for quantifying privacy risks in synthetic tabular datasets. It provides a coherent evaluation of singling out, linkability, and inference risks, the three key indicators of factual anonymization under the GDPR, so the assessment is relevant from both a practical and a legal perspective.

Methodology

Framework Architecture

The Anonymeter framework operates through a structured three-phase process:

Attack Phase: The attacker generates guesses about records in the original dataset. Three attacks are run: the main attack, which uses the synthetic data to target records from the training data; a naive baseline attack, which guesses without using the synthetic data; and a control attack, which uses the synthetic data to target records from a separate control dataset that was not involved in training.
Evaluation Phase: The generated guesses are evaluated against the training and control datasets to determine their correctness, providing a measure of each attack's effectiveness.
Risk Quantification Phase: The success rates of the attacks are converted into risk estimates with statistical uncertainties. The excess success of the main attack over the control attack indicates leakage that is specific to the training records and therefore a potential privacy breach.
Figure 1: Schematic overview of our framework showcasing the attack, evaluation, and risk estimation phases.
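
The risk quantification phase can be illustrated with a minimal Python sketch. It is not the Anonymeter source code. It assumes that success rates are estimated with a Wilson score interval and that the risk is the excess of the main attack's success rate over the control attack's, normalized by the control attack's failure rate; the function names `wilson_rate` and `excess_risk`, the 95% confidence level, and the example counts are illustrative assumptions.

```python
import math
from typing import Tuple


def wilson_rate(n_success: int, n_total: int, z: float = 1.96) -> Tuple[float, float]:
    """Return the centre and half-width of the Wilson score interval.

    z = 1.96 corresponds to a 95% confidence level (an assumption of this sketch).
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p_hat = n_success / n_total
    denom = 1.0 + z**2 / n_total
    centre = (p_hat + z**2 / (2 * n_total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n_total + z**2 / (4 * n_total**2)
    )
    return centre, half_width


def excess_risk(main_rate: float, control_rate: float) -> float:
    """Normalized excess success of the main attack over the control attack.

    Assumed form: (main - control) / (1 - control), clipped at zero so that
    a main attack no better than the control attack reports no risk.
    """
    if control_rate >= 1.0:
        return 0.0
    return max(0.0, (main_rate - control_rate) / (1.0 - control_rate))


# Hypothetical counts: 200 guesses per attack, 60 correct against the
# training data and 20 correct against the control data.
main_rate, main_err = wilson_rate(n_success=60, n_total=200)
control_rate, control_err = wilson_rate(n_success=20, n_total=200)
risk = excess_risk(main_rate, control_rate)
print(f"main={main_rate:.3f}±{main_err:.3f} "
      f"control={control_rate:.3f}±{control_err:.3f} risk={risk:.3f}")
```

Subtracting the control rate is what separates leakage about training records from patterns that hold for the whole population: an attacker who is equally successful against records the generator never saw has learned nothing specific to the training data.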

Attacks Implemented

The framework implements three specific attacks:

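Singling Out: Measures the risk that an attacker can isolate the records of a single individual within a dataset, for example with a combination of attribute values that matches exactly one record.
Linkability: Measures the risk that an attacker can link two or more records, held in separate datasets, that belong to the same individual.
Inference: Measures the risk that an attacker can deduce the value of a sensitive attribute of an individual from partial knowledge of that individual's other attributes.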
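
To make the success criteria concrete, the following Python sketch shows a toy singling out check and a toy inference attack. It is an illustration under stated assumptions, not the attacks implemented in the paper: the helper names, the toy records, and the use of a mismatch count over the known columns as the distance are all hypothetical. Linkability is omitted for brevity.

```python
from typing import Callable, Dict, Sequence

Record = Dict[str, object]


def singles_out(predicate: Callable[[Record], bool], dataset: Sequence[Record]) -> bool:
    """A singling out guess succeeds if the predicate matches exactly one record."""
    return sum(1 for row in dataset if predicate(row)) == 1


def nearest_synthetic(target: Record, synthetic: Sequence[Record],
                      aux_cols: Sequence[str]) -> Record:
    """Return the synthetic record that disagrees with the target on the
    fewest of the columns known to the attacker (ties: first one wins)."""
    return min(synthetic, key=lambda row: sum(row[c] != target[c] for c in aux_cols))


def inference_success_rate(targets: Sequence[Record], synthetic: Sequence[Record],
                           aux_cols: Sequence[str], secret: str) -> float:
    """Fraction of targets whose secret equals that of their nearest synthetic record."""
    hits = sum(
        nearest_synthetic(t, synthetic, aux_cols)[secret] == t[secret]
        for t in targets
    )
    return hits / len(targets)


# Toy data, invented for this example only.
original = [
    {"age": 34, "zip": "10115", "diagnosis": "A"},
    {"age": 51, "zip": "10117", "diagnosis": "B"},
    {"age": 34, "zip": "10119", "diagnosis": "C"},
]
synthetic = [
    {"age": 34, "zip": "10115", "diagnosis": "A"},
    {"age": 50, "zip": "10117", "diagnosis": "B"},
]

unique_guess = singles_out(lambda r: r["age"] == 51, original)
ambiguous_guess = singles_out(lambda r: r["age"] == 34, original)
rate = inference_success_rate(original, synthetic, ["age", "zip"], "diagnosis")
print(unique_guess, ambiguous_guess, rate)
```

Running the same `inference_success_rate` call with control records as targets gives the control attack's success rate, which is the reference point for judging whether the main attack's success reflects leakage about training records.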

Results

Anonymeter was applied to multiple datasets, including data with deliberately inserted privacy leakage. In those experiments, the three risks reported by the framework scale linearly with the amount of leakage in the data. Figure 2 shows the estimated risks as a function of the attacker's auxiliary knowledge. Synthetic data is least vulnerable to linkability, which indicates that one-to-one relationships between real and synthetic records are not preserved. Synthetic data generated with differential privacy (DP) guarantees showed reduced privacy risks.

Figure 2: Estimated privacy risks across datasets as a function of attacker's auxiliary knowledge.

Discussion

The paper underlines the advantages of Anonymeter in providing a nuanced, legally aligned privacy risk assessment. Because the main attack is compared against a control attack, the framework separates what an attacker learns from general patterns in the data (utility) from what is specific to the training records (privacy violations), so the reported risks reflect the relationship between the synthetic data and the original dataset. Its modular design also allows for future extensions, including more complex or novel attacks.

Conclusion

Anonymeter advances the evaluation of synthetic data privacy by quantifying singling out, linkability, and inference risks in alignment with regulatory requirements. Its open-source availability at https://github.com/statice/anonymeter encourages broader adoption and further refinement, contributing to the responsible deployment of synthetic data technologies.

In summary, the framework offers a comprehensive, empirically validated approach to measuring privacy risks, which helps practitioners manage and mitigate them in practical applications.