Papers
Topics
Authors
Recent
Search
2000 character limit reached

Generating Synthetic Data with Locally Estimated Distributions for Disclosure Control

Published 3 Oct 2022 in stat.CO and stat.ML | (2210.00884v2)

Abstract: Sensitive datasets are often underutilized in research and industry due to privacy concerns, limiting the potential of valuable data-driven insights. Synthetic data generation presents a promising solution to address this challenge by balancing privacy protection with data utility. This paper introduces a new approach to mitigate privacy risks associated with outlier observations in synthetic datasets: the Local Resampler (LR). The LR leverages the $k$-nearest neighbors algorithm to generate synthetic data while minimizing disclosure risks by underrepresenting outliers, even when they are not detectable in marginal distributions. Theoretical and empirical analyses demonstrate that the LR effectively mitigates outlier-driven disclosure risks, and accurately replicates multimodal, skewed, and non-convex support distributions. The semiparametric nature of the LR ensures a low computational burden and works efficiently even with small samples. By parameterizing the balance between privacy risks and data utility, this approach promotes broader access to sensitive datasets for research.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (1)

Collections

Sign up for free to add this paper to one or more collections.