
Tunable correlation retention: A statistical method for generating synthetic data

Published 3 Mar 2024 in cs.LG, math.PR, and physics.data-an | (2403.01471v3)

Abstract: We propose a method to generate statistically representative synthetic data from a given dataset. The main goal is for the generated dataset to mimic the inter-feature correlations present in the original data, while also offering a tunable parameter to influence the privacy level. In particular, our method constructs a statistical map from the empirical conditional distributions between the features of the original dataset. Part of the tunability is achieved by limiting the depth of the conditional distributions used. We describe in detail the algorithms for constructing the statistical map and for using it to generate synthetic observations. The approach is tested in three ways: on a hand-calculated example, on a manufactured dataset, and on a real-world energy-related dataset of household consumption and production on Madeira Island. We evaluate the method by comparing the Pearson correlation matrices of the original and synthetic datasets at different levels of resolution and depths of correlation. These two quantities act as tunable parameters that influence the fidelity and privacy of the resulting datasets. The proposed methodology is general in the sense that it does not rely on the test dataset used. We expect it to be applicable in a much broader context than indicated here.
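The idea of a depth-limited map of empirical conditional distributions can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names `fit_map` and `sample`, the equal-width binning controlled by `resolution`, the fixed left-to-right feature ordering, and the uniform draw inside each bin are all assumptions made here for illustration. Each feature is sampled conditionally on the bins of at most `depth` preceding features.

```python
import numpy as np

def fit_map(data, resolution, depth):
    """Hypothetical 'statistical map': per-feature bin edges plus count tables
    of each feature's bin given the bins of up to `depth` preceding features."""
    n, d = data.shape
    edges = [np.linspace(data[:, j].min(), data[:, j].max(), resolution + 1)
             for j in range(d)]
    # Digitizing against the interior edges yields bin indices 0..resolution-1.
    binned = np.column_stack(
        [np.digitize(data[:, j], edges[j][1:-1]) for j in range(d)]
    )
    tables = []
    for j in range(d):
        parents = list(range(max(0, j - depth), j))
        table = {}
        for row in binned:
            key = tuple(int(row[p]) for p in parents)
            counts = table.setdefault(key, np.zeros(resolution))
            counts[row[j]] += 1
        tables.append((parents, table))
    return edges, tables

def sample(edges, tables, n_samples, rng):
    """Draw synthetic rows feature by feature from the empirical conditionals."""
    d = len(tables)
    resolution = len(edges[0]) - 1
    out = np.empty((n_samples, d))
    for i in range(n_samples):
        bins = []
        for j, (parents, table) in enumerate(tables):
            counts = table[tuple(bins[p] for p in parents)]
            b = int(rng.choice(resolution, p=counts / counts.sum()))
            bins.append(b)
            # Assumption: a value is drawn uniformly inside the chosen bin.
            out[i, j] = rng.uniform(edges[j][b], edges[j][b + 1])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=2000)
    original = np.column_stack([x, x + 0.3 * rng.normal(size=2000)])
    for depth in (0, 1):
        edges, tables = fit_map(original, resolution=10, depth=depth)
        synthetic = sample(edges, tables, 2000, rng)
        # Compare Pearson correlation matrices of original and synthetic data.
        gap = np.abs(np.corrcoef(original.T) - np.corrcoef(synthetic.T)).max()
        print(f"depth={depth}: max correlation gap = {gap:.3f}")
```

In this sketch, `depth=0` samples every feature from its marginal distribution and so discards inter-feature correlations, while larger `depth` and `resolution` values retain more of the original structure. That is one way to picture the fidelity versus privacy trade-off the abstract describes.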
