
A supervised generative optimization approach for tabular data

Published 10 Sep 2023 in cs.LG | (2309.05079v2)

Abstract: Synthetic data generation has emerged as a crucial topic for financial institutions, driven by multiple factors such as privacy protection and data augmentation. Many algorithms have been proposed for synthetic data generation, but reaching a consensus on which method to use for a specific data set and use case remains challenging. Moreover, the majority of existing approaches are "unsupervised" in the sense that they do not take the downstream task into account. To address these issues, this work presents a novel synthetic data generation framework. The framework integrates a supervised component tailored to the specific downstream task and employs a meta-learning approach to learn the optimal mixture distribution of existing synthetic distributions.
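To make the idea of a supervised mixture over existing synthetic distributions concrete, here is a minimal sketch in Python. It is not the paper's implementation. The abstract does not name the optimizer or the downstream model, so this sketch assumes plain random search over the weight simplex and a nearest-centroid classifier as stand-ins. All function names (`make_blobs`, `sample_mixture`, `search_mixture_weights`) and the toy data are hypothetical.

```python
import numpy as np


def make_blobs(rng, n, shift=0.0, flip=0.0):
    """Toy two-class data; `flip` is the fraction of labels corrupted."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + y[:, None] * 3.0 + shift
    flip_mask = rng.random(n) < flip
    return X, np.where(flip_mask, 1 - y, y)


def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class label."""
    labels = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in labels])
    return labels, centroids


def predict(model, X):
    labels, centroids = model
    # Euclidean distance from every row to every class centroid.
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dist.argmin(axis=1)]


def sample_mixture(sources, weights, n_rows, rng):
    """Draw n_rows rows; each row comes from source k with probability weights[k]."""
    picks = rng.choice(len(sources), size=n_rows, p=weights)
    X_parts, y_parts = [], []
    for k, (X, y) in enumerate(sources):
        n_k = int((picks == k).sum())
        if n_k == 0:
            continue
        idx = rng.integers(0, len(X), size=n_k)
        X_parts.append(X[idx])
        y_parts.append(y[idx])
    return np.concatenate(X_parts), np.concatenate(y_parts)


def search_mixture_weights(sources, X_val, y_val, n_trials=50, n_rows=500, seed=0):
    """Random search (an assumed stand-in optimizer) over mixture weights.

    The objective is supervised: accuracy of the downstream model on
    held-out real data after training on the sampled synthetic mixture.
    """
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -1.0
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(len(sources)))  # a point on the simplex
        X_mix, y_mix = sample_mixture(sources, w, n_rows, rng)
        model = fit_centroids(X_mix, y_mix)
        score = float((predict(model, X_val) == y_val).mean())
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score


# Toy usage: a real validation set plus two synthetic sources, the second
# with heavily corrupted labels.
rng = np.random.default_rng(42)
X_val, y_val = make_blobs(rng, 300)
sources = [make_blobs(rng, 400, shift=0.2), make_blobs(rng, 400, flip=0.45)]
weights, score = search_mixture_weights(sources, X_val, y_val)
print("weights:", weights, "validation accuracy:", score)
```

The point of the sketch is the shape of the loop: candidate weights define a mixture, the mixture trains a downstream model, and performance on real held-out data scores the weights. A more sample-efficient search over the weights, such as Bayesian optimization, could replace the random search without changing that structure.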

