Distributionally Robust Data Join
Abstract: Suppose we are given two datasets: a labeled dataset and an unlabeled dataset that also contains auxiliary features not present in the first. What is the most principled way to use these datasets together to construct a predictor? The answer should depend on whether the datasets are generated by the same or different distributions over their shared features, and on how similar the test distribution will be to either of those distributions. In many applications, the two datasets will likely follow different distributions, but both may be close to the test distribution. We introduce the problem of building a predictor that minimizes the maximum loss over all probability distributions over the original features, auxiliary features, and binary labels whose Wasserstein distance is within $r_1$ of the empirical distribution over the labeled dataset and within $r_2$ of that over the unlabeled dataset. This can be thought of as a generalization of distributionally robust optimization (DRO) that allows for two data sources, one of which is unlabeled and may contain auxiliary features.
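The min-max problem described in the abstract can be sketched in symbols. The notation below is illustrative and does not come from the abstract itself: $\theta$ denotes the predictor's parameters, $\ell$ a loss function, $x$ the original features, $a$ the auxiliary features, $y$ the binary label, $\hat{P}_1$ and $\hat{P}_2$ the empirical distributions of the labeled and unlabeled datasets, $Q_{X,Y}$ and $Q_{X,A}$ the corresponding marginals of a candidate joint distribution $Q$, and $W$ a Wasserstein distance. Reading $r_1$ and $r_2$ as ball radii (at most, rather than exactly, that distance away) is also an assumption.

$$
\min_{\theta} \; \sup_{Q \in \mathcal{U}(r_1, r_2)} \; \mathbb{E}_{(x, a, y) \sim Q}\big[\ell(\theta; x, a, y)\big],
\qquad
\mathcal{U}(r_1, r_2) = \big\{ Q : W(Q_{X,Y}, \hat{P}_1) \le r_1, \; W(Q_{X,A}, \hat{P}_2) \le r_2 \big\}
$$

Under this reading, the uncertainty set $\mathcal{U}(r_1, r_2)$ contains joint distributions over $(x, a, y)$ that are simultaneously close to the labeled data on $(x, y)$ and to the unlabeled data on $(x, a)$.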