Revisiting Counterfactual Regression through the Lens of Gromov-Wasserstein Information Bottleneck
Abstract: As a promising individualized treatment effect (ITE) estimation method, counterfactual regression (CFR) maps individuals' covariates to a latent space and predicts their counterfactual outcomes. However, the selection bias between control and treatment groups often leaves the two groups' latent distributions imbalanced and degrades the method's performance. In this study, we revisit counterfactual regression through the lens of information bottleneck and propose a novel learning paradigm called Gromov-Wasserstein information bottleneck (GWIB). In this paradigm, we learn CFR by maximizing the mutual information between covariates' latent representations and outcomes while penalizing the kernelized mutual information between the latent representations and the covariates. We demonstrate that the upper bound of the penalty term can be implemented as a new regularizer consisting of $i)$ the fused Gromov-Wasserstein distance between the latent representations of different groups and $ii)$ the gap between the transport cost generated by the model and the cross-group Gromov-Wasserstein distance between the latent representations and the covariates. GWIB learns the CFR model through alternating optimization, suppressing selection bias while avoiding trivial latent distributions. Experiments on ITE estimation tasks show that GWIB consistently outperforms state-of-the-art CFR methods. To support the research community, we release our code at https://github.com/peteryang1031/Causal-GWIB.
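To make the Gromov-Wasserstein ingredient of the regularizer concrete, the following is a minimal sketch of the standard squared-loss Gromov-Wasserstein objective evaluated at a fixed coupling. It is not the authors' implementation: the function names (`pairwise_sq_dists`, `gw_objective`, `uniform_product_coupling`) and the toy points are hypothetical, the fused term and the alternating optimization are omitted, and the coupling is supplied by hand rather than optimized, so the value is only an upper bound on the actual distance.

```python
def pairwise_sq_dists(points):
    """Squared Euclidean distance matrix within a single point set."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]


def gw_objective(C1, C2, T):
    """Squared-loss Gromov-Wasserstein objective of a fixed coupling T.

    Computes sum_{i,j,k,l} (C1[i][k] - C2[j][l])**2 * T[i][j] * T[k][l].
    Minimizing this over all couplings with the prescribed marginals gives
    the Gromov-Wasserstein discrepancy; here T is given, not optimized.
    """
    n, m = len(C1), len(C2)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if T[i][j] == 0.0:
                continue  # zero-mass pairs contribute nothing
            for k in range(n):
                for l in range(m):
                    total += (C1[i][k] - C2[j][l]) ** 2 * T[i][j] * T[k][l]
    return total


def uniform_product_coupling(n, m):
    """Independent coupling of two uniform marginals (always feasible)."""
    return [[1.0 / (n * m)] * m for _ in range(n)]


# Hypothetical latent points of a control group and a treatment group.
# The second set is the first one rotated by 90 degrees and shifted,
# so the two sets have identical internal geometry.
z_control = [(0, 0), (1, 0), (0, 1)]
z_treated = [(5, 5), (5, 6), (4, 5)]

C_control = pairwise_sq_dists(z_control)
C_treated = pairwise_sq_dists(z_treated)

# Coupling that matches point i to point i, each carrying mass 1/3.
matched = [[1.0 / 3 if i == j else 0.0 for j in range(3)] for i in range(3)]
independent = uniform_product_coupling(3, 3)

print(gw_objective(C_control, C_treated, matched))      # zero: same geometry
print(gw_objective(C_control, C_treated, independent))  # strictly positive
```

Because the objective only compares distances within each set, the two point sets need not live in the same space, which is what allows a Gromov-Wasserstein term to relate latent representations to raw covariates of a different dimension. In practice the coupling would be optimized with a dedicated solver rather than fixed by hand.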