Causal machine learning for heterogeneous treatment effects in the presence of missing outcome data

Published 27 Dec 2024 in stat.ML and cs.LG | (2412.19711v2)

Abstract: When estimating heterogeneous treatment effects, missing outcome data can complicate treatment effect estimation, causing certain subgroups of the population to be poorly represented. In this work, we discuss this commonly overlooked problem and consider the impact that missing at random (MAR) outcome data has on causal machine learning estimators for the conditional average treatment effect (CATE). We propose two de-biased machine learning estimators for the CATE, the mDR-learner and mEP-learner, which address the issue of under-representation by integrating inverse probability of censoring weights into the DR-learner and EP-learner respectively. We show that under reasonable conditions, these estimators are oracle efficient, and illustrate their favorable performance through simulated data settings, comparing them to existing CATE estimators, including comparison to estimators which use common missing data techniques. We present an example of their application using the GBSG2 trial, exploring treatment effect heterogeneity when comparing hormonal therapies to non-hormonal therapies among breast cancer patients post surgery, and offer guidance on the decisions a practitioner must make when implementing these estimators.


Summary

The paper introduces novel mDR-learner and mEP-learner methods specifically designed to address the challenge of estimating heterogeneous treatment effects when outcome data is missing at random.
Conventional causal machine learning methods like DR-learner and EP-learner produce biased estimates with missing data, failing to account for under-representation in subgroups.
The mDR-learner and mEP-learner are shown to be oracle efficient under reasonable conditions, and simulations and a real-world analysis illustrate that they provide more stable and reliable estimates than traditional approaches.

An Examination of Causal Machine Learning Methods for Estimating Heterogeneous Treatment Effects with Missing Outcome Data

The paper explores the challenge of estimating heterogeneous treatment effects when outcome data is missing at random (MAR). This situation is common in practice, where data collection often yields incomplete outcomes. The authors critically assess the biases and inefficiencies that arise when conventional machine learning approaches are used to estimate the Conditional Average Treatment Effect (CATE) under these conditions.
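For reference, the two central objects can be written down as follows. This is one common formalization, and the notation is an assumption made for illustration rather than taken from the paper: X denotes baseline covariates, A a binary treatment, Y the outcome, Y(a) the potential outcome under treatment a, and C an indicator that the outcome is observed.

```latex
% Assumed notation (not from the paper): X covariates, A treatment,
% Y outcome, Y(a) potential outcome, C = 1 if Y is observed.
\begin{gather*}
\tau(x) = \mathbb{E}\left[\, Y(1) - Y(0) \mid X = x \,\right]
  \qquad \text{(CATE)} \\
Y \perp\!\!\!\perp C \mid A, X
  \qquad \text{(outcome missing at random given treatment and covariates)}
\end{gather*}
```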

A primary focus is the impact of MAR outcome data on existing causal machine learning estimators, specifically the DR-learner and EP-learner, which traditionally assume fully observed outcomes. The paper argues that these conventional methods do not adequately address under-representation, especially in subgroups with high dropout rates, resulting in biased CATE estimates. Common strategies, such as applying inverse probability of censoring weights (IPCW) or imputing the missing outcomes, are noted to introduce further inaccuracies when they rely on non-parametric machine learning fits, whose slow convergence rates carry over into the CATE estimate.
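The under-representation problem can be seen in a deliberately simplified simulation. The sketch below is not from the paper: it targets a population mean rather than the CATE, uses a single binary covariate, and plugs in the true observation probabilities instead of estimating them. All variable names and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Binary covariate defining two equally sized subgroups (assumed setup).
x = rng.binomial(1, 0.5, size=n)

# Outcome mean is 1 in subgroup x=0 and 3 in subgroup x=1, so the
# population mean is 2.
y = 1.0 + 2.0 * x + rng.normal(size=n)

# MAR mechanism: the probability of observing Y depends only on x.
# Subgroup x=1 drops out heavily and is under-represented.
g = np.where(x == 1, 0.3, 0.9)
c = rng.binomial(1, g)

# Complete-case analysis ignores the dropout mechanism.
complete_case_mean = y[c == 1].mean()

# IPCW re-weights each observed outcome by 1 / P(observed | x),
# using the true probabilities here for simplicity.
ipcw_mean = np.mean(c * y / g)

print(f"complete-case mean: {complete_case_mean:.3f}")
print(f"IPCW mean:          {ipcw_mean:.3f}")
```

Under this setup the complete-case mean should land near 1.5, because the high-outcome subgroup is mostly missing, while the IPCW mean should land near the true value of 2. In practice the weights must be estimated, which is where the slow convergence of flexible learners becomes a concern.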

To ameliorate these issues, the authors introduce two novel estimators: the mDR-learner and mEP-learner. These are enhanced versions of the DR-learner and EP-learner that incorporate correction mechanisms specifically designed for MAR outcome data. The mDR-learner modifies the pseudo-outcome construction by integrating IPCW, thereby adjusting for both the missing data and confounding. Similarly, the mEP-learner integrates IPCW into the EP-learner, whose infinite-dimensional targeting approach aims to stabilize CATE estimates even in the presence of extreme propensity scores.
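A minimal sketch of what an IPCW-augmented doubly robust pseudo-outcome might look like is given below. It is an assumed form, obtained by multiplying the residual term of the standard DR-learner pseudo-outcome by an inverse probability of censoring weight, and should not be read as the paper's exact definition. The function name `mdr_pseudo_outcome` and all argument names are hypothetical, and the nuisance estimates are taken as given rather than fitted.

```python
import numpy as np

def mdr_pseudo_outcome(y, a, c, pi_hat, g_hat, mu1_hat, mu0_hat):
    """IPCW-augmented doubly robust pseudo-outcome (illustrative sketch).

    y        outcome; may be NaN where c == 0
    a        binary treatment indicator
    c        1 if the outcome is observed, 0 if missing
    pi_hat   estimated P(A = 1 | X)
    g_hat    estimated P(C = 1 | A, X) at the observed treatment
    mu1_hat  estimated E[Y | A = 1, X] among observed outcomes
    mu0_hat  estimated E[Y | A = 0, X] among observed outcomes
    """
    y, a, c = (np.asarray(v, dtype=float) for v in (y, a, c))
    pi_hat, g_hat = np.asarray(pi_hat, float), np.asarray(g_hat, float)
    mu1_hat, mu0_hat = np.asarray(mu1_hat, float), np.asarray(mu0_hat, float)

    # Outcome-model prediction at the treatment actually received.
    mu_a = np.where(a == 1, mu1_hat, mu0_hat)

    # Residual is only defined for observed outcomes; set it to zero
    # elsewhere so missing values never propagate.
    resid = np.where(c == 1, y - mu_a, 0.0)

    treat_weight = (a - pi_hat) / (pi_hat * (1.0 - pi_hat))  # confounding
    censor_weight = c / g_hat                                # missingness

    return treat_weight * censor_weight * resid + mu1_hat - mu0_hat

# Toy inputs: three units, the third with a missing outcome.
phi = mdr_pseudo_outcome(
    y=[2.0, 1.0, np.nan],
    a=[1, 0, 1],
    c=[1, 1, 0],
    pi_hat=[0.5, 0.5, 0.5],
    g_hat=[0.5, 1.0, 0.5],
    mu1_hat=[1.5, 1.5, 1.5],
    mu0_hat=[1.0, 1.0, 1.0],
)
print(phi)  # [2.5 0.5 0.5]
```

In a DR-learner-style workflow, a pseudo-outcome of this kind would then be regressed on the covariates with a flexible learner to obtain the CATE estimate. Units with missing outcomes still contribute through the outcome-model difference, while observed units in high-dropout subgroups are up-weighted.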

The authors show that both the mDR-learner and mEP-learner are oracle efficient under reasonable conditions, and illustrate their empirical performance through simulations. The simulations indicate that these modified estimators outperform traditional methods in scenarios characterized by complex CATE or MAR patterns.

The practical application of the proposed methods is exemplified through an analysis of the GBSG2 trial, exploring treatment effect heterogeneity when comparing hormonal therapies to non-hormonal therapies among breast cancer patients post surgery. This real-world dataset underscores the importance of robust CATE estimation techniques, as missing outcome data is a common issue in clinical trials. The findings from the GBSG2 trial analysis suggest that the mDR-learner and mEP-learner provide stable and reliable CATE estimates, accounting for under-representation due to dropout.

The paper elucidates the theoretical advancements and algorithmic implementations required for the mDR-learner and mEP-learner, offering guidance on their application in practice. It also suggests potential areas for future research, such as extending the methodologies to accommodate more complex data structures, including post-baseline covariates and datasets with missing covariate information.

The implications of this work are significant for the field of causal inference, particularly in biomedical research, where treatment effect heterogeneity is of paramount interest, and data completeness cannot be guaranteed. Advances like the mDR-learner and mEP-learner can potentially transform how practitioners address the challenges posed by missing data, enabling more accurate estimation of treatment effects and consequently informing clinical decision-making.

By providing a robust framework for addressing the intricate interactions between causal inference and data incompleteness, this paper contributes meaningfully to the toolkit available for researchers dealing with heterogeneous treatment effects in complex datasets.
