Breaking the curse of horizon in conformal off-policy prediction

Determine how to break the curse of horizon in conformal off-policy prediction for sequential decision making. The goal is a prediction interval procedure that remains efficient over long decision horizons while retaining valid coverage guarantees for the potential outcome under a given target policy.

Background

Zhang et al. (2022) propose COPP (Conformal Off-Policy Prediction) to construct prediction intervals for individual outcomes under a target policy. The method extends weighted conformal prediction to the off-policy setting via an auxiliary policy and subsampling. In sequential decision making, it selects the subsample of trajectories on which the auxiliary policy matches the behavior policy at every stage, so that the subsample emulates the outcome distribution under the target policy, and then applies weighted conformal prediction to account for covariate shift.
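
The weighted conformal step can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name is hypothetical, and the sketch assumes that nonconformity scores (for example, absolute residuals) and likelihood-ratio weights have already been computed for the selected subsample. It only shows how a weighted quantile of the scores is formed, with the test point contributing a point mass at infinity.

```python
import math


def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Return the (1 - alpha) quantile of the weighted score distribution.

    `scores` and `weights` describe the calibration subsample; `test_weight`
    is the weight of the test point, which is treated as a point mass at
    +infinity. All inputs are assumed to be precomputed (hypothetical setup).
    """
    total = sum(weights) + test_weight
    cumulative = 0.0
    # Walk through the scores in increasing order, accumulating normalized mass.
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    # Not enough calibration mass: the interval is unbounded.
    return math.inf


# Equal weights reduce to the ordinary split conformal quantile.
q = weighted_conformal_quantile([1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1], 1, alpha=0.5)
print(q)
```

With absolute-residual scores, an interval would take the form of a point prediction plus or minus the returned quantile. When few trajectories survive the subsampling step, the calibration mass is small and the quantile can become infinite, which is one way the loss of efficiency shows up.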

A noted limitation is the curse of horizon, which is common in off-policy learning: as the number of stages increases, the probability of selecting a trajectory on which the auxiliary and behavior policies match at all stages decreases rapidly, so the effective sample size dwindles and the resulting intervals become less efficient. The authors propose importance-sampling and multi-sampling extensions to partially mitigate this issue, but acknowledge that the fundamental problem persists in long-horizon settings.
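
A small worked calculation makes the decay concrete. The sketch below rests on a simplifying assumption that is not taken from the paper: each stage matches independently with the same probability, so the chance of matching at all stages is that probability raised to the horizon. The function name and the numbers are illustrative only.

```python
def expected_matched_trajectories(n_trajectories, per_stage_match_prob, horizon):
    """Expected number of trajectories that match at every stage.

    Assumes (for illustration only) independent stages with a common
    per-stage match probability, giving a retention rate of p ** horizon.
    """
    return n_trajectories * per_stage_match_prob ** horizon


# Hypothetical setting: 1000 logged trajectories, 50% match chance per stage.
for horizon in (1, 5, 10, 20):
    kept = expected_matched_trajectories(1000, 0.5, horizon)
    print(f"horizon {horizon:2d}: about {kept:.3g} trajectories retained")
```

Under these assumptions, 1000 trajectories with a per-stage match probability of 0.5 leave 500 matches on average at horizon 1, but fewer than one (1000/1024) at horizon 10.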

References

It can be seen that the proposed method is able to achieve nominal coverage in general. Nonetheless, as commented in our paper, it suffers from the curse of horizon and would be inefficient in long-horizon settings. It remains unclear how to break the curse of horizon and we leave it as future work.

Conformal Off-policy Prediction (2206.06711 - Zhang et al., 2022) in Section 5, Synthetic Data Analysis (Results, Example 3)