
Offline Imitation Learning in Contextual Bandits

Updated 20 October 2025
  • The paper introduces a PIL-IML framework that leverages a surrogate objective to approximate the reward-maximizing policy while controlling high importance weight variance.
  • It employs reward-weighted cross-entropy when logging probabilities are missing, ensuring robust policy learning despite incomplete propensity scores and confounding variables.
  • Empirical evaluations on simulated and real-world datasets demonstrate improved performance over traditional IPWE methods, supporting safe deployment and effective diagnostic assessment.

Offline imitation learning in contextual bandits is the study and practice of inferring a reward-maximizing policy from static, logged data generated by a historical decision-making system operating under the contextual bandit paradigm. The learning challenge centers on how to use data comprising context–action–reward tuples (often with randomized action selection and possibly missing logging probabilities) to construct policies that either safely imitate, or improve upon, the historical behavior while avoiding the pitfalls of confounding, distribution shift, and high-variance estimation.

1. Problem Formulation and Contextual Bandit Model

The offline contextual bandit model considered in (Ma et al., 2019) collects data consisting of tuples $(x, a, r)$, where $x$ denotes the context, $a$ the action chosen by a logging policy $\mu$, and $r$ the observed (non-negative) reward. Notably, action selection is often randomized by $\mu$, and rewards are observed only for the actions actually taken. Furthermore, unobserved confounders $h$ that influence both $a$ and $r$ may be present, resulting in the data-generating process:

  • $(x, h) \sim P(x, h)$,
  • $a \sim \mu(a \mid x, h)$ (action selection possibly context- and confounder-dependent),
  • $r \sim P(r \mid x, a, h)$.

The learning objective is to construct a new policy $\pi(a \mid x)$ that, if deployed, will maximize the expected reward:

$V(\pi) = \mathbb{E}_{(x, h) \sim P(x, h)}\, \mathbb{E}_{a \sim \pi(\cdot \mid x)}\, \mathbb{E}\!\left[\, r \mid x, a, h \,\right],$

despite observing feedback only for the single action taken by the logging policy in each logged context, rather than for all possible actions.
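
To make this setting concrete, the minimal sketch below simulates a logged contextual-bandit dataset under a softmax logging policy with a hidden confounder. All dimensions, parameter names, and distributions here are illustrative assumptions, not specifications taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5000, 4, 3                       # samples, context dimension, actions

theta_mu = rng.normal(size=(d, k))         # logging-policy parameters (assumed)
theta_r = rng.normal(size=(d, k))          # reward-model parameters (assumed)

x = rng.normal(size=(n, d))                # observed contexts
h = rng.normal(size=n)                     # hidden confounder

logits = x @ theta_mu + h[:, None]         # confounder shifts action preferences
mu = np.exp(logits - logits.max(axis=1, keepdims=True))
mu /= mu.sum(axis=1, keepdims=True)        # softmax logging policy mu(a | x, h)

a = np.array([rng.choice(k, p=p) for p in mu])                   # logged actions
mean_r = (x @ theta_r)[np.arange(n), a] + h                      # confounder affects reward too
r = np.clip(mean_r + rng.normal(scale=0.5, size=n), 0.0, None)   # non-negative rewards

# propensities of the chosen actions; in practice these may not be logged at all
logged = {"x": x, "a": a, "r": r, "mu": mu[np.arange(n), a]}
```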

2. Inverse Probability Weighted Estimation and Its Limitations

A canonical approach for off-policy evaluation and imitation in offline contextual bandits is inverse probability weighted estimation (IPWE):

$\hat{V}_{\mathrm{IPW}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\, r_i,$

where $w_i = \pi(a_i \mid x_i) / \mu(a_i \mid x_i)$ is the importance weight. This estimator is unbiased under known propensities but is vulnerable to:

  • Unavailable or missing logging probabilities (engineering/collection limitations),
  • Small logging probabilities: if $\mu(a_i \mid x_i)$ is close to zero, $w_i$ becomes large, leading to extremely high variance in $\hat{V}_{\mathrm{IPW}}$, with single rare events potentially dominating the sum and rendering confidence intervals and significance testing unreliable (a minimal numerical illustration follows below).
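
The following sketch computes the IPWE and prints the largest importance weight, illustrating how near-zero propensities produce heavy-tailed weights. The synthetic probabilities and rewards are assumptions for illustration only.

```python
import numpy as np

def ipwe(pi_probs, mu_probs, rewards):
    """(1/n) * sum_i (pi(a_i|x_i) / mu(a_i|x_i)) * r_i over the logged data."""
    w = pi_probs / mu_probs                 # importance weights
    return np.mean(w * rewards), w

rng = np.random.default_rng(1)
n = 10_000
mu_probs = rng.uniform(0.001, 0.9, size=n)   # some propensities close to zero
pi_probs = rng.uniform(0.05, 0.9, size=n)    # target-policy probabilities
rewards = rng.binomial(1, 0.3, size=n).astype(float)

v_hat, w = ipwe(pi_probs, mu_probs, rewards)
print(f"IPWE estimate: {v_hat:.3f}")
print(f"largest importance weight: {w.max():.1f}")  # rare events can dominate the sum
```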

3. Policy Improvement Objectives and Policy Imitation Regularization

To address IPWE's variance and applicability limitations, a policy improvement objective (PIL) is proposed. PIL is a lower-bound surrogate to IPWE, derived from the inequality $w \ge 1 + \log w$ (valid for all $w > 0$). With available logging probabilities, a useful lower bound is:

$\hat{V}_{\mathrm{PIL}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} r_i \left( 1 + \log \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)} \right) \;\le\; \hat{V}_{\mathrm{IPW}}(\pi),$

whereas with missing $\mu(a_i \mid x_i)$, the formulation reverts to reward-weighted cross-entropy:

$\hat{V}_{\mathrm{CE}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} r_i \log \pi(a_i \mid x_i).$

Policy Imitation Learning (IML) regularizes this objective:

$\hat{L}_{\mathrm{IML}}(\pi) = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)},$

which is an empirical estimate of the KL divergence $\mathrm{KL}\big(\mu(\cdot \mid x) \,\|\, \pi(\cdot \mid x)\big)$ averaged over contexts $x$. Minimizing IML encourages $\pi$ to mimic the logging policy, thereby reducing the variance of the importance weights. Explicitly, a second-order Taylor expansion reveals:

$\mathrm{KL}\big(\mu(\cdot \mid x) \,\|\, \pi(\cdot \mid x)\big) = \mathbb{E}_{a \sim \mu}\!\left[ -\log w \right] \approx \tfrac{1}{2}\, \mathbb{E}_{a \sim \mu}\!\left[ (w - 1)^2 \right] = \tfrac{1}{2}\, \mathrm{Var}_{a \sim \mu}[w],$

making the average IML loss a direct proxy for the variance of the importance weights, and hence for the variance of IPWE.
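
A quick numerical check of this relationship, under an assumed discrete action distribution with $\pi$ kept close to $\mu$ (purely illustrative), might look like:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 5
mu_dist = rng.dirichlet(np.ones(k))                         # logging policy over k actions
pi_dist = 0.95 * mu_dist + 0.05 * rng.dirichlet(np.ones(k)) # target policy close to mu

a = rng.choice(k, size=200_000, p=mu_dist)                  # actions sampled from mu
w = pi_dist[a] / mu_dist[a]                                 # importance weights

print(-np.mean(np.log(w)))   # empirical IML loss, i.e. KL(mu || pi)
print(0.5 * np.var(w))       # half the variance of the importance weights
# the two printed quantities should be close when pi stays near mu
```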

The unified learning objective is:

$\max_{\pi} \;\; \hat{V}_{\mathrm{PIL}}(\pi) - \lambda\, \hat{L}_{\mathrm{IML}}(\pi),$

where $\lambda \ge 0$ governs the exploitation–exploration (variance–bias) balance. When logging probabilities are unavailable, reward-weighted cross-entropy (as in standard supervised learning) is shown to be a justifiable surrogate.
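
As a concrete illustration, a minimal NumPy sketch of this objective for a linear softmax policy might look as follows. The parameterization and variable names are assumptions, and in practice the objective would be maximized with an autodiff framework rather than evaluated by hand.

```python
import numpy as np

def pil_iml_objective(theta, x, a, r, mu, lam):
    """Evaluate V_PIL(pi_theta) - lam * L_IML(pi_theta) on logged data (x, a, r, mu)."""
    logits = x @ theta
    logits -= logits.max(axis=1, keepdims=True)                    # numerical stability
    log_pi_all = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_pi = log_pi_all[np.arange(len(a)), a]                      # log pi(a_i | x_i)
    log_w = log_pi - np.log(mu)                                    # log importance weights
    v_pil = np.mean(r * (1.0 + log_w))                             # lower bound on IPWE
    l_iml = -np.mean(log_w)                                        # empirical KL(mu || pi)
    return v_pil - lam * l_iml

# With missing propensities, v_pil reduces to the reward-weighted cross-entropy
# np.mean(r * log_pi), up to terms that do not depend on theta.
```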

4. Probability Logging, Diagnosability, and Confounding

Probability logging, the practice of storing the propensity $\mu(a_i \mid x_i)$ alongside each logged action, serves two crucial purposes:

  • Bias Correction: With complete action propensities, unbiased IPWE or PIL can be performed.
  • Model Diagnosability: High IML loss (or high perplexity) is a diagnostic for either confounding (missing variables $h$) or policy-class misspecification, since the logging policy cannot be well explained (or imitated) using the available model class. Thus, IML underfitting flags hidden influences and motivates model refinement or careful policy deployment.

The framework is thus adaptive: with missing probabilities, it defaults to robustly regularized cross-entropy; with full propensities, it can both debias and analyze model misspecification.
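
A minimal sketch of such a diagnostic, computing the IML loss and the imitation policy's perplexity on the logged actions, is shown below; the fallback behaviour and naming are assumptions for illustration.

```python
import numpy as np

def iml_diagnostic(log_pi_logged, log_mu_logged=None):
    """Return (IML loss, perplexity of pi on the logged actions)."""
    perplexity = float(np.exp(-np.mean(log_pi_logged)))
    if log_mu_logged is not None:
        iml = float(-np.mean(log_pi_logged - log_mu_logged))   # empirical KL(mu || pi)
    else:
        iml = float(-np.mean(log_pi_logged))                   # cross-entropy fallback
    return iml, perplexity

# Perplexity far above what the action space and policy class should allow points to
# hidden confounders or misspecification, and argues for cautious deployment.
```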

5. Simulation Results and Empirical Insights

Simulation studies validate the framework on both classical and real-world datasets:

  • Simpson’s Paradox (Kidney Stone Data): By modeling both observed (size) and hidden confounders during randomized assignment, the PIL-IML approach correctly re-weights to recommend the effective treatment, overcoming the paradox that plagues unadjusted analyses.
  • UCI Multiclass-to-Bandit Conversions: Benchmarking against Q-learning, vanilla IPWE, and doubly robust estimators, the PIL-IML approach demonstrates lower variance and superior performance, particularly under model misspecification. The reward-weighted cross-entropy surrogate proves robust when action propensities are missing.
  • Criteo Counterfactual Data: Facing extreme heavy-tail importance weights (up to 49,000), IPWE is unusable without variance control. PIL-IML, along with weight clipping and bootstrapping, produces usable estimates, and reveals via IML that essential confounders are not available in the dataset—a practical diagnostic for real-world data quality limitations.

Moreover, the method supports IML-resampling: using the learned imitation policy to resample the logged data, thereby increasing exploration and improving subsequent learning. A sketch of one such resampling scheme follows below.
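
One plausible way to implement such resampling is sketched here: logged examples are re-drawn with probabilities proportional to the learned imitation policy's importance weights. The exact scheme is an assumption for illustration, not a procedure taken verbatim from the paper.

```python
import numpy as np

def iml_resample(x, a, r, mu, pi_hat_probs, rng=None):
    """Resample the logged data according to the learned imitation policy pi_hat."""
    rng = rng or np.random.default_rng()
    w = pi_hat_probs / mu                  # imitation-policy importance weights
    p = w / w.sum()                        # normalized resampling distribution
    idx = rng.choice(len(a), size=len(a), replace=True, p=p)
    # the resampled log uses pi_hat's probabilities as its new propensities
    return x[idx], a[idx], r[idx], pi_hat_probs[idx]
```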

6. Practical Implications and Deployment Considerations

  • Variance Control: Explicit regularization by KL-divergence (IML) is effective at controlling the variance of offline policy improvement, which is especially critical in high-dimensional action spaces or sparse data regimes.
  • Policy Class Selection: The IML diagnostic (high perplexity) provides a principled way to assess whether the chosen policy parameterization is adequate or confounded, guiding both model selection and safe policy deployment.
  • Handling Missing Data: The explicit connection between cross-entropy loss and the IPWE surrogate allows practitioners to deploy offline imitation learning without strict logging requirements, making the approach robust to operational and engineering challenges.
  • Future Optimization: Weight clipping, reward-weighted losses, and IML-resampling are recommended when deploying in environments with heavy-tailed importance weights or suspected confounding (a minimal clipping sketch follows below).
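
For reference, a minimal sketch of clipped importance weighting, one of the variance-control measures listed above, is given here; the clipping threshold is an illustrative choice, not a recommended value.

```python
import numpy as np

def clipped_ipwe(pi_probs, mu_probs, rewards, w_max=100.0):
    """IPWE with importance weights capped at w_max (biased, but lower variance)."""
    w = np.minimum(pi_probs / mu_probs, w_max)   # cap heavy-tailed weights
    return float(np.mean(w * rewards))
```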

7. Summary Table: Key Techniques and Formulae

| Technique / Concept | Mathematical Expression | Role / Purpose |
| --- | --- | --- |
| IPWE | $\hat{V}_{\mathrm{IPW}}(\pi) = \frac{1}{n}\sum_i \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\, r_i$ | Unbiased policy value estimation |
| Weight clipping, PIL | see the PIL expressions above | Variance reduction, lower-bounding |
| Reward-weighted cross-entropy (CE) | $\frac{1}{n}\sum_i r_i \log \pi(a_i \mid x_i)$ | Surrogate when propensities are missing |
| IML regularization | $\hat{L}_{\mathrm{IML}}(\pi) = -\frac{1}{n}\sum_i \log \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}$, approximates weight variance | Variance control, misspecification detection |
| Total objective | $\max_\pi \hat{V}_{\mathrm{PIL}}(\pi) - \lambda\, \hat{L}_{\mathrm{IML}}(\pi)$ | Combines policy improvement and variance control |
| Policy diagnosability via IML | High IML loss $\Rightarrow$ confounding or misspecification | Data/model quality assessment |
| Greedy policy update | Gradient of the PIL-IML objective | Local equivalence to natural gradient |

This framework—anchored in convex surrogates for high-variance estimators and regularization by policy imitation—provides both sound theoretical guarantees and empirical evidence for its effectiveness in offline imitation learning for contextual bandits. The diagnostic properties of IML loss, adaptability to incomplete logs, and robust empirical performance across datasets establish it as a foundational approach for real-world, reliable offline policy learning.
