Recourse Bandit Problem Overview
- The recourse bandit problem is a sequential decision-making framework that integrates action selection with minimal, feasible modifications to mutable features under explicit constraints.
- Key methodologies such as RLinUCB and GLRB employ ridge estimators and confidence sets to balance reward maximization with the cost of recourse.
- Hybrid algorithms like HR-Bandit and LIBRA leverage human expert or LLM input to accelerate early-stage learning while maintaining robust long-term performance guarantees.
The recourse bandit problem addresses sequential decision-making scenarios in which, at each round, an agent observes a context partitioned into immutable and mutable features, then must select an action (arm) and a minimal, feasible recourse—i.e., a change to the mutable features—subject to explicit constraints. The objective is to optimize cumulative reward by judiciously choosing both the action and context recourse while trading off reward improvement against the cost or feasibility of recourse. This structure arises naturally in personalized interventions, such as healthcare, credit, or behavioral recommender systems, where recommendations involve both an actionable suggestion (“change your behavior in this minimally disruptive way”) and a decision (“take treatment a versus b”).
1. Formal Definition and Foundations
At each time $t = 1, \dots, T$, the learner receives a context $x_t \in \mathbb{R}^d$, decomposed as $x_t = (u_t, v_t)$ for immutable ($u_t \in \mathbb{R}^{d_u}$) and mutable ($v_t \in \mathbb{R}^{d_v}$) features, with $d = d_u + d_v$ (Cao et al., 2024, Cao et al., 17 Jan 2026). The agent chooses:
- An action $a_t \in \mathcal{A}$, with $|\mathcal{A}| = K$.
- A feasible recourse $\Delta_t \in \mathbb{R}^{d_v}$ such that $\|\Delta_t\| \le B$, with budget $B > 0$ and the norm user-chosen (e.g., $\ell_1$, weighted $\ell_2$).
The reward follows a generalized linear model (GLM):
$$r_t = \mu\big(\langle \theta_{a_t}, \tilde{x}_t \rangle\big) + \eta_t, \qquad \tilde{x}_t = (u_t,\; v_t + \Delta_t),$$
where $\theta_a \in \mathbb{R}^d$ is unknown, $\mu$ is a strictly increasing, Lipschitz link function, and $\eta_t$ is noise ($\sigma$-sub-Gaussian).
Recourse regret quantifies performance:
$$R(T) = \sum_{t=1}^{T} \Big[\mu\big(\langle \theta_{a_t^*}, (u_t,\; v_t + \Delta_t^*) \rangle\big) - \mu\big(\langle \theta_{a_t}, (u_t,\; v_t + \Delta_t) \rangle\big)\Big],$$
where $(a_t^*, \Delta_t^*)$ are the optimal choices given the realized context and model.
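The interaction protocol above can be sketched as a minimal simulator. The dimensions, the identity link, and the noise scale below are illustrative assumptions, not values from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: d_u immutable, d_v mutable features, K arms.
d_u, d_v, K, B = 3, 2, 4, 1.0
d = d_u + d_v
theta = rng.normal(size=(K, d))          # unknown arm parameters theta_a

def reward(a, x_tilde):
    """Linear reward (identity link mu) plus sub-Gaussian noise."""
    return theta[a] @ x_tilde + rng.normal(scale=0.1)

def apply_recourse(x, delta):
    """Recourse modifies only the mutable block of the context."""
    x_tilde = x.copy()
    x_tilde[d_u:] += delta
    return x_tilde

# One round of the recourse bandit interaction protocol.
x = rng.normal(size=d)                   # context (u_t, v_t)
delta = np.zeros(d_v)                    # e.g. the null recourse
assert np.linalg.norm(delta) <= B        # feasibility ||delta|| <= B
a = 0                                    # some chosen arm
r = reward(a, apply_recourse(x, delta))
```

In the full problem the learner chooses `a` and `delta` adaptively each round; here they are fixed only to show the round structure.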
2. Algorithms for the Recourse Bandit Problem
RLinUCB and Generalizations
The Recourse Linear UCB (RLinUCB) (Cao et al., 2024) extends classic UCB-style exploration to (arm, recourse) pairs. At each round, for every arm $a$, it maintains a ridge estimator $\hat{\theta}_{a,t}$ and design matrix $V_{a,t} = \lambda I + \sum_{s \le t:\, a_s = a} \tilde{x}_s \tilde{x}_s^{\top}$. The key optimization is
$$(a_t, \Delta_t) \in \arg\max_{a,\; \|\Delta\| \le B} \; \langle \hat{\theta}_{a,t}, \tilde{x}_t(\Delta) \rangle + \beta_{a,t} \, \|\tilde{x}_t(\Delta)\|_{V_{a,t}^{-1}},$$
where $\tilde{x}_t(\Delta)$ denotes the context after applying recourse $\Delta$ to the mutable block, and the confidence radius $\beta_{a,t}$ scales with the data and regularization.
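A minimal sketch of the selection and update steps, assuming linear (identity-link) rewards and a finite candidate set of recourses in place of solving the constrained optimistic problem exactly; `beta` stands in for the confidence radius:

```python
import numpy as np

def rlinucb_select(x, hat_theta, V_inv, candidate_deltas, d_u, beta=1.0):
    """UCB over (arm, recourse) pairs: estimated reward plus an
    exploration bonus beta * ||x_tilde||_{V_a^{-1}}. Real implementations
    optimize over the whole recourse ball rather than a candidate grid."""
    best, best_ucb = None, -np.inf
    for a, (th, Vi) in enumerate(zip(hat_theta, V_inv)):
        for delta in candidate_deltas:
            x_tilde = x.copy()
            x_tilde[d_u:] += delta          # recourse acts on mutable block
            ucb = th @ x_tilde + beta * np.sqrt(x_tilde @ Vi @ x_tilde)
            if ucb > best_ucb:
                best, best_ucb = (a, delta), ucb
    return best

def ridge_update(V, b, x_tilde, r):
    """Rank-one update of a per-arm design matrix (initialized to
    lambda * I) and the resulting ridge estimate."""
    V = V + np.outer(x_tilde, x_tilde)
    b = b + r * x_tilde
    return V, b, np.linalg.solve(V, b)
```

The exploration bonus makes under-sampled (arm, recourse) directions look optimistic, which is what drives informative exploration in the optimization above.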
For GLM rewards, the Generalized Linear Recourse Bandit (GLRB) (Cao et al., 17 Jan 2026) maintains a self-normalized confidence set for each arm, computing a regularized maximum-likelihood estimate of $\theta_a$ and updating it after each observed reward.
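As an illustration of the per-arm GLM estimation step, the sketch below fits a regularized maximum-likelihood estimate by gradient descent, assuming a sigmoid link; the papers' link function and solver may differ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_glm_mle(X, r, lam=1.0, steps=200, lr=0.5):
    """Gradient descent on the ridge-penalized negative log-likelihood
    of a sigmoid-link GLM: an illustrative stand-in for GLRB's per-arm
    estimator, refit after each observed reward.

    X : (n, d) matrix of played (recoursed) contexts for this arm
    r : (n,) vector of observed rewards
    """
    theta = np.zeros(X.shape[1])
    n = len(r)
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - r) + lam * theta
        theta = theta - lr * grad / n
    return theta
```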
Both approaches solve at each round an optimistic recourse optimization (ORO-Arm), balancing estimated return and uncertainty to encourage informative exploration.
Human-AI and LLM-Augmented Variants
HR-Bandit (Cao et al., 2024) introduces a hybrid mechanism: at rounds where the algorithm’s own UCB uncertainty is high, a black-box human expert is queried for (action, recourse) proposals. A data-driven selection rule with a trust parameter decides which suggestion to implement, providing performance guarantees under bounded human suboptimality. The querying frequency is adaptively controlled so that it remains finite in the long run.
LIBRA (Cao et al., 17 Jan 2026) generalizes this paradigm, replacing the human expert with an LLM. Each round, if bandit confidence is insufficient (the confidence-interval width exceeds a threshold), the LLM is queried for a recommendation; otherwise, the algorithm proceeds autonomously. This enables LLM-guided warm starts and domain-knowledge injection while avoiding over-reliance on possibly suboptimal external suggestions.
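Both hybrid schemes share the same gating pattern, which can be sketched as follows; the width threshold and the rule for adopting the oracle’s arm are illustrative simplifications of the papers’ trust-based selection rules:

```python
import numpy as np

def gated_select(estimates, bonuses, oracle_arm, width_threshold):
    """Query-gated action choice: defer to the external oracle (human
    expert or LLM) only while the bandit's own uncertainty is large;
    otherwise play the UCB-optimal arm autonomously."""
    if bonuses.max() > width_threshold:
        return oracle_arm, True           # oracle consulted this round
    return int(np.argmax(estimates + bonuses)), False
```

Because the uncertainty bonuses shrink as data accumulates, oracle queries die out over time and the procedure converges to pure bandit behavior, consistent with the papers’ bounded-query guarantees.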
3. Theoretical Guarantees and Regret Analysis
Rigorous analysis yields matching upper and lower regret bounds:
- Both RLinUCB and GLRB achieve, with high probability, recourse regret of order $\tilde{O}(d\sqrt{T})$ (Cao et al., 2024, Cao et al., 17 Jan 2026).
- With an external oracle (expert or LLM) whose proposals are uniformly $\epsilon$-suboptimal, hybrid algorithms (HR-Bandit and LIBRA) satisfy a warm-start guarantee of the schematic form
$$R(T) = \tilde{O}\big(\min\{\epsilon T,\; d\sqrt{T}\}\big)$$
for HR-Bandit, with similar scaling for LIBRA, reflecting acceleration when the external agent is high-quality while preserving worst-case safety.
- Effort guarantees: HR-Bandit requests human input only a bounded number of times per arm; LIBRA’s LLM queries grow sublinearly in the horizon $T$, so both transition to full bandit autonomy (Cao et al., 2024, Cao et al., 17 Jan 2026).
- Robustness guarantee: If the external expert is suboptimal or adversarial, these algorithms never perform worse (up to constants) than their pure bandit counterparts.
Lower bounds: Any algorithm, even with oracle hints, must incur $\Omega(d\sqrt{T})$ regret in the worst case, matching (up to logarithmic factors) the upper bounds developed for RLinUCB, GLRB, HR-Bandit, and LIBRA.
4. Methodological Comparison
| Algorithm | Expert/LLM Involvement | Regret (Best Case) | Human/LLM Query Complexity |
|---|---|---|---|
| RLinUCB | None | $\tilde{O}(d\sqrt{T})$ | 0 |
| HR-Bandit | Human; adaptive, on high uncertainty | $\tilde{O}(\min\{\epsilon T,\; d\sqrt{T}\})$ | Bounded per arm |
| GLRB | None (GLM rewards) | $\tilde{O}(d\sqrt{T})$ | 0 |
| LIBRA | LLM; adaptive, on confidence gap | $\tilde{O}(\min\{\epsilon T,\; d\sqrt{T}\})$ | Sublinear in $T$ |
A plausible implication is that hybridization with external experts or LLMs is effective primarily during initial learning phases, after which bandit-based learning dominates performance.
5. Practical Applications and Experimental Findings
Experiments on both synthetic and real healthcare datasets (Cao et al., 2024, Cao et al., 17 Jan 2026) validate the theoretical properties:
- Healthcare case studies: Tasks involve contexts split into immutable and mutable features (e.g., demographic data and lifestyle behaviors), with linear or GLM reward models per intervention (e.g., treatment/no-treatment, different medications).
- Baseline (no recourse) algorithms like LinUCB incur linear regret, being unable to optimize mutable features.
- RLinUCB/GLRB achieve sublinear regret consistent with the $\tilde{O}(d\sqrt{T})$ theory.
- HR-Bandit and LIBRA further accelerate early-stage learning (20–30% lower regret initially), quickly “warm-start” with domain knowledge, and achieve long-term regret competitive with the bandit-only approach.
- Human/LLM queries are sharply limited (HR-Bandit: a bounded number per arm; LIBRA: sublinear in the number of rounds).
- Qualitative outcomes: Bandit-based recourse algorithms learn to steer mutable features toward intervention-relevant targets; expert-informed hybrids align initial suggestions with clinical or domain heuristics before fully adapting to empirical feedback.
6. Fundamental Limits and Extensions
Established lower bounds demonstrate that the recourse bandit setting is intrinsically harder than standard contextual bandits when the mutable dimension or recourse budget is large. The present algorithms, including RLinUCB, GLRB, HR-Bandit, and LIBRA, achieve near-optimal regret up to logarithmic factors (Cao et al., 17 Jan 2026).
Promising directions for further study include scalable optimization procedures for high-dimensional or structured recourse constraints, non-linear reward settings beyond the GLM, and principled ways to fuse expert/LLM hints under partial observability or adversarial feedback. Robustness to strategic compliance, missing data, and active feature acquisition also remains open.
7. Connections and Significance
Recourse bandits unify algorithmic recourse and contextual bandit learning, offering a formal framework for sequential, counterfactually informed decision-making under real-world constraints. The integration of human expertise or LLM advice with statistically grounded learning achieves fast-initialization, sustained improvement, and robust autonomy, especially in settings—such as medicine—where cautious, adaptive personalization is essential. Theoretical upper and lower bounds clarify the fundamental complexity of the task and demonstrate the competitiveness of current methodologies (Cao et al., 2024, Cao et al., 17 Jan 2026).