
Recourse Bandit Problem Overview

Updated 24 January 2026
  • Recourse Bandit Problem is a sequential decision-making framework that integrates action selection with minimal, feasible modifications to mutable features under explicit constraints.
  • Key methodologies such as RLinUCB and GLRB employ ridge estimators and confidence sets to balance reward maximization with the cost of recourse.
  • Hybrid algorithms like HR-Bandit and LIBRA leverage human expert or LLM input to accelerate early-stage learning while maintaining robust long-term performance guarantees.

The recourse bandit problem addresses sequential decision-making scenarios in which, at each round, an agent observes a context partitioned into immutable and mutable features, then must select an action (arm) and a minimal, feasible recourse—i.e., a change to the mutable features—subject to explicit constraints. The objective is to optimize cumulative reward by judiciously choosing both the action and context recourse while trading off reward improvement against the cost or feasibility of recourse. This structure arises naturally in personalized interventions, such as healthcare, credit, or behavioral recommender systems, where recommendations involve both an actionable suggestion (“change your behavior in this minimally disruptive way”) and a decision (“take treatment a versus b”).

1. Formal Definition and Foundations

At each time $t$, the learner receives a context $x_t \in \mathbb{R}^d$, decomposed as $x_t = (x_{t,I}, x_{t,M})$ into immutable ($d_I$-dimensional) and mutable ($d_M$-dimensional) features, with $d = d_I + d_M$ (Cao et al., 2024, Cao et al., 17 Jan 2026). The agent chooses:

  • An action $a_t \in \mathcal{A}$, with $|\mathcal{A}| = K$.
  • A feasible recourse $\check{x}_{t,M}$ such that $\|\check{x}_{t,M} - x_{t,M}\| \leq \gamma$, with $\gamma > 0$ and the norm user-chosen (e.g., $\ell_2$, weighted $\ell_\infty$).

The reward follows a generalized linear model (GLM):

$$r_t = \mu\left( (x_{t,I}, \check{x}_{t,M})^\top \theta_{a_t}^\star \right) + \xi_t,$$

where $\theta_{a_t}^\star \in \mathbb{R}^d$ is unknown, $\mu(\cdot)$ is a strictly increasing, Lipschitz link function, and $\xi_t$ is $\sigma$-sub-Gaussian noise.
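A minimal simulation of one round of this protocol, assuming a logistic link and an $\ell_2$ recourse ball. All dimensions, parameter values, and function names here are illustrative, not taken from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)
d_I, d_M = 3, 2                       # immutable / mutable feature dimensions
d, K, gamma, sigma = d_I + d_M, 4, 0.5, 0.1
theta_star = rng.normal(size=(K, d))  # unknown per-arm parameters

def project_recourse(x_M, x_M_new, gamma):
    """Project a proposed mutable vector onto the l2 recourse ball around x_M."""
    delta = x_M_new - x_M
    norm = np.linalg.norm(delta)
    if norm <= gamma:
        return x_M_new
    return x_M + delta * (gamma / norm)

def play_round(x_t, a_t, x_M_new):
    """Apply a feasible recourse, then draw a reward from the logistic-link GLM."""
    x_I, x_M = x_t[:d_I], x_t[d_I:]
    x_check_M = project_recourse(x_M, x_M_new, gamma)
    x_check = np.concatenate([x_I, x_check_M])
    mean = 1.0 / (1.0 + np.exp(-x_check @ theta_star[a_t]))  # mu = logistic link
    return mean + sigma * rng.normal(), x_check

x_t = rng.normal(size=d)
reward, x_check = play_round(x_t, a_t=0, x_M_new=x_t[d_I:] + 1.0)
```

The projection step enforces the feasibility constraint $\|\check{x}_{t,M} - x_{t,M}\| \leq \gamma$ by construction, while the immutable block is passed through untouched.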

Recourse regret quantifies performance:

$$\text{Regret}(T) = \mathbb{E}\left[ \sum_{t=1}^{T} r_t(x_t^\star, a_t^\star) - r_t(\check{x}_t, a_t) \right],$$

where $(x_t^\star, a_t^\star)$ are the optimal recourse and action given the realized context and model.

2. Algorithms for the Recourse Bandit Problem

RLinUCB and Generalizations

The Recourse Linear UCB algorithm (RLinUCB) (Cao et al., 2024) extends classic UCB-style exploration to (arm, recourse) pairs. At each round, for every arm $a$, it maintains a ridge estimator $\hat{\theta}_{t-1,a}$ and design matrix $V_{t-1,a}$. The key optimization is

$$(a_t, \Delta_t) = \arg\max_{a \in \mathcal{A},\, \|\Delta\| \leq \gamma} \left[ (x_t + \Delta)^\top \hat{\theta}_{t-1,a} + \beta_{t,a}\, \|x_t + \Delta\|_{V_{t-1,a}^{-1}} \right],$$

where the confidence radius $\beta_{t,a}$ scales with the accumulated data and regularization, and $\Delta$ perturbs only the mutable coordinates of the context.
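The objective above is non-concave in $\Delta$ in general. As a minimal sketch, the selection step can be approximated by scoring random candidate recourses inside the ball; this is a crude stand-in for the actual ORO solver, and the function name, sampling scheme, and parameters are illustrative assumptions:

```python
import numpy as np

def rlinucb_step(x_t, arms, gamma, beta, d_I, rng, n_cand=256):
    """Pick an optimistic (arm, recourse) pair.

    arms: list of (theta_hat, V_inv) pairs from per-arm ridge regression.
    The inner maximization over the recourse ball is approximated by scoring
    random candidate perturbations (plus the no-change option) that touch
    only the mutable block x_t[d_I:].
    """
    d = x_t.shape[0]
    # candidate perturbations sampled roughly uniformly in the l2 ball
    deltas = rng.normal(size=(n_cand, d - d_I))
    radii = gamma * rng.uniform(size=(n_cand, 1)) ** (1.0 / (d - d_I))
    deltas *= radii / np.maximum(
        np.linalg.norm(deltas, axis=1, keepdims=True), 1e-12)
    deltas = np.vstack([np.zeros(d - d_I), deltas])

    best_ucb, best_arm, best_delta = -np.inf, None, None
    for a, (theta_hat, V_inv) in enumerate(arms):
        for delta_M in deltas:
            z = x_t.copy()
            z[d_I:] += delta_M
            # optimistic index: estimated reward plus exploration bonus
            ucb = z @ theta_hat + beta * np.sqrt(z @ V_inv @ z)
            if ucb > best_ucb:
                best_ucb, best_arm, best_delta = ucb, a, delta_M
    return best_arm, best_delta

rng = np.random.default_rng(1)
arms = [(rng.normal(size=5), np.eye(5)) for _ in range(3)]
a_t, delta_t = rlinucb_step(rng.normal(size=5), arms, gamma=0.5, beta=1.0,
                            d_I=3, rng=rng)
```

Any chosen recourse stays feasible by construction, since every scored candidate lies inside the $\ell_2$ ball of radius $\gamma$.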

For GLM rewards, the Generalized Linear Recourse Bandit (GLRB) (Cao et al., 17 Jan 2026) maintains a self-normalized confidence set $\Theta_{t,a}$ for each arm, using a regularized MLE updated after each observed reward.

Both approaches solve at each round an optimistic recourse optimization (ORO-Arm), balancing estimated return and uncertainty to encourage informative exploration.

Human-AI and LLM-Augmented Variants

HR-Bandit (Cao et al., 2024) introduces a hybrid mechanism: at rounds where the algorithm’s own UCB uncertainty is high, a black-box human expert is queried for (action, recourse) proposals. A data-driven selection rule with trust parameter $\zeta$ decides which suggestion to implement, providing performance guarantees under bounded human suboptimality. The querying frequency is adaptively controlled to remain finite in the long run.

LIBRA (Cao et al., 17 Jan 2026) generalizes this paradigm, replacing the human expert with an LLM. Each round, if bandit confidence is insufficient (confidence-interval width $> \Delta$), the LLM is queried for a recommendation; otherwise, the algorithm proceeds autonomously. This enables LLM-guided warm-starts and domain knowledge injection, while avoiding over-reliance on possibly suboptimal external suggestions.
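Both hybrids share a confidence-gated query rule. The sketch below is a deliberately simplified caricature of that gate; the `ask_oracle` callback and the thresholding interface are illustrative assumptions, not the papers' APIs:

```python
def gated_decision(ucb, lcb, width_threshold, own_choice, ask_oracle):
    """Confidence gate shared (in spirit) by HR-Bandit and LIBRA: defer to the
    external oracle (human expert or LLM) only while the bandit's confidence
    interval for its own proposal is wider than the threshold.

    ask_oracle: zero-argument callable returning an (action, recourse) pair.
    Returns (chosen (action, recourse), whether the oracle was consulted).
    """
    if ucb - lcb > width_threshold:   # still too uncertain: ask the oracle
        return ask_oracle(), True
    return own_choice, False          # confident enough: act autonomously

# Wide interval (0.9 > 0.5): the oracle's suggestion is taken.
choice, used = gated_decision(1.0, 0.1, 0.5, ("a", None), lambda: ("b", None))
```

Because the interval width shrinks as data accumulates, the gate fires less and less often over time, which is what keeps the total number of oracle queries bounded.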

3. Theoretical Guarantees and Regret Analysis

Rigorous analysis yields matching upper and lower regret bounds:

  • Both RLinUCB and GLRB achieve, with high probability,

$$\widetilde{O}( d \sqrt{K T} )$$

recourse regret (Cao et al., 2024, Cao et al., 17 Jan 2026).

  • With an external oracle (expert or LLM) whose proposals are uniformly $\eta$-suboptimal, hybrid algorithms (HR-Bandit and LIBRA) satisfy a warm-start guarantee:

$$\text{Regret}(T) \leq 2 \min\left\{ \eta T,\ (1+\zeta)\, \widetilde{O}(d \sqrt{K T}) \right\}$$

for HR-Bandit, or similar scaling for LIBRA, reflecting acceleration when the external agent is high-quality, but preserving worst-case safety.

  • Effort guarantees: HR-Bandit asks for human input only $O(1)$ times per arm; LIBRA queries the LLM at most $O(\log^2 T)$ times across $T$ rounds, transitioning to full bandit autonomy (Cao et al., 2024, Cao et al., 17 Jan 2026).
  • Robustness guarantee: If the external expert is suboptimal or adversarial, these algorithms never perform worse (up to constants) than their pure bandit counterparts.
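To see how the two branches of the warm-start bound trade off, one can evaluate it numerically. The constant `c` below stands in for the factors hidden by the $\widetilde{O}$ notation, so the numbers are purely illustrative:

```python
import math

def warm_start_bound(T, eta, zeta, d, K, c=1.0):
    """Evaluate 2 * min{eta*T, (1+zeta)*c*d*sqrt(K*T)}, with c standing in
    for the constants and log factors hidden in the O-tilde notation."""
    return 2 * min(eta * T, (1 + zeta) * c * d * math.sqrt(K * T))

# With a decent oracle (eta = 0.05), the linear eta*T branch is smaller
# early on, so the oracle sets the bound; for large T the sqrt(K*T)
# bandit branch takes over and oracle quality no longer matters.
small_T = warm_start_bound(T=100, eta=0.05, zeta=0.1, d=5, K=4)
large_T = warm_start_bound(T=1_000_000, eta=0.05, zeta=0.1, d=5, K=4)
```

This matches the qualitative claim above: a high-quality oracle accelerates the early phase, while the worst-case $\widetilde{O}(d\sqrt{KT})$ rate is preserved asymptotically.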

Lower bounds: Any algorithm, even with oracle hints, must incur

$$\Omega\left( \gamma d_M \sqrt{K T} \,\vee\, \sqrt{d K T} \right)$$

regret in the worst case, confirming that RLinUCB, GLRB, HR-Bandit, and LIBRA are near-optimal up to logarithmic factors.

4. Methodological Comparison

| Algorithm | Expert/LLM Involvement | Regret (Best Case) | Human/LLM Query Complexity |
|-----------|------------------------|--------------------|----------------------------|
| RLinUCB | None | $\widetilde{O}(d\sqrt{KT})$ | 0 |
| HR-Bandit | Human; queried adaptively at high uncertainty | $\widetilde{O}(d\sqrt{KT})$ | $O(1)$ per arm |
| GLRB | None (GLM rewards) | $\widetilde{O}(d\sqrt{KT})$ | 0 |
| LIBRA | LLM; queried adaptively on wide confidence gap | $\widetilde{O}(d\sqrt{KT})$ or $O(\eta T)$ | $O(\log^2 T)$ |

A plausible implication is that hybridization with external experts or LLMs is effective primarily during initial learning phases, after which bandit-based learning dominates performance.

5. Practical Applications and Experimental Findings

Experiments on both synthetic and real healthcare datasets (Cao et al., 2024, Cao et al., 17 Jan 2026) validate the theoretical properties:

  • Healthcare case studies: Tasks involve contexts split into immutable and mutable features (e.g., demographic data and lifestyle behaviors), with linear or GLM reward models per intervention (e.g., treatment/no-treatment, different medications).
    • Baseline (no recourse) algorithms like LinUCB incur linear regret, being unable to optimize mutable features.
    • RLinUCB/GLRB achieve sublinear regret of $\widetilde{O}(d\sqrt{T})$.
    • HR-Bandit and LIBRA further accelerate early-stage learning (20–30% lower regret initially), quickly “warm-start” with domain knowledge, and achieve long-term regret competitive with the bandit-only approach.
    • Human/LLM queries are sharply limited (HR-Bandit: $O(1)$ per arm; LIBRA: $O(\log^2 T)$ over $T$ rounds).
  • Qualitative outcomes: Bandit-based recourse algorithms learn to steer mutable features toward intervention-relevant targets; expert-informed hybrids align initial suggestions with clinical or domain heuristics before fully adapting to empirical feedback.

6. Fundamental Limits and Extensions

Established lower bounds demonstrate that the recourse bandit setting is intrinsically harder than standard contextual bandits when the mutable dimension $d_M$ or recourse budget $\gamma$ is large. The present algorithms, including RLinUCB, GLRB, HR-Bandit, and LIBRA, achieve near-optimal regret up to logarithmic factors (Cao et al., 17 Jan 2026).

Promising directions for further study include scalable optimization procedures for high-dimensional $\ell_\infty$ or structured recourse constraints, non-linear reward settings beyond GLMs, and principled ways to fuse expert/LLM hints under partial observability or adversarial feedback. Robustness to strategic compliance, missing data, and active feature acquisition also remains open.

7. Connections and Significance

Recourse bandits unify algorithmic recourse and contextual bandit learning, offering a formal framework for sequential, counterfactually informed decision-making under real-world constraints. The integration of human expertise or LLM advice with statistically grounded learning achieves fast initialization, sustained improvement, and robust autonomy, especially in settings, such as medicine, where cautious, adaptive personalization is essential. Theoretical upper and lower bounds clarify the fundamental complexity of the task and demonstrate the competitiveness of current methodologies (Cao et al., 2024, Cao et al., 17 Jan 2026).

