
Generalized Linear Recourse Bandit

Updated 24 January 2026
  • Generalized Linear Recourse Bandit (GLRB) is an algorithmic framework for sequential decision-making that combines treatment action selection with minimal, feasible adjustments to mutable features.
  • It leverages a generalized linear model and a biconvex coordinate descent approach to jointly optimize parameter estimates and recourse decisions under bounded constraints.
  • GLRB achieves provably optimal regret bounds and demonstrates superior performance in both synthetic and clinical evaluations compared to traditional bandit methods.

The Generalized Linear Recourse Bandit (GLRB) is an algorithmic framework for sequential decision-making that extends contextual bandits to incorporate algorithmic recourse, specifically focusing on settings where individualized recommendations must combine both action selection and feasible modifications to mutable features. The framework is grounded in a generalized linear model (GLM) reward structure and operationalizes the recourse bandit problem, where the goal is to jointly select a treatment action and propose a minimal, constrained modification to mutable context variables to optimize outcomes. GLRB achieves provably optimal regret bounds and accommodates both practical domain constraints (e.g., clinical plausibility) and robust statistical guarantees (Cao et al., 17 Jan 2026).

1. Problem Statement and Mathematical Formulation

GLRB formalizes the recourse bandit problem in a sequential, contextual setting. At each round $t=1,\ldots,T$, a learner observes a context $x_t = (x_{t,I}, x_{t,M}) \in \mathbb{R}^{d_I} \times \mathbb{R}^{d_M} = \mathbb{R}^d$, partitioned into "immutable" features $x_{t,I}$ and "mutable" features $x_{t,M}$. The learner selects both an action $a \in \mathcal{A}$ (with $|\mathcal{A}| = K$) and a recourse vector $\delta \in \mathbb{R}^{d_M}$ satisfying $\|\delta\| \leq \gamma$, where $\gamma$ is the recourse radius controlling the maximal allowable modification.

The realized context becomes $x_t(\delta) = (x_{t,I}, x_{t,M} + \delta)$. The reward structure adheres to a generalized linear model: if action $a$ is selected at context $x$, the observed reward is

$$r_t = \mu(\theta_a^\top x) + \xi_t,$$

where $\mu$ is a known, strictly increasing, $L_\mu$-Lipschitz link function with $\mu'(\cdot) \geq c_\mu > 0$, $\theta_a \in \mathbb{R}^d$ is an unknown arm parameter, and $\xi_t$ is zero-mean $\sigma$-sub-Gaussian noise. Parameter and context norms are bounded ($\|x\|_2 \leq \beta_\mathcal{H}$, $\|\theta_a\|_2 \leq \beta_\Theta$).

The offline recourse-optimal pair for known parameters solves
$$\max_{a \in \mathcal{A}} \; \max_{\|\delta\| \leq \gamma} \; \mu\bigl(\theta_a^\top (x_{t,I}, x_{t,M} + \delta)\bigr).$$
The recourse regret over $T$ rounds is

$$R_T = \mathbb{E}\left[ \sum_{t=1}^T \left(\mu(\theta_{a^*_t}^\top x_t(\delta^*_t)) - \mu(\theta_{a_t}^\top x_t(\delta_t))\right) \right],$$

where $(a^*_t, \delta^*_t)$ denotes the context-optimal pair.
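As an illustration, the offline oracle above admits a closed form under the Euclidean norm: since $\mu$ is increasing, the best recourse for arm $a$ is $\delta = \gamma\,\theta_{a,M}/\|\theta_{a,M}\|_2$, so the oracle reduces to a maximum over arms. A minimal sketch (the sigmoid link, dimensions, and function name are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def oracle_pair(thetas, x_I, x_M, gamma, mu=sigmoid):
    """Recourse-optimal (arm, delta) for KNOWN parameters, Euclidean norm.

    Because mu is increasing and delta enters only through delta^T theta_M,
    the optimal delta for each arm is gamma * theta_M / ||theta_M||_2.
    """
    d_I = len(x_I)
    best = (-np.inf, None, None)
    for a, theta in enumerate(thetas):
        theta_I, theta_M = theta[:d_I], theta[d_I:]
        norm = np.linalg.norm(theta_M)
        delta = gamma * theta_M / norm if norm > 0 else np.zeros_like(theta_M)
        val = mu(theta_I @ x_I + theta_M @ (x_M + delta))
        if val > best[0]:
            best = (val, a, delta)
    return best  # (optimal mean reward, arm index, recourse vector)
```

The per-round regret is then the gap between this oracle value and the mean reward of the pair the learner actually played.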

2. Algorithmic Structure of GLRB

GLRB integrates contextual parameter estimation with recourse optimization via the following procedural components:

2.1 Parameter Estimation and Confidence Sets

Parameters for each action $a$ are estimated via regularized maximum likelihood estimation (MLE) for GLMs:
$$\hat\theta_{t,a} = \arg\min_\theta \sum_{s \in I_{t,a}} \ell(r_s, x_s^\top \theta) + \frac{\lambda}{2} \|\theta\|_2^2,$$
where $I_{t,a} = \{s < t : a_s = a\}$ and $\ell(r, z) = -rz + m(z)$ with $m' = \mu$. The confidence set for arm $a$ at round $t$ is

$$\Theta_{t,a} = \left\{ \theta : \|\theta - \hat\theta_{t,a}\|_{V_{t,a}} \leq \rho_{t,a} \right\},$$

where $V_{t,a} = \lambda I + \sum_{s \in I_{t,a}} x_s x_s^\top$ and $\rho_{t,a} = O\!\left(\frac{1}{c_\mu}\left(\sigma\sqrt{d\log(1+\beta_\mathcal{H}^2 n_{t,a}/\lambda) + d\log(K/\delta)} + \sqrt{\lambda}\,\beta_\Theta\right)\right)$.
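The estimation step can be sketched for a logistic link, using Newton's method as the inner solver (the paper does not prescribe one) and dropping the absolute constant hidden in the $O(\cdot)$ of $\rho_{t,a}$; all names and hyperparameters here are illustrative:

```python
import numpy as np

def fit_glm_arm(X, r, lam=1.0, iters=50):
    """Ridge-regularized GLM MLE for one arm with logistic link mu(z)=1/(1+e^-z).

    Minimizes sum_s [-r_s z_s + m(z_s)] + (lam/2)||theta||^2 with m' = mu,
    via Newton's method. Returns the estimate and the design matrix V_{t,a}.
    """
    mu = lambda z: 1.0 / (1.0 + np.exp(-z))
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = mu(X @ theta)
        grad = X.T @ (p - r) + lam * theta              # gradient of penalized NLL
        W = p * (1 - p)                                 # per-sample curvature mu'(z)
        H = X.T @ (X * W[:, None]) + lam * np.eye(d)    # Hessian (positive definite)
        theta -= np.linalg.solve(H, grad)
    V = lam * np.eye(d) + X.T @ X
    return theta, V

def conf_radius(n_a, d, K, sigma, beta_H, beta_Theta, lam, c_mu, delta=0.05):
    """rho_{t,a} up to the absolute constant hidden in the O(.) notation."""
    return (sigma * np.sqrt(d * np.log(1 + beta_H**2 * n_a / lam)
                            + d * np.log(K / delta))
            + np.sqrt(lam) * beta_Theta) / c_mu
```

Note that the radius grows only logarithmically in the per-arm sample count $n_{t,a}$, while $\|\theta - \hat\theta_{t,a}\|_{V_{t,a}}$ is measured in a norm that tightens as data accumulates.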

2.2 Optimistic Recourse Optimization (ORO-Arm)

For each arm $a$ at round $t$, GLRB solves a biconvex maximization:
$$(\delta_{t,a}, \theta_{t,a}) \in \arg\max_{\|\delta\| \leq \gamma,\; \theta \in \Theta_{t,a}} (x_{t,I}, x_{t,M} + \delta)^\top \theta.$$
This is performed by two-block coordinate descent, alternating between closed-form updates for $\theta$ and $\delta$:

  • Fix $\delta$, update $\theta \leftarrow \hat\theta_{t,a} + \rho_{t,a}\,\partial\|(x_{t,I}, x_{t,M}+\delta)\|_{V_{t,a}^{-1}}$,
  • Fix $\theta$, update $\delta \leftarrow \gamma\,\partial\|\theta_M\|_*$, with $\partial f$ denoting a subgradient of $f$ and $\|\cdot\|_*$ the dual norm.
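The alternating updates can be sketched for Euclidean norms, where both blocks are smooth and have the closed forms $\theta \leftarrow \hat\theta + \rho\,V^{-1}x/\|x\|_{V^{-1}}$ and $\delta \leftarrow \gamma\,\theta_M/\|\theta_M\|_2$; the iteration count and inputs below are illustrative:

```python
import numpy as np

def oro_arm(theta_hat, V, rho, x_I, x_M, gamma, iters=20):
    """Optimistic recourse optimization for one arm (Euclidean norms).

    Alternates: (i) the maximizer of x^T theta over the ellipsoid
    ||theta - theta_hat||_V <= rho, and (ii) the maximizer of
    delta^T theta_M over the ball ||delta||_2 <= gamma.
    """
    d_I = len(x_I)
    V_inv = np.linalg.inv(V)
    delta = np.zeros(len(x_M))
    for _ in range(iters):
        x = np.concatenate([x_I, x_M + delta])
        Vx = V_inv @ x
        theta = theta_hat + rho * Vx / np.sqrt(x @ Vx)   # theta-step
        theta_M = theta[d_I:]
        n = np.linalg.norm(theta_M)
        if n > 0:
            delta = gamma * theta_M / n                  # delta-step
    x = np.concatenate([x_I, x_M + delta])
    return delta, theta, x @ theta  # optimistic linear index for this arm
```

Each step is a best response for one block with the other fixed, so the biconvex objective is monotonically non-decreasing along the iterates.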

2.3 Action and Recourse Selection, and Update

From candidate pairs {(a,δt,a)}\{(a, \delta_{t,a})\}, the pair maximizing the optimistic objective is selected, the chosen context modification is implemented, and the reward observed. Parameter estimates and covariance matrices are updated accordingly.
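The covariance update at the end of each round can be done incrementally. A sketch of the standard rank-one inverse update via the Sherman–Morrison formula (the paper does not specify this implementation detail):

```python
import numpy as np

def sherman_morrison_update(V_inv, x):
    """Incrementally compute (V + x x^T)^{-1} from V^{-1} after observing
    the implemented context x, avoiding a full O(d^3) re-inversion."""
    Vx = V_inv @ x
    return V_inv - np.outer(Vx, Vx) / (1.0 + x @ Vx)
```

This keeps the per-round cost of maintaining $V_{t,a}^{-1}$ at $O(d^2)$ for the chosen arm.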

3. Theoretical Guarantees

GLRB achieves optimal regret properties under standard boundedness and regularity assumptions:

  • High-Probability Confidence: With probability at least $1-\delta$, the true parameter $\theta_a^*$ for each arm remains within the confidence set $\Theta_{t,a}$ for all $t, a$.
  • Regret Bound: The recourse regret satisfies

$$R_T \leq 2L_\mu \rho_T \sqrt{2dKT\log\left(1 + \beta_\mathcal{H}^2 T/(\lambda d)\right)} = \tilde{O}(d\sqrt{KT})$$

with probability at least $1-\delta$.

  • Matching Lower Bound: Any algorithm must incur

$$R_T = \Omega\bigl(\gamma d_M \sqrt{KT} \vee \sqrt{dKT}\bigr),$$

and, with an $\eta$-suboptimal recourse oracle, $\Omega\bigl((\gamma d_M \sqrt{KT} \vee \sqrt{dKT}) \wedge \eta T\bigr)$.

  • Proof Techniques: Proofs leverage self-normalized martingale bounds for GLM MLE, the elliptical potential lemma for bounding context uncertainty, and optimism guarantees within confidence sets.

4. Empirical Evaluation

Empirical validation compares GLRB to LinUCB, an LLM-only approach, and LIBRA (which leverages LLMs for warm-starting):

  • Synthetic Evaluation (d = 20, K = 10): GLRB demonstrates sublinear regret, outperforming LinUCB (linear regret growth) and LLM-only (bias-induced suboptimality).
  • Clinical Case Study (ACCORD dataset, hypertension management): Immutable features include age, gender, and cardiovascular disease history; mutable features are DietScore and PhyActHours. The action set includes two treatment regimens (β-blocker and ACE+CCB+diuretic). The outcome metric is $170 - \mathrm{SBP}_{\text{next visit}}$. GLRB achieves superior cumulative regret and systolic blood pressure reduction, with LIBRA further enhancing early-phase performance through GPT-based warm-starting.
  • Key Metrics: Recourse regret, sample efficiency, SBP drop, LLM query count (for LIBRA).

5. Model Assumptions and Practical Implementation

GLRB is based on the following conditions and addresses critical implementation considerations:

  • Reward Model Assumptions: The link function $\mu$ is strictly increasing and $L_\mu$-Lipschitz, with derivative $\mu' \geq c_\mu > 0$. The noise $\xi_t$ is conditionally $\sigma$-sub-Gaussian.
  • Feasibility: Contexts and parameters are bounded; the recourse radius $\gamma$ constrains feasible modifications.
  • Norms: Typically Euclidean or Mahalanobis; for general semialgebraic norms, the coordinate-descent solver converges to a critical point under the Kurdyka–Łojasiewicz property.
  • Algorithmic Efficiency: ORO subproblems per arm are solved efficiently via coordinate descent or semi-definite programming for Mahalanobis norms.
  • Parameter Calibration: $\gamma$ is chosen for domain plausibility; regularization and error thresholds are set by cross-validation.
  • Human-in-the-Loop: Confidence radii may trigger domain-expert intervention, as operationalized in the LIBRA extension.
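For the Mahalanobis case mentioned above, the $\delta$-step of the coordinate-descent solver still has a closed form: with $\|\delta\|_M = \sqrt{\delta^\top M \delta}$ the dual norm is $\|\theta\|_{M^{-1}}$, and the maximizer of $\delta^\top \theta_M$ over the ball is $\gamma\,M^{-1}\theta_M / \|\theta_M\|_{M^{-1}}$. A sketch (the matrix $M$ and inputs are illustrative):

```python
import numpy as np

def recourse_step_mahalanobis(theta_M_vec, M, gamma):
    """Closed-form delta-step under the Mahalanobis recourse norm:
    maximize delta^T theta over ||delta||_M <= gamma, whose solution is
    gamma * M^{-1} theta / ||theta||_{M^{-1}} (a Lagrangian computation)."""
    Minv_t = np.linalg.solve(M, theta_M_vec)
    dual = np.sqrt(theta_M_vec @ Minv_t)  # dual norm ||theta||_{M^{-1}}
    if dual == 0:
        return np.zeros_like(theta_M_vec)
    return gamma * Minv_t / dual
```

The attained value is $\gamma\,\|\theta_M\|_{M^{-1}}$, which is why the dual norm appears in the general update $\delta \leftarrow \gamma\,\partial\|\theta_M\|_*$.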

6. Extensions, Applications, and Broader Implications

GLRB supports multiple extensions and is amenable to deployment in high-stakes, recourse-relevant sectors:

  • Model and Environment Extensions:
    • Non-compliance modeling (adversarial and random), with robust/stochastic ORO variants.
    • Nonlinear and nonparametric reward modeling.
    • Meta-bandit approaches for expert oracle (e.g., multiple LLM) selection.
    • Multi-objective optimization balancing recourse cost, rewards, and preferences.
  • Practical Deployment:
    • Efficient ORO solvers, rigorous parameter selection, and monitoring for autonomy thresholds.
  • Applicability: Personalized medicine, credit allocation, and hiring practices benefit from algorithms capable of simultaneous action and feasible, minimal recourse proposals, with provable regret and fallback guarantees.

GLRB thus provides a principled approach to bandit learning in complex environments where recourse is not only possible but critical, integrating statistical rigor with actionable recommendations to facilitate trustworthy decision-making in sensitive domains (Cao et al., 17 Jan 2026).
