
Generalized Linear Recourse Bandit

Updated 24 January 2026
  • Generalized Linear Recourse Bandit (GLRB) is an algorithmic framework for sequential decision-making that combines treatment action selection with minimal, feasible adjustments to mutable features.
  • It leverages a generalized linear model and a biconvex coordinate descent approach to jointly optimize parameter estimates and recourse decisions under bounded constraints.
  • GLRB achieves provably optimal regret bounds and demonstrates superior performance in both synthetic and clinical evaluations compared to traditional bandit methods.

The Generalized Linear Recourse Bandit (GLRB) is an algorithmic framework for sequential decision-making that extends contextual bandits to incorporate algorithmic recourse, specifically focusing on settings where individualized recommendations must combine both action selection and feasible modifications to mutable features. The framework is grounded in a generalized linear model (GLM) reward structure and operationalizes the recourse bandit problem, where the goal is to jointly select a treatment action and propose a minimal, constrained modification to mutable context variables to optimize outcomes. GLRB achieves provably optimal regret bounds and accommodates both practical domain constraints (e.g., clinical plausibility) and robust statistical guarantees (Cao et al., 17 Jan 2026).

1. Problem Statement and Mathematical Formulation

GLRB formalizes the recourse bandit problem in a sequential, contextual setting. At each round $t=1,\ldots,T$, a learner observes a context $x_t = (x_{t,I}, x_{t,M}) \in \mathbb{R}^{d_I} \times \mathbb{R}^{d_M} = \mathbb{R}^d$, partitioned into "immutable" features $x_{t,I}$ and "mutable" features $x_{t,M}$. The learner selects both an action $a \in \mathcal{A}$ (with $|\mathcal{A}| = K$) and a recourse vector $\delta \in \mathbb{R}^{d_M}$ satisfying $\|\delta\| \leq \gamma$, where $\gamma$ is the recourse radius controlling the maximal allowable modification.

The realized context becomes $x_t(\delta) = (x_{t,I}, x_{t,M} + \delta)$. The reward structure adheres to a generalized linear model: if action $a$ is selected at context $x$, the observed reward is

$$r_t = \mu(\theta_a^\top x) + \xi_t,$$

where $\mu$ is a known, strictly increasing, $L_\mu$-Lipschitz link function with $\mu'(\cdot) \geq c_\mu > 0$, $\theta_a \in \mathbb{R}^d$ is an unknown arm parameter, and $\xi_t$ is zero-mean $\sigma$-sub-Gaussian noise. Parameter and context norms are bounded ($\|x\|_2 \leq \beta_\mathcal{H}$, $\|\theta_a\|_2 \leq \beta_\Theta$).

The offline recourse-optimal pair for known parameters solves
$$\max_{a \in \mathcal{A}} \; \max_{\|\delta\| \leq \gamma} \; \mu\bigl(\theta_a^\top (x_{t,I}, x_{t,M} + \delta)\bigr).$$
The recourse regret over $T$ rounds is

$$R_T = \mathbb{E}\left[ \sum_{t=1}^T \left(\mu(\theta_{a^*_t}^\top x_t(\delta^*_t)) - \mu(\theta_{a_t}^\top x_t(\delta_t))\right) \right],$$

where $(a^*_t, \delta^*_t)$ denotes the context-optimal pair.
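As an illustration, the offline oracle above admits a closed form under the Euclidean norm: since $\mu$ is increasing, the best recourse for arm $a$ is $\delta = \gamma\,\theta_{a,M}/\|\theta_{a,M}\|_2$, so the oracle reduces to a maximum over arms. A minimal sketch (the sigmoid link, dimensions, and function name are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def oracle_pair(thetas, x_I, x_M, gamma, mu=sigmoid):
    """Recourse-optimal (arm, delta) for KNOWN parameters, Euclidean norm.

    Because mu is increasing and delta enters only through delta^T theta_M,
    the optimal delta for each arm is gamma * theta_M / ||theta_M||_2.
    """
    d_I = len(x_I)
    best = (-np.inf, None, None)
    for a, theta in enumerate(thetas):
        theta_I, theta_M = theta[:d_I], theta[d_I:]
        norm = np.linalg.norm(theta_M)
        delta = gamma * theta_M / norm if norm > 0 else np.zeros_like(theta_M)
        val = mu(theta_I @ x_I + theta_M @ (x_M + delta))
        if val > best[0]:
            best = (val, a, delta)
    return best  # (optimal mean reward, arm index, recourse vector)
```

The per-round regret is then the gap between this oracle value and the mean reward of the pair the learner actually played.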

2. Algorithmic Structure of GLRB

GLRB integrates contextual parameter estimation with recourse optimization via the following procedural components:

2.1 Parameter Estimation and Confidence Sets

Parameters for each action $a$ are estimated via regularized maximum likelihood estimation (MLE) for GLMs:
$$\hat\theta_{t,a} = \arg\min_\theta \sum_{s \in I_{t,a}} \ell(r_s, x_s^\top \theta) + \frac{\lambda}{2} \|\theta\|_2^2,$$
where $I_{t,a} = \{s < t : a_s = a\}$ and $\ell(r, z) = -rz + m(z)$ with $m' = \mu$. The confidence set for arm $a$ at round $t$ is

$$\Theta_{t,a} = \left\{ \theta : \|\theta - \hat\theta_{t,a}\|_{V_{t,a}} \leq \rho_{t,a} \right\},$$

where $V_{t,a} = \lambda I + \sum_{s \in I_{t,a}} x_s x_s^\top$ and $\rho_{t,a} = O\!\left(\frac{1}{c_\mu}\left(\sigma\sqrt{d\log(1+\beta_\mathcal{H}^2 n_{t,a}/\lambda) + d\log(K/\delta)} + \sqrt{\lambda}\,\beta_\Theta\right)\right)$.
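The estimation step can be sketched for a logistic link, using Newton's method as the inner solver (the paper does not prescribe one) and dropping the absolute constant hidden in the $O(\cdot)$ of $\rho_{t,a}$; all names and hyperparameters here are illustrative:

```python
import numpy as np

def fit_glm_arm(X, r, lam=1.0, iters=50):
    """Ridge-regularized GLM MLE for one arm with logistic link mu(z)=1/(1+e^-z).

    Minimizes sum_s [-r_s z_s + m(z_s)] + (lam/2)||theta||^2 with m' = mu,
    via Newton's method. Returns the estimate and the design matrix V_{t,a}.
    """
    mu = lambda z: 1.0 / (1.0 + np.exp(-z))
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = mu(X @ theta)
        grad = X.T @ (p - r) + lam * theta              # gradient of penalized NLL
        W = p * (1 - p)                                 # per-sample curvature mu'(z)
        H = X.T @ (X * W[:, None]) + lam * np.eye(d)    # Hessian (positive definite)
        theta -= np.linalg.solve(H, grad)
    V = lam * np.eye(d) + X.T @ X
    return theta, V

def conf_radius(n_a, d, K, sigma, beta_H, beta_Theta, lam, c_mu, delta=0.05):
    """rho_{t,a} up to the absolute constant hidden in the O(.) notation."""
    return (sigma * np.sqrt(d * np.log(1 + beta_H**2 * n_a / lam)
                            + d * np.log(K / delta))
            + np.sqrt(lam) * beta_Theta) / c_mu
```

Note that the radius grows only logarithmically in the per-arm sample count $n_{t,a}$, while $\|\theta - \hat\theta_{t,a}\|_{V_{t,a}}$ is measured in a norm that tightens as data accumulates.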

2.2 Optimistic Recourse Optimization (ORO-Arm)

For each arm $a$ at round $t$, GLRB solves a biconvex maximization:
$$(\delta_{t,a}, \theta_{t,a}) \in \arg\max_{\|\delta\| \leq \gamma,\; \theta \in \Theta_{t,a}} (x_{t,I}, x_{t,M} + \delta)^\top \theta.$$
This is performed by two-block coordinate descent, alternating between closed-form updates for $\theta$ and $\delta$:

  • Fix $\delta$, update $\theta \leftarrow \hat\theta_{t,a} + \rho_{t,a}\,\partial\|(x_{t,I}, x_{t,M}+\delta)\|_{V_{t,a}^{-1}}$,
  • Fix $\theta$, update $\delta \leftarrow \gamma\,\partial\|\theta_M\|_*$, with $\partial f$ denoting a subgradient of $f$ and $\|\cdot\|_*$ the dual norm.
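The alternating updates can be sketched for Euclidean norms, where both blocks are smooth and have the closed forms $\theta \leftarrow \hat\theta + \rho\,V^{-1}x/\|x\|_{V^{-1}}$ and $\delta \leftarrow \gamma\,\theta_M/\|\theta_M\|_2$; the iteration count and inputs below are illustrative:

```python
import numpy as np

def oro_arm(theta_hat, V, rho, x_I, x_M, gamma, iters=20):
    """Optimistic recourse optimization for one arm (Euclidean norms).

    Alternates: (i) the maximizer of x^T theta over the ellipsoid
    ||theta - theta_hat||_V <= rho, and (ii) the maximizer of
    delta^T theta_M over the ball ||delta||_2 <= gamma.
    """
    d_I = len(x_I)
    V_inv = np.linalg.inv(V)
    delta = np.zeros(len(x_M))
    for _ in range(iters):
        x = np.concatenate([x_I, x_M + delta])
        Vx = V_inv @ x
        theta = theta_hat + rho * Vx / np.sqrt(x @ Vx)   # theta-step
        theta_M = theta[d_I:]
        n = np.linalg.norm(theta_M)
        if n > 0:
            delta = gamma * theta_M / n                  # delta-step
    x = np.concatenate([x_I, x_M + delta])
    return delta, theta, x @ theta  # optimistic linear index for this arm
```

Each step is a best response for one block with the other fixed, so the biconvex objective is monotonically non-decreasing along the iterates.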

2.3 Action and Recourse Selection, and Update

From candidate pairs {(a,δt,a)}\{(a, \delta_{t,a})\}, the pair maximizing the optimistic objective is selected, the chosen context modification is implemented, and the reward observed. Parameter estimates and covariance matrices are updated accordingly.
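The covariance update at the end of each round can be done incrementally. A sketch of the standard rank-one inverse update via the Sherman–Morrison formula (the paper does not specify this implementation detail):

```python
import numpy as np

def sherman_morrison_update(V_inv, x):
    """Incrementally compute (V + x x^T)^{-1} from V^{-1} after observing
    the implemented context x, avoiding a full O(d^3) re-inversion."""
    Vx = V_inv @ x
    return V_inv - np.outer(Vx, Vx) / (1.0 + x @ Vx)
```

This keeps the per-round cost of maintaining $V_{t,a}^{-1}$ at $O(d^2)$ for the chosen arm.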

3. Theoretical Guarantees

GLRB achieves optimal regret properties under standard boundedness and regularity assumptions:

  • High-Probability Confidence: With probability at least $1-\delta$, the true parameter $\theta_a^*$ for each arm remains within the confidence set $\Theta_{t,a}$ for all $t, a$.
  • Regret Bound: The recourse regret satisfies

$$R_T \leq 2L_\mu \rho_T \sqrt{2dKT\log\left(1 + \beta_\mathcal{H}^2 T/(\lambda d)\right)} = \tilde{O}(d\sqrt{KT})$$

with probability at least $1-\delta$.

  • Matching Lower Bound: Any algorithm must incur

$$R_T = \Omega\bigl(\gamma d_M \sqrt{KT} \vee \sqrt{dKT}\bigr),$$

and, with an $\eta$-suboptimal recourse oracle, $\Omega\bigl((\gamma d_M \sqrt{KT} \vee \sqrt{dKT}) \wedge \eta T\bigr)$.

  • Proof Techniques: Proofs leverage self-normalized martingale bounds for GLM MLE, the elliptical potential lemma for bounding context uncertainty, and optimism guarantees within confidence sets.

4. Empirical Evaluation

Empirical validation compares GLRB to LinUCB, an LLM-only approach, and LIBRA (which leverages LLMs for warm-starting):

  • Synthetic Evaluation (d = 20, K = 10): GLRB demonstrates sublinear regret, outperforming LinUCB (linear regret growth) and LLM-only (bias-induced suboptimality).
  • Clinical Case Study (ACCORD dataset, hypertension management): Immutable features include age, gender, and cardiovascular disease history; mutable features are DietScore and PhyActHours. The action set includes two treatment regimens (β-blocker and ACE+CCB+diuretic). The outcome metric is $170 - \mathrm{SBP}_{\text{next visit}}$. GLRB achieves superior cumulative regret and systolic blood pressure reduction, with LIBRA further enhancing early-phase performance through GPT-based warm-starting.
  • Key Metrics: Recourse regret, sample efficiency, SBP drop, LLM query count (for LIBRA).

5. Model Assumptions and Practical Implementation

GLRB is based on the following conditions and addresses critical implementation considerations:

  • Reward Model Assumptions: The link function $\mu$ is strictly increasing and $L_\mu$-Lipschitz, with derivative $\mu' \geq c_\mu > 0$. The noise $\xi_t$ is conditionally $\sigma$-sub-Gaussian.
  • Feasibility: Contexts and parameters are bounded; the recourse radius $\gamma$ constrains feasible modifications.
  • Norms: Typically Euclidean or Mahalanobis; for general semialgebraic norms, the coordinate-descent solver converges to a critical point under the Kurdyka–Łojasiewicz property.
  • Algorithmic Efficiency: ORO subproblems per arm are solved efficiently via coordinate descent or semi-definite programming for Mahalanobis norms.
  • Parameter Calibration: $\gamma$ is chosen for domain plausibility; regularization and error thresholds are set by cross-validation.
  • Human-in-the-Loop: Confidence radii may trigger domain-expert intervention, as operationalized in the LIBRA extension.
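For the Mahalanobis case mentioned above, the $\delta$-step of the coordinate-descent solver still has a closed form: with $\|\delta\|_M = \sqrt{\delta^\top M \delta}$ the dual norm is $\|\theta\|_{M^{-1}}$, and the maximizer of $\delta^\top \theta_M$ over the ball is $\gamma\,M^{-1}\theta_M / \|\theta_M\|_{M^{-1}}$. A sketch (the matrix $M$ and inputs are illustrative):

```python
import numpy as np

def recourse_step_mahalanobis(theta_M_vec, M, gamma):
    """Closed-form delta-step under the Mahalanobis recourse norm:
    maximize delta^T theta over ||delta||_M <= gamma, whose solution is
    gamma * M^{-1} theta / ||theta||_{M^{-1}} (a Lagrangian computation)."""
    Minv_t = np.linalg.solve(M, theta_M_vec)
    dual = np.sqrt(theta_M_vec @ Minv_t)  # dual norm ||theta||_{M^{-1}}
    if dual == 0:
        return np.zeros_like(theta_M_vec)
    return gamma * Minv_t / dual
```

The attained value is $\gamma\,\|\theta_M\|_{M^{-1}}$, which is why the dual norm appears in the general update $\delta \leftarrow \gamma\,\partial\|\theta_M\|_*$.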

6. Extensions, Applications, and Broader Implications

GLRB supports multiple extensions and is amenable to deployment in high-stakes, recourse-relevant sectors:

  • Model and Environment Extensions:
    • Non-compliance modeling (adversarial and random), with robust/stochastic ORO variants.
    • Nonlinear and nonparametric reward modeling.
    • Meta-bandit approaches for expert oracle (e.g., multiple LLM) selection.
    • Multi-objective optimization balancing recourse cost, rewards, and preferences.
  • Practical Deployment:
    • Efficient ORO solvers, rigorous parameter selection, and monitoring for autonomy thresholds.
  • Applicability: Personalized medicine, credit allocation, and hiring practices benefit from algorithms capable of simultaneous action and feasible, minimal recourse proposals, with provable regret and fallback guarantees.

GLRB thus provides a principled approach to bandit learning in complex environments where recourse is not only possible but critical, integrating statistical rigor with actionable recommendations to facilitate trustworthy decision-making in sensitive domains (Cao et al., 17 Jan 2026).
