Generalized Linear Recourse Bandit
- Generalized Linear Recourse Bandit (GLRB) is an algorithmic framework for sequential decision-making that combines treatment action selection with minimal, feasible adjustments to mutable features.
- It leverages a generalized linear model and a biconvex coordinate descent approach to jointly optimize parameter estimates and recourse decisions under bounded constraints.
- GLRB achieves provably optimal regret bounds and demonstrates superior performance in both synthetic and clinical evaluations compared to traditional bandit methods.
The Generalized Linear Recourse Bandit (GLRB) is an algorithmic framework for sequential decision-making that extends contextual bandits to incorporate algorithmic recourse, specifically focusing on settings where individualized recommendations must combine both action selection and feasible modifications to mutable features. The framework is grounded in a generalized linear model (GLM) reward structure and operationalizes the recourse bandit problem, where the goal is to jointly select a treatment action and propose a minimal, constrained modification to mutable context variables to optimize outcomes. GLRB achieves provably optimal regret bounds and accommodates both practical domain constraints (e.g., clinical plausibility) and robust statistical guarantees (Cao et al., 17 Jan 2026).
1. Problem Statement and Mathematical Formulation
GLRB formalizes the recourse bandit problem in a sequential, contextual setting. At each round $t = 1, \dots, T$, a learner observes a context $x_t \in \mathbb{R}^d$, partitioned into "immutable" features $u_t$ and "mutable" features $v_t$. The learner selects both an action $a_t \in [K]$ and a recourse vector $\delta_t$ satisfying $\|\delta_t\| \le \rho$, where $\rho > 0$ is the recourse radius controlling the maximal allowable modification.
The realized context becomes $\tilde{x}_t = (u_t, v_t + \delta_t)$. The reward structure adheres to a generalized linear model: if action $a_t$ is selected at realized context $\tilde{x}_t$, the observed reward is
$$r_t = \mu\big(\langle \theta_{a_t}, \tilde{x}_t \rangle\big) + \eta_t,$$
where $\mu$ is a known, strictly increasing, $L$-Lipschitz link function with $\inf_z \mu'(z) \ge \kappa > 0$, $\theta_{a_t} \in \mathbb{R}^d$ is an unknown arm parameter, and $\eta_t$ is zero-mean $\sigma$-sub-Gaussian noise. Parameter and context norms are bounded ($\|\theta_a\|_2 \le S$, $\|x_t\|_2 \le X$).
The offline recourse-optimal pair for known parameters solves
$$\big(a^\star(x), \delta^\star(x)\big) \in \arg\max_{a \in [K],\ \|\delta\| \le \rho} \mu\big(\langle \theta_a, (u, v + \delta) \rangle\big).$$
The recourse regret over $T$ rounds is
$$R(T) = \sum_{t=1}^{T} \Big[ \mu\big(\langle \theta_{a_t^\star}, (u_t, v_t + \delta_t^\star) \rangle\big) - \mu\big(\langle \theta_{a_t}, (u_t, v_t + \delta_t) \rangle\big) \Big],$$
where $(a_t^\star, \delta_t^\star)$ denotes the context-optimal pair at round $t$.
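As a concrete illustration, this interaction protocol can be simulated with a minimal sketch. All dimensions, the logistic link, and the randomly drawn parameters below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d_u, d_v, K = 3, 2, 4            # immutable dim, mutable dim, number of arms
d = d_u + d_v
rho, sigma = 0.5, 0.1            # recourse radius, noise scale
theta = rng.normal(size=(K, d))  # arm parameters (unknown to the learner)
theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # bounded norms

def mu(z):
    """Logistic link: strictly increasing and (1/4)-Lipschitz."""
    return 1.0 / (1.0 + np.exp(-z))

def step(x, a, delta):
    """Apply recourse delta to the mutable block of x, then draw a GLM reward."""
    assert np.linalg.norm(delta) <= rho + 1e-9, "recourse exceeds radius"
    x_tilde = x.copy()
    x_tilde[d_u:] += delta       # only the mutable features are modified
    reward = mu(theta[a] @ x_tilde) + sigma * rng.normal()
    return x_tilde, reward

x = rng.normal(size=d)
x_tilde, reward = step(x, a=0, delta=np.zeros(d_v))
```

The key structural point is that recourse acts only on the trailing (mutable) coordinates, while the reward is generated from the realized context $\tilde{x}_t$.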
2. Algorithmic Structure of GLRB
GLRB integrates contextual parameter estimation with recourse optimization via the following procedural components:
2.1 Parameter Estimation and Confidence Sets
Parameters for each action $a$ are estimated via regularized maximum likelihood estimation (MLE) for GLMs:
$$\hat{\theta}_{a,t} \in \arg\min_{\theta}\ \sum_{s \le t:\, a_s = a} \big[ b(\langle \theta, \tilde{x}_s \rangle) - r_s \langle \theta, \tilde{x}_s \rangle \big] + \frac{\lambda}{2} \|\theta\|_2^2,$$
where $b$ is the log-partition function with $b' = \mu$ and $\lambda > 0$ is a regularization parameter. The confidence set for arm $a$ at round $t$ is
$$\mathcal{C}_{a,t} = \big\{ \theta : \|\theta - \hat{\theta}_{a,t}\|_{V_{a,t}} \le \beta_t \big\},$$
where $V_{a,t} = \lambda I + \sum_{s \le t:\, a_s = a} \tilde{x}_s \tilde{x}_s^\top$ and $\beta_t = O\big(\tfrac{\sigma}{\kappa}\sqrt{d \log(t/\delta)} + \sqrt{\lambda}\, S\big)$.
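Under a logistic link, the regularized MLE and the design matrix entering the confidence set can be sketched as follows. Newton's method is one standard solver for this objective; the function names and the default regularizer here are illustrative:

```python
import numpy as np

def glm_mle(X, r, lam=1.0, iters=50):
    """Regularized MLE for a logistic-link GLM arm via Newton's method.

    Minimizes sum_s [b(<theta, x_s>) - r_s <theta, x_s>] + (lam/2)||theta||^2
    with b'(z) = mu(z) = 1/(1 + e^{-z}); a sketch of the estimation step.
    """
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        z = X @ theta
        p = 1.0 / (1.0 + np.exp(-z))          # mu(z)
        grad = X.T @ (p - r) + lam * theta
        W = p * (1.0 - p)                     # mu'(z), the Newton weights
        H = X.T @ (X * W[:, None]) + lam * np.eye(d)
        theta -= np.linalg.solve(H, grad)
    return theta

def gram(X, lam=1.0):
    """Regularized design matrix V = lam I + sum_s x_s x_s^T used in ||.||_V."""
    return lam * np.eye(X.shape[1]) + X.T @ X
```

The matrix returned by `gram` defines the ellipsoidal geometry of the confidence set around the fitted parameter.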
2.2 Optimistic Recourse Optimization (ORO-Arm)
For each arm $a$ at round $t$, GLRB solves a biconvex maximization:
$$\max_{\theta \in \mathcal{C}_{a,t},\ \|\delta\| \le \rho}\ \mu\big(\langle \theta, (u_t, v_t + \delta) \rangle\big).$$
This is performed by two-block coordinate descent, alternating between closed-form updates for $\theta$ and $\delta$:
- Fix $\delta$, update $\theta \leftarrow \hat{\theta}_{a,t} + \beta_t \, V_{a,t}^{-1}\tilde{x} / \|\tilde{x}\|_{V_{a,t}^{-1}}$ with $\tilde{x} = (u_t, v_t + \delta)$,
- Fix $\theta$, update $\delta \leftarrow \rho\, g$ with $g \in \partial \|\theta_v\|_*$, where $g$ denotes a subgradient of the dual norm $\|\cdot\|_*$ at the mutable block $\theta_v$ of $\theta$.
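For the Euclidean norm, the two-block coordinate descent admits the closed-form updates described above. A minimal sketch (function name, iteration count, and convergence handling are illustrative); since the link is increasing, it suffices to maximize the inner product:

```python
import numpy as np

def oro_arm(x, theta_hat, V, beta, d_u, rho, iters=25):
    """Two-block coordinate descent for the optimistic recourse objective
    max_{theta in C, ||delta|| <= rho} <theta, x + (0, delta)> (Euclidean case).
    """
    Vinv = np.linalg.inv(V)
    delta = np.zeros(x.size - d_u)
    for _ in range(iters):
        # theta-step: maximize <theta, x_tilde> over the ellipsoid C
        x_tilde = x.copy()
        x_tilde[d_u:] += delta
        direction = Vinv @ x_tilde
        theta = theta_hat + beta * direction / np.sqrt(x_tilde @ direction)
        # delta-step: align with the mutable block of theta (dual-norm update)
        g = theta[d_u:]
        norm_g = np.linalg.norm(g)
        delta = rho * g / norm_g if norm_g > 0 else delta
    x_tilde = x.copy()
    x_tilde[d_u:] += delta
    return theta, delta, float(theta @ x_tilde)
```

Each block has a closed-form maximizer, so every alternation weakly increases the objective; the Euclidean case makes the dual-norm step a simple rescaling of the mutable coefficients.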
2.3 Action and Recourse Selection, and Update
From candidate pairs , the pair maximizing the optimistic objective is selected, the chosen context modification is implemented, and the reward observed. Parameter estimates and covariance matrices are updated accordingly.
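The selection and update step can be sketched as follows, assuming the per-arm optimistic values have already been computed by an ORO solver (the candidate data structure and function names here are illustrative assumptions):

```python
import numpy as np

def glrb_round(x, cands, d_u):
    """Select the (arm, recourse) pair with the largest optimistic value.

    `cands` maps arm index -> (delta, optimistic_value), assumed precomputed
    per arm; returns the chosen arm and the realized context x_tilde.
    """
    a = max(cands, key=lambda k: cands[k][1])
    delta, _ = cands[a]
    x_tilde = x.copy()
    x_tilde[d_u:] += delta
    return a, x_tilde

def update_arm(V, x_tilde):
    """Rank-one design-matrix update V <- V + x_tilde x_tilde^T for the arm played."""
    return V + np.outer(x_tilde, x_tilde)
```

Only the statistics of the arm actually played are updated; the realized (post-recourse) context, not the raw one, enters the update, matching the reward model.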
3. Theoretical Guarantees
GLRB achieves optimal regret properties under standard boundedness and regularity assumptions:
- High-Probability Confidence: With probability at least $1 - \delta$, the true parameter $\theta_a$ for each arm $a$ remains within the confidence set $\mathcal{C}_{a,t}$ for all rounds $t \le T$.
- Regret Bound: The recourse regret satisfies
$$R(T) = \widetilde{O}\!\Big( \frac{L}{\kappa}\, d \sqrt{T} \Big)$$
with probability at least $1 - \delta$.
- Matching Lower Bound: Any algorithm must incur $R(T) = \Omega(d\sqrt{T})$ on some instance, so the upper bound is tight up to logarithmic factors; with an $\varepsilon$-suboptimal recourse oracle, $R(T) = \Omega(d\sqrt{T} + \varepsilon T)$.
- Proof Techniques: Proofs leverage self-normalized martingale bounds for GLM MLE, the elliptical potential lemma for bounding context uncertainty, and optimism guarantees within confidence sets.
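In standard notation, with $\tilde{\theta}_t$ the optimistic parameter selected for the chosen arm, the per-round optimism argument underlying such bounds can be sketched as:

```latex
% If \theta_a \in \mathcal{C}_{a,t} for all arms, optimism gives
\begin{aligned}
\mathrm{reg}_t
  &= \mu\big(\langle \theta_{a_t^\star}, \tilde{x}_t^\star \rangle\big)
   - \mu\big(\langle \theta_{a_t}, \tilde{x}_t \rangle\big)
   \le \mu\big(\langle \tilde{\theta}_t, \tilde{x}_t \rangle\big)
   - \mu\big(\langle \theta_{a_t}, \tilde{x}_t \rangle\big) \\
  &\le L \,\langle \tilde{\theta}_t - \theta_{a_t}, \tilde{x}_t \rangle
   \le 2 L \beta_T \,\lVert \tilde{x}_t \rVert_{V_{a_t,t}^{-1}},
\end{aligned}
% after which the elliptical potential lemma controls the summed widths
% \sum_{t=1}^{T} \lVert \tilde{x}_t \rVert_{V_{a_t,t}^{-1}}^{2} = O(dK \log T).
```

Combining the per-round width bound with Cauchy–Schwarz over $T$ rounds yields the $\sqrt{T}$ scaling of the regret.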
4. Empirical Evaluation
Empirical validation compares GLRB to LinUCB, an LLM-only approach, and LIBRA (which leverages LLMs for warm-starting):
- Synthetic Evaluation (d = 20, K = 10): GLRB demonstrates sublinear regret, outperforming LinUCB (linear regret growth) and LLM-only (bias-induced suboptimality).
- Clinical Case Study (ACCORD dataset, hypertension management): Immutable features include age, gender, and cardiovascular disease history; mutable features are DietScore and PhyActHours. The action set comprises two treatment regimens (β-blocker and ACE+CCB+diuretic). The outcome metric is the reduction in systolic blood pressure (SBP) from baseline. GLRB achieves superior cumulative regret and SBP reduction, with LIBRA further enhancing early-phase performance through GPT-based warm-starting.
- Key Metrics: Recourse regret, sample efficiency, SBP drop, LLM query count (for LIBRA).
5. Model Assumptions and Practical Implementation
GLRB is based on the following conditions and addresses critical implementation considerations:
- Reward Model Assumptions: The link function $\mu$ is strictly increasing and $L$-Lipschitz, with derivative bounded below by $\kappa > 0$. Noise is conditionally $\sigma$-sub-Gaussian.
- Feasibility: Contexts and parameters are bounded; the recourse radius $\rho$ constrains feasible modifications.
- Norms: Typically Euclidean or Mahalanobis; for general semialgebraic norms, the coordinate-descent solver converges to a critical point under the Kurdyka–Łojasiewicz property.
- Algorithmic Efficiency: ORO subproblems per arm are solved efficiently via coordinate descent or semi-definite programming for Mahalanobis norms.
- Parameter Calibration: The recourse radius $\rho$ is chosen for domain plausibility; regularization and error thresholds are set by cross-validation.
- Human-in-the-Loop: Confidence radii may trigger domain-expert intervention, as operationalized in the LIBRA extension.
6. Extensions, Applications, and Broader Implications
GLRB supports multiple extensions and is amenable to deployment in high-stakes, recourse-relevant sectors:
- Model and Environment Extensions:
- Non-compliance modeling (adversarial and random), with robust/stochastic ORO variants.
- Nonlinear and nonparametric reward modeling.
- Meta-bandit approaches for selecting among expert oracles (e.g., multiple LLMs).
- Multi-objective optimization balancing recourse cost, rewards, and preferences.
- Practical Deployment:
- Efficient ORO solvers, rigorous parameter selection, and monitoring for autonomy thresholds.
- Applicability: Personalized medicine, credit allocation, and hiring practices benefit from algorithms capable of simultaneous action and feasible, minimal recourse proposals, with provable regret and fallback guarantees.
GLRB thus provides a principled approach to bandit learning in complex environments where recourse is not only possible but critical, integrating statistical rigor with actionable recommendations to facilitate trustworthy decision-making in sensitive domains (Cao et al., 17 Jan 2026).