Multi-Objective Linear Contextual Bandits

Updated 3 December 2025
  • Multi-objective linear contextual bandits are frameworks that optimize several conflicting linear objectives using contextual information in a sequential decision process.
  • Algorithmic approaches like MOGLB-UCB and MOL-TS utilize upper confidence bounds and Thompson sampling to estimate parameters and construct approximate Pareto fronts.
  • Theoretical guarantees and empirical results demonstrate near-optimal Pareto regret bounds and effective trade-off management in applications such as personalized recommendations and resource allocation.

The multi-objective linear contextual bandit problem extends the classical stochastic contextual bandit framework by requiring the simultaneous optimization of multiple, possibly conflicting, linear objectives based on contextual information. At each round, the learner selects an arm associated with a context vector and receives a vector-valued reward. The goal is to minimize Pareto regret, a metric quantifying proximity to the Pareto-optimal set of actions, rather than maximizing a single scalarized reward. This paradigm is central in applications such as personalized recommendations, resource allocation, and other multi-criteria decision processes, especially where explicit trade-offs among objectives must be managed.

1. Formal Problem Statement and Pareto Regret

At time $t$, the learner observes a finite or infinite arm set $\mathcal{A}_t \subset \mathbb{R}^d$, where each arm $a$ is associated with a context vector $x_{a,t} \in \mathbb{R}^d$. Upon selecting arm $a_t$, the learner receives a stochastic reward vector $r_t = (r_{t,1}, \ldots, r_{t,m}) \in \mathbb{R}^m$, such that for each objective $i$: $r_{t,i} = x_{a_t,t}^\top \theta_i + \eta_{t,i},\qquad \eta_{t,i}\ \text{is zero-mean, } \sigma\text{-subgaussian},$ with unknown parameters $\theta_i \in \mathbb{R}^d$, $i = 1, \dots, m$.
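The reward model above can be simulated directly. The following minimal sketch (all parameter values are hypothetical, and Gaussian noise stands in for general subgaussian noise) draws a vector-valued reward from the linear model for a given context:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 3     # context dimension, number of objectives
sigma = 0.1     # noise scale (Gaussian here, a special case of subgaussian)

# Unknown true parameters theta_i, one row per objective (hypothetical values).
Theta = rng.normal(size=(m, d))

def pull(x):
    """Return the stochastic reward vector r_t for a context x in R^d:
    r_{t,i} = x^T theta_i + eta_{t,i}, with eta_{t,i} ~ N(0, sigma^2)."""
    return Theta @ x + sigma * rng.normal(size=m)

x = rng.normal(size=d)
r = pull(x)     # vector-valued reward in R^m
```

Averaging many pulls of the same context recovers the expected reward vector $(x^\top\theta_1, \ldots, x^\top\theta_m)$, which is the quantity the Pareto machinery below operates on.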

The expected reward vector for arm $a$ at time $t$ is $\mu_{a,t} = (x_{a,t}^\top\theta_1, \ldots, x_{a,t}^\top\theta_m)$. Pareto dominance is defined as: $\mu_{a,t}$ dominates $\mu_{b,t}$ (denoted $\mu_{b,t} \prec \mu_{a,t}$) iff $\mu_{a,t}^{(i)} \ge \mu_{b,t}^{(i)}$ for all $i$ and $\mu_{a,t}^{(i)} > \mu_{b,t}^{(i)}$ for some $i$. The Pareto front $\mathcal{P}_t$ consists of arms not dominated by any other: $\mathcal{P}_t = \{a \in \mathcal{A}_t : \text{no } b \in \mathcal{A}_t \text{ satisfies } \mu_{a,t} \prec \mu_{b,t}\}$. Pareto regret, the key performance metric, is given by

$$\mathrm{PR}(T) = \sum_{t=1}^{T} \Delta_t,$$

where $\Delta_t = \inf\{\epsilon \ge 0 : \mu_{a_t,t} + \epsilon \mathbf{1} \text{ is not dominated by any } \mu_{a,t},\ a \in \mathcal{A}_t\}$ and $\mathbf{1} = (1, \ldots, 1) \in \mathbb{R}^m$. This represents the minimal uniform increment needed to move $\mu_{a_t,t}$ to the Pareto front (Park et al., 30 Nov 2025, Lu et al., 2019).
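These definitions translate directly into code. The sketch below computes Pareto dominance, the Pareto front, and the per-round gap $\Delta_t$ for a finite set of mean reward vectors (function names are illustrative, not from the cited papers):

```python
import numpy as np

def dominates(mu_a, mu_b):
    """mu_a Pareto-dominates mu_b: >= in every objective, > in at least one."""
    return bool(np.all(mu_a >= mu_b) and np.any(mu_a > mu_b))

def pareto_front(M):
    """Indices of non-dominated rows of M (one row = one arm's mean vector)."""
    return [i for i in range(len(M))
            if not any(dominates(M[j], M[i]) for j in range(len(M)))]

def pareto_gap(mu, M):
    """Minimal uniform increment eps such that mu + eps*1 is no longer
    dominated by any row of M (the per-round Pareto regret Delta_t)."""
    gaps = [np.min(M[j] - mu) for j in range(len(M)) if dominates(M[j], mu)]
    return max(gaps, default=0.0)
```

For example, with front vectors $(1,0)$, $(0,1)$, $(0.5,0.5)$, the played vector $(0.2, 0.2)$ has gap $0.3$: adding $0.3$ to both coordinates reaches $(0.5, 0.5)$, at which point no arm strictly dominates it.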

2. Algorithmic Approaches

Several distinct methodologies have been proposed, notably UCB-based and Thompson sampling–based algorithms, to address the exploration-exploitation tradeoff in this multi-objective setting.

Upper Confidence Bound Approaches (MOGLB-UCB)

For the case where the reward follows a (possibly generalized) linear model, the MOGLB-UCB algorithm maintains, for each objective, an online Newton-type parameter estimate and a confidence ellipsoid. At each round, for each arm and objective, an upper confidence bound (UCB) is constructed: $\mathrm{UCB}_{t,i}(a) = x_{a,t}^\top \hat\theta_{t,i} + \beta_t \|x_{a,t}\|_{A_t^{-1}}$, where $\hat\theta_{t,i}$ is the online estimate, $A_t$ the regularization matrix, and $\beta_t$ a parameter scaling with dimension and time. An approximate Pareto front $\hat{\mathcal{P}}_t$ is constructed via non-dominance in UCB space, and the algorithm selects an arm uniformly at random from this set. Updates ensue based on observed rewards (Lu et al., 2019).
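A single round of this selection rule can be sketched as follows. This is a simplification, not the paper's algorithm: plain per-objective estimates stand in for the online Newton step, and all argument names are illustrative:

```python
import numpy as np

def moglb_ucb_select(contexts, Theta_hat, A_inv, beta, rng):
    """One round of UCB-based Pareto selection (simplified sketch).

    contexts : (K, d) arm contexts x_{a,t}
    Theta_hat: (m, d) per-objective parameter estimates
    A_inv    : (d, d) inverse regularization matrix A_t^{-1}
    beta     : confidence-width parameter beta_t
    """
    # Confidence width ||x||_{A^{-1}}, shared across objectives here.
    width = np.sqrt(np.einsum('kd,de,ke->k', contexts, A_inv, contexts))
    ucb = contexts @ Theta_hat.T + beta * width[:, None]   # (K, m) UCB matrix
    # Approximate Pareto front in UCB space: non-dominated arms.
    front = [i for i in range(len(ucb))
             if not any(np.all(ucb[j] >= ucb[i]) and np.any(ucb[j] > ucb[i])
                        for j in range(len(ucb)))]
    return rng.choice(front)   # uniform over the approximate front
```

The uniform choice over the non-dominated set is what yields the fairness-among-optimal-arms property discussed in the empirical section.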

Thompson Sampling Approaches (MOL-TS)

The MOL-TS algorithm independently samples parameter vectors $\tilde\theta_{t,i}$ from the posterior for each objective. For each arm, the induced sampled reward vector $\tilde\mu_{a,t} = (x_{a,t}^\top\tilde\theta_{t,1}, \ldots, x_{a,t}^\top\tilde\theta_{t,m})$ defines a “sampled” Pareto front $\tilde{\mathcal{P}}_t$. The algorithm selects from $\tilde{\mathcal{P}}_t$, observes rewards, and updates Bayesian parameter posteriors. This approach achieves a worst-case Pareto regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$, closely paralleling the single-objective randomized linear bandit rate (Park et al., 30 Nov 2025).
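One round of this sampling scheme can be sketched as below, assuming independent Gaussian posteriors per objective (the posterior family and all names are illustrative simplifications, not the paper's exact construction):

```python
import numpy as np

def mol_ts_round(contexts, mu_post, Sigma_post, rng):
    """One round of Thompson sampling over multiple objectives (sketch).

    contexts  : (K, d) arm contexts
    mu_post   : (m, d) posterior means, one row per objective
    Sigma_post: (m, d, d) posterior covariances, one per objective
    """
    m, d = mu_post.shape
    # Independently sample one parameter vector per objective.
    theta_tilde = np.stack([rng.multivariate_normal(mu_post[i], Sigma_post[i])
                            for i in range(m)])
    sampled = contexts @ theta_tilde.T      # (K, m) sampled reward vectors
    # Sampled Pareto front: arms non-dominated under the sampled rewards.
    front = [i for i in range(len(sampled))
             if not any(np.all(sampled[j] >= sampled[i]) and
                        np.any(sampled[j] > sampled[i])
                        for j in range(len(sampled)))]
    return rng.choice(front)
```

Posterior randomness plays the exploratory role that the explicit confidence width $\beta_t$ plays in the UCB variant: arms with uncertain parameters occasionally produce optimistic samples and enter the sampled front.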

3. Theoretical Guarantees and Minimax Bounds

State-of-the-art regret bounds are summarized as follows:

| Algorithm | Regret Bound | Assumptions | Reference |
|-----------|--------------|-------------|-----------|
| MOGLB-UCB | $\tilde{O}(d\sqrt{T})$ | Generalized linear (incl. linear) | (Lu et al., 2019) |
| MOL-TS | $\tilde{O}(d^{3/2}\sqrt{T})$ | Linear reward, subgaussian noise | (Park et al., 30 Nov 2025) |

For scalarization-based reductions, performance is typically suboptimal in terms of multi-objective regret, particularly in covering the true Pareto front.

Lower bounds indicate that for linear contextual bandits with $m$ objectives, the minimax rate matches the single-objective case up to factors depending on the parameter space geometry and the number of objectives (Lu et al., 2019, Park et al., 30 Nov 2025). In constrained variants (e.g., linear costs), the regret scales as $\tilde{O}(\sqrt{T}/(\tau - c_0))$, where $\tau$ is a constraint threshold and $c_0$ is the known safe cost (Pacchiano et al., 2020, Pacchiano et al., 2024).

4. Extensions: Constraints, Knapsack Structures, and Beyond

Incorporating explicit constraints transforms the problem into a multi-objective control scenario. There are two main constraint models:

  • Stage-wise Linear Constraints: At each round, the selected arm must satisfy cost constraints either with high probability or in expectation. UCB-based “optimistic–pessimistic” methods are used, employing distinct scaling factors for reward and cost confidence sets (Pacchiano et al., 2024).
  • Global Knapsack Constraints: The total accumulated cost across $T$ rounds must not exceed a budget. Algorithms reduce the multi-objective problem to an instance of contextual bandits with a known Lipschitz concave (or linear) objective, employing importance-weighted estimates and epoch-based policy optimization using oracles over policy classes (Agrawal et al., 2015).

Regret analysis incorporates cost slackness, with minimax lower bounds quantifying the "price of safety" in constraint satisfaction (Pacchiano et al., 2020, Pacchiano et al., 2024). Specific algorithms for knapsack constraints achieve computational efficiency by leveraging coordinate-descent solvers with optimization oracles, maintaining near-optimal rates.
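The stage-wise “optimistic–pessimistic” idea can be illustrated for a single reward and a single linear cost: be optimistic about reward, pessimistic about cost, and only consider arms whose pessimistic cost estimate stays below the threshold. All names and the shared-width simplification here are illustrative assumptions, not the cited algorithms:

```python
import numpy as np

def safe_optimistic_select(contexts, theta_hat, w_hat, A_inv,
                           beta_r, beta_c, tau):
    """Stage-wise constrained selection sketch: optimistic reward UCB,
    pessimistic cost UCB, feasibility filter against threshold tau.

    theta_hat: (d,) reward parameter estimate
    w_hat    : (d,) cost parameter estimate
    beta_r, beta_c: distinct scaling factors for reward and cost widths
    """
    width = np.sqrt(np.einsum('kd,de,ke->k', contexts, A_inv, contexts))
    reward_ucb = contexts @ theta_hat + beta_r * width   # optimistic reward
    cost_ucb = contexts @ w_hat + beta_c * width         # pessimistic cost
    feasible = np.flatnonzero(cost_ucb <= tau)
    if feasible.size == 0:
        return None   # a full algorithm would fall back to a known safe arm
    return int(feasible[np.argmax(reward_ucb[feasible])])
```

Shrinking the slack $\tau - c_0$ shrinks the feasible set, which is the mechanism behind the $1/(\tau - c_0)$ factor in the constrained regret bounds above.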

5. Empirical Performance and Practical Considerations

Empirical results are reported across synthetic and real-world datasets:

  • MOGLB-UCB outperforms P-UCB, S-UCB (scalarized), and P-TS baselines in both cumulative Pareto regret and Jaccard similarity to the Pareto front (Lu et al., 2019).
  • MOL-TS achieves lower Pareto and hypervolume regret than scalarization-based TS and UCB variants, and empirically converges to the Pareto front more rapidly (Park et al., 30 Nov 2025).
  • Symmetry of random selection from the (approximate) Pareto front ensures fairness among optimal arms, an important property in multi-objective fairness-sensitive contexts (Lu et al., 2019).
  • In constraint scenarios, empirical cost is always maintained below threshold, while regret grows as cost slack reduces (Pacchiano et al., 2020, Pacchiano et al., 2024).

6. Open Questions and Research Directions

Prospective research areas include:

  • Extension to generalized linear and nonlinear reward models beyond the linear parametrization (Park et al., 30 Nov 2025, Lu et al., 2019).
  • Incorporation of dominant or prioritized objectives, and development of instance-dependent (gap-dependent) Pareto regret lower bounds.
  • Efficient arm selection and regret minimization in high-dimensional, combinatorial, or neural-represented action spaces.
  • Batched or parallel multi-objective selection under multi-constraint structures.
  • Algorithmic improvements for cumulative or non-linear cost and resource constraints, including adaptive confidence scaling and efficient support-lemma based exploitations (Pacchiano et al., 2024, Agrawal et al., 2015).

Recent advances provide a cohesive foundation for multi-objective linear contextual bandits with provably efficient, scalable, and fair learning algorithms, but scaling, expressivity, and nuanced trade-offs among objectives remain active and challenging research areas.
