
Task Sampling Bandits

Updated 12 February 2026
  • Task Sampling Bandits are adaptive methods that allocate computational resources across multiple tasks using multi-armed bandit strategies.
  • They leverage techniques like posterior aggregation and gradient-based influence to enable robust transfer in multi-task and neural combinatorial optimization.
  • The framework provides theoretical regret guarantees and addresses challenges in inter-task similarity and adaptive task scheduling.

Task sampling bandits are a class of methodologies for adaptively allocating learning or computational resources across multiple tasks using multi-armed bandit (MAB) or related online decision-making algorithms. The canonical objective is to improve overall performance in multi-task learning or transfer learning settings by leveraging inter-task relationships, empirical reward aggregation, or influence-based criteria to select which task to optimize at each round. This approach is prominent both in robust transfer learning for bandit settings and in accelerating neural multi-task combinatorial optimization, among other domains.

1. Problem Formulation and Multi-Task Bandit Frameworks

In task sampling bandits, the foundational structure is a multi-task bandit environment. There are $M$ tasks (often termed "players"), and each task faces a $K$-armed stochastic bandit with independent, unknown reward distributions. For task $p$ and arm $i$, the reward distribution is denoted $D_i^p$, supported on $[0,1]$, with mean $\mu_i^p$. Inter-task similarity is enforced by a dissimilarity parameter $\epsilon \in [0,1]$, such that $|\mu_i^p - \mu_i^q| \leq \epsilon$ for all tasks $p, q$ and arms $i$.

At each round $t$, a possibly varying subset of tasks $P_t \subset [M]$ becomes active (chosen by an adversary or according to some schedule). Each active task $p \in P_t$ independently selects an arm $i_t^p$, observes reward $r_t^p \sim D_{i_t^p}^p$, and shares $(i_t^p, r_t^p)$ with the other tasks. The aggregate regret is

$$R(T) = \sum_{p=1}^M \sum_{t:\, p \in P_t} \left(\mu_*^p - \mu_{i_t^p}^p\right) = \sum_{i=1}^K \sum_{p=1}^M \mathbb{E}\left[n_i^p(T)\right] \Delta_i^p,$$

where $\mu_*^p = \max_j \mu_j^p$, $\Delta_i^p = \mu_*^p - \mu_i^p$, and $n_i^p(T)$ is the number of times task $p$ pulls arm $i$ up to horizon $T$ (Wang et al., 2022).
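To make the formulation concrete, the following is a minimal simulation sketch (not taken from the cited work): it instantiates an $\epsilon$-similar multi-task environment and accumulates the aggregate regret of a uniform-random baseline policy. The Bernoulli reward model and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instantiation: M tasks, K arms, per-task means that differ by
# at most epsilon across tasks (the dissimilarity constraint from the text).
M, K, eps, T = 4, 5, 0.1, 2000
base_means = rng.uniform(0.2, 0.8, size=K)             # shared "prototype" means
mu = np.clip(base_means + rng.uniform(-eps / 2, eps / 2, size=(M, K)), 0, 1)

def pull(task, arm):
    """Bernoulli reward with mean mu[task, arm], supported on [0, 1]."""
    return float(rng.random() < mu[task, arm])

# Aggregate regret of uniform-random play:
# R(T) = sum over active (task, round) pairs of (mu*_p - mu_{chosen arm}).
regret = 0.0
for t in range(T):
    active = range(M)                                   # here every task is active each round
    for p in active:
        arm = int(rng.integers(K))
        _r = pull(p, arm)                               # observed reward (shared in the full protocol)
        regret += mu[p].max() - mu[p, arm]

print(round(regret / (M * T), 3))                       # mean per-pull regret of uniform play
```

Any learning policy, e.g. the posterior-aggregation scheme of Section 2.1, would be evaluated by replacing the uniform arm choice and comparing the resulting $R(T)$.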

In deep multi-task learning for combinatorial optimization, the task set $\mathcal{T} = \{T_j^i\}$ indexes every combinatorial problem type and instance size, and a unified model is updated by selecting one task per round for training. Task selection thus becomes a sequential decision process, naturally cast as a bandit task-sampling problem (Wang et al., 2023).

2. Core Algorithms: Transfer, Posterior Sampling, and Influence-based Scheduling

A range of algorithmic strategies are used to implement task sampling bandits:

2.1 Robust Transfer via Posterior Aggregation

The "TS-RoboAgg" algorithm exemplifies transfer-aware posterior sampling for multi-task bandits (Wang et al., 2022). Its key features are:

  • Posterior selection: Each task, for each arm, maintains both an individual posterior (based only on its own data) and an aggregate posterior (combining data from all tasks), both Gaussian.
  • Adaptive trust: Early in training (when $n_i^p$ is small), decisions are based on the aggregate posterior (with an added bias $\epsilon$), reflecting trust in peer information. Once sufficient individual experience accumulates ($n_i^p \geq c_1 \ln T / \epsilon^2 + 2M$), the algorithm switches to the unbiased, individual posterior, preventing negative transfer from dissimilar tasks.
  • Sampling and selection: In each round, tasks sample from their chosen posterior for each arm, select the arm with maximal sample value, observe reward, and update only the corresponding (individual and aggregate) posterior and counts.

This switchover threshold is crucial for robust performance, balancing rapid subpar-arm elimination via transfer against late-stage protection from bias (Wang et al., 2022).
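The switchover rule can be sketched for a single arm of a single task as follows, assuming Gaussian posteriors whose variance shrinks as $1/n$; the constant $c_1$, the counts, and the simplified bias handling are illustrative placeholders, not the full TS-RoboAgg pseudocode.

```python
import math
import random

random.seed(0)

# Placeholder constants; c1 and the threshold form follow the text.
c1, eps, T, M = 2.0, 0.15, 50_000, 20
switch_at = c1 * math.log(T) / eps**2 + 2 * M           # n_i^p threshold from the text

def sample_posterior(total_reward, count, bias=0.0):
    """Gaussian posterior sample: mean = empirical mean + bias, variance ~ 1/count."""
    mean = total_reward / max(count, 1) + bias
    std = 1.0 / math.sqrt(max(count, 1))
    return random.gauss(mean, std)

def index_for_arm(own_sum, own_n, agg_sum, agg_n):
    if own_n < switch_at:
        # early phase: trust peers, sample the aggregate posterior with +eps bias
        return sample_posterior(agg_sum, agg_n, bias=eps)
    # late phase: enough individual data, use the unbiased individual posterior
    return sample_posterior(own_sum, own_n)

# With few own pulls, the index comes from the biased aggregate posterior;
# with many own pulls, it comes from the unbiased individual posterior.
early = index_for_arm(own_sum=3.0, own_n=5, agg_sum=600.0, agg_n=1000)
late = index_for_arm(own_sum=5200.0, own_n=10_000, agg_sum=600.0, agg_n=1000)
```

In the full algorithm each task runs this per arm, plays the arm with the maximal sampled index, and updates both its individual and the shared aggregate statistics.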

2.2 Gradient-Influence Bandit Sampling for Neural Multi-Task Learning

For deep neural combinatorial solvers, the primary innovation is the use of a loss-gradient decomposition to define task-level scalar rewards (Wang et al., 2023):

  • Reward construction: For each task $T_j^i$, rewards are derived from cosine similarities between the gradients produced by training on that task and their resultant effects (influence) on all tasks, captured in a sparse, block-structured "influence matrix".
  • Bandit selection: These reward signals feed into an adversarial bandit algorithm (typically Exp3, sometimes with discounting or restarts), which adaptively samples tasks to train at each step, favoring those with high influence on overall progress.

Empirically, this bandit-based sampling narrows optimality gaps under tight budgets and accelerates convergence relative to uniform or hand-tuned scheduling—at negligible extra per-step computation (Wang et al., 2023).
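The scheduling loop above can be sketched as an Exp3 sampler fed by cosine-similarity rewards. The gradients here are simulated stand-ins, and all hyperparameters ($\gamma$, the cluster structure, the reward rescaling) are chosen purely for illustration; this is not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

num_tasks, dim, steps, gamma = 6, 32, 500, 0.1
weights = np.ones(num_tasks)

def task_gradient(j):
    """Simulated per-task gradient; related tasks share a common direction."""
    shared = np.sin(np.arange(dim) * (1 + j % 2))       # two loose task clusters
    return shared + 0.5 * rng.standard_normal(dim)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

counts = np.zeros(num_tasks)
for _ in range(steps):
    # Exp3 mixture: exponential weights plus uniform exploration.
    probs = (1 - gamma) * weights / weights.sum() + gamma / num_tasks
    j = int(rng.choice(num_tasks, p=probs))
    counts[j] += 1
    g_j = task_gradient(j)
    # reward in [0, 1]: average influence of task j's gradient on every task
    sims = [cosine(g_j, task_gradient(k)) for k in range(num_tasks)]
    reward = (np.mean(sims) + 1) / 2
    # standard Exp3 importance-weighted update, renormalized for stability
    weights[j] *= np.exp(gamma * reward / (probs[j] * num_tasks))
    weights /= weights.max()
```

In an actual solver, `task_gradient` would be replaced by the gradients of the unified model's loss on a batch from each task, with the similarity matrix recomputed periodically rather than per step.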

3. Theoretical Performance Guarantees

Task sampling bandits exhibit varying regret properties based on the employed algorithm and the assumed structure of task similarity.

3.1 Gap-dependent and Gap-independent Regret Bounds

For the TS-RoboAgg algorithm in robust transfer, gap-dependent collective regret is bounded as

$$R(T) \leq O\left(\frac{1}{M} \sum_{i \in I_{10\epsilon}} \sum_{p:\, \Delta_i^p > 0} \frac{\ln T}{\Delta_i^p} + \sum_{i \notin I_{10\epsilon}} \sum_{p:\, \Delta_i^p > 0} \frac{\ln T}{\Delta_i^p} + M^2 K\right)$$

with $I_{10\epsilon}$ the set of arms suboptimal by at least $10\epsilon$ for some task (Wang et al., 2022).

Gap-independent regret (for total pulls $P = \sum_t |P_t|$) obeys

$$R(T) \lesssim \sqrt{|I_{10\epsilon}|\, P} + \sqrt{M \left(|I_{10\epsilon}^C| - 1\right) P} + M^2 K,$$

with improvements scaling as $O(1/M)$ or $O(1/\sqrt{M})$ over independent per-task learning in shared-gap regimes (Wang et al., 2022).

3.2 Bandit MTL Sampling and Lack of End-to-end Regret Guarantees

For MTL combinatorial optimization with loss-gradient rewards, the classical Exp3 adversarial regret bound $O(\sqrt{T |\mathcal{T}| \ln |\mathcal{T}|})$ applies formally once rewards are constructed, but no explicit end-to-end regret bound is provided, owing to the inexact nature of the reward construction and the approximations in gradient estimation (Wang et al., 2023).
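Purely as an illustration of the rate (constants and reward-construction error are ignored), evaluating $\sqrt{T |\mathcal{T}| \ln |\mathcal{T}|}$ for a hypothetical 12-task setting shows the per-round share of the bound shrinking with the horizon:

```python
import math

# Illustrative evaluation of the Exp3 rate sqrt(T * |T| * ln|T|) for |T| = 12 tasks.
num_tasks = 12
rows = []
for T in (1_000, 10_000, 100_000):
    bound = math.sqrt(T * num_tasks * math.log(num_tasks))
    rows.append((T, bound, bound / T))                  # total rate and its per-round share
    print(T, round(bound, 1), round(bound / T, 4))
```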

4. Empirical Evaluations and Observed Performance

4.1 Multi-task Bandit Transfer

Empirical assessments ($M = 20$ tasks, $K = 10$ arms, $\epsilon = 0.15$, $T = 50{,}000$) show TS-RoboAgg consistently achieves the lowest cumulative regret, outperforming both individual Thompson Sampling and UCB-based transfer. Gains are pronounced when a moderate or large proportion of arms are subpar, and robust performance is maintained even when no subpar arms exist (Wang et al., 2022).

4.2 Neural Combinatorial Solvers

On 12-task combinatorial benchmarks (e.g., TSP/CVRP/OP/KP at three instance sizes each), bandit-based task sampling halves the mean optimality gap under tight training budgets and matches the per-task performance of strong single-task learners with substantially reduced model count and training time. Exp3-based samplers outperform pure Thompson Sampling due to improved handling of nonstationary task rewards and adversarial task dependencies (Wang et al., 2023).

5. Practical Considerations, Robustness, and Open Problems

5.1 Robustness Mechanisms

Robust transfer is achieved by (i) using aggregate posteriors at low sample counts (variance reduction at the cost of accepted bias), (ii) a clear switchover to unbiased per-task posteriors once sufficient data accrues, and (iii) incorporating an explicit similarity parameter $\epsilon$ into both algorithm selection and regret bounds, thus localizing model risk (Wang et al., 2022). In task-influence scheduling, robustness arises from frequent reward recomputation and block-diagonal separation in the influence matrix, effectively partitioning unrelated tasks (Wang et al., 2023).

5.2 Limitations

  • The transfer algorithms require knowledge of $\epsilon$; if $\epsilon$ is unknown, sublinear regret cannot be guaranteed for all values.
  • Lower-order additive terms proportional to $M^2 K$, as well as higher per-step cost, arise from initialization and the aggregation structure.
  • The bandit-influenced neural multi-task strategies lack end-to-end theoretical guarantees and rely on empirical validation.
  • Bandit-driven task schedulers are not "anytime": the horizon $T$ must be known in advance for some algorithms.

5.3 Open Research Directions

  • Tightening the gap threshold for transfer regret bounds, potentially matching best-known UCB-based rates (Wang et al., 2022).
  • Extending methodologies to contextual or non-Bernoulli rewards and to cases with unknown task similarity.
  • Application to settings such as federated learning, distributed robotics, or other domains where cross-task nonstationarity and communications constraints are present.
  • Formal regret analysis for influence-matrix-based bandit scheduling as used in neural MTL (Wang et al., 2023).

6. Connection to Broader Bandit and Resource Allocation Literature

Task sampling bandits lie at the intersection of classical MAB theory, robust transfer learning, and resource-efficient optimization. Algorithms such as Information-Directed Sampling (IDS) (Hirling et al., 2025), Dirichlet Sampling (Baudry et al., 2021), and recent computationally-adaptive Thompson Sampling algorithms (Hu et al., 2024) provide alternate perspectives on exploration-exploitation, information gain, and robust decision making. Despite differences in model assumptions and objectives, a unifying theme is the principled allocation of decision steps to maximize overall learning under structural constraints—be they task similarity, influence, or computational resource budgets.

