Multi-Dueling Bandits
- Multi-Dueling Bandit (MDB) is a framework that compares multiple arms simultaneously using pairwise duels or winner-selection to handle subjective, noisy feedback.
- Algorithms like MultiRUCB and SelfSparring efficiently balance exploration and exploitation, achieving favorable cumulative regret bounds in both stochastic and adversarial settings.
- MDB underpins scalable online ranker evaluation and active preference learning, addressing challenges such as multileaving distortion and dynamic feedback in complex environments.
The multi-dueling bandit (MDB) problem extends the classical dueling bandit framework by permitting the comparison of multiple arms simultaneously via subjective, noisy feedback, typically in the form of pairwise duels or winner-selection from a subset. MDB settings generalize both the standard two-arm dueling regime and winner-feedback models, with numerous applications in online ranker evaluation, user feedback systems, and active preference learning. The key performance metric is cumulative regret with respect to a benchmark arm or winner (e.g., Condorcet or Borda), and algorithms must efficiently balance exploration and exploitation while contending with inherent feedback stochasticity, possibly in adversarial environments.
1. Formal Definition and Problem Structure
The MDB framework considers a finite set of arms $\{1, \dots, K\}$, where each arm $i$ possesses an unknown utility $u_i$ (Du et al., 2022). The learner sequentially selects a subset $S_t$ of size $m$ at each round $t$ (with $2 \le m \le K$), and receives noisy comparative feedback. In the stochastic setting, this consists of the outcomes of all duels within $S_t$; in adversarial variants, only the single most-preferred ("winner") arm of $S_t$ may be observed (Gajane, 2024).
Pairwise win probabilities are determined by a link function, typically linear or logistic in the utility gap, or specified directly by a preference matrix $P = [p_{ij}]$, where $p_{ij} + p_{ji} = 1$ and $p_{ii} = 1/2$ (Du et al., 2022, Brost et al., 2016). In the Condorcet winner paradigm, a unique arm $a^*$ with $p_{a^* j} > 1/2$ for all $j \neq a^*$ is assumed.
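The link-function construction can be sketched as follows. Both links shown are illustrative assumptions (the cited papers specify their own link functions), and `preference_matrix` is a hypothetical helper name:

```python
import numpy as np

def preference_matrix(utilities, link="logistic"):
    """Build a pairwise preference matrix P from latent arm utilities.
    P[i, j] is the probability that arm i beats arm j; both links here
    are illustrative choices, not the papers' exact specifications."""
    u = np.asarray(utilities, dtype=float)
    diff = u[:, None] - u[None, :]          # utility gaps u_i - u_j
    if link == "logistic":
        P = 1.0 / (1.0 + np.exp(-diff))     # sigmoid of the gap
    else:                                   # "linear": assumes gaps in [-1, 1]
        P = 0.5 + diff / 2.0
    np.fill_diagonal(P, 0.5)                # p_ii = 1/2 by convention
    return P
```

Either link yields a valid preference matrix with $p_{ij} + p_{ji} = 1$, and the arm with the highest utility is the Condorcet winner.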
The expected cumulative regret is defined as
$$R(T) = \mathbb{E}\left[\sum_{t=1}^{T} \frac{1}{|S_t|} \sum_{i \in S_t} \Delta_i\right],$$
with $\Delta_i = p_{a^* i} - 1/2$ (Du et al., 2022), or in the Borda-adversarial setting
$$R(T) = \max_{i} \sum_{t=1}^{T} b_t(i) - \mathbb{E}\left[\sum_{t=1}^{T} \frac{1}{m} \sum_{i \in S_t} b_t(i)\right],$$
where $b_t(i) = \frac{1}{K-1} \sum_{j \neq i} p_t(i, j)$ is the Borda score of arm $i$ at round $t$ (Gajane, 2024). For $m = 2$, the setting recovers standard dueling bandits.
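The Condorcet-style cumulative regret can be accumulated in simulation. The sketch below assumes per-round averaging of gaps over the chosen subset, which is one common normalization rather than necessarily the papers' exact convention, and `cumulative_regret` is a hypothetical helper name:

```python
import numpy as np

def cumulative_regret(P, chosen_subsets):
    """Cumulative Condorcet regret of chosen subsets S_1..S_T, using the
    per-round average gap over the subset (one common normalization)."""
    P = np.asarray(P, dtype=float)
    # Condorcet winner: the arm that beats all K-1 others w.p. > 1/2
    a_star = int(np.argmax((P > 0.5).sum(axis=1)))
    gaps = P[a_star] - 0.5                  # Delta_i for every arm i
    return sum(gaps[list(S)].mean() for S in chosen_subsets)
```

Playing the Condorcet winner alone contributes zero regret in each round, while every suboptimal arm in $S_t$ contributes its gap.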
2. Algorithmic Advances in Stochastic MDB
Several algorithmic strategies have been developed for the stochastic MDB problem. Early work extended the relative UCB principle via the MDB algorithm, which maintains empirical counts of pairwise wins/losses and constructs narrow and wide confidence bounds for adaptive subset selection (Brost et al., 2016). This approach iteratively restricts the arm set using both conservative and aggressive elimination criteria.
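A minimal sketch of the dual-width confidence bounds described above, assuming Hoeffding-style slack terms; `c_narrow` and `c_wide` are placeholder width constants, not the paper's tuned values:

```python
import numpy as np

def dual_confidence_bounds(wins, plays, t, c_narrow=1.0, c_wide=3.0):
    """Narrow and wide confidence intervals on pairwise win rates, in the
    spirit of the dual-bound elimination strategy. Unplayed pairs default
    to the uninformative estimate 1/2."""
    wins = np.asarray(wins, dtype=float)
    plays = np.asarray(plays, dtype=float)
    mean = np.divide(wins, plays, out=np.full(wins.shape, 0.5),
                     where=plays > 0)
    slack = np.sqrt(np.log(t + 1.0) / np.maximum(plays, 1.0))
    return {"narrow": (mean - c_narrow * slack, mean + c_narrow * slack),
            "wide":   (mean - c_wide * slack, mean + c_wide * slack)}
```

The wide (conservative) interval would drive permanent elimination, while the narrow (aggressive) interval would steer per-round subset selection.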
A significant development is the MultiRUCB algorithm (Du et al., 2022), which formalizes the candidate set of arms whose upper-confidence relative win probability exceeds $1/2$ against all others. It adaptively chooses $S_t$ based on this candidate set, prioritizing the hypothesized best arm for exploitation where feasible and otherwise resorting to uniform random exploration or subset-based exploitation. The algorithm guarantees that only plausible winner arms (according to UCBs) are repeatedly compared, accelerating the removal of suboptimal candidates as $m$ increases.
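The candidate-set computation can be sketched as follows; the exploration constant `alpha` and the exact UCB form are illustrative, not the paper's tuned choices:

```python
import numpy as np

def candidate_set(wins, plays, t, alpha=0.51):
    """Plausible-winner set: arms whose upper-confidence win probability
    exceeds 1/2 against every other arm. Unplayed pairs default to the
    uninformative estimate 1/2 plus full slack."""
    wins = np.asarray(wins, dtype=float)
    plays = np.asarray(plays, dtype=float)
    K = wins.shape[0]
    mean = np.divide(wins, plays, out=np.full((K, K), 0.5), where=plays > 0)
    ucb = mean + np.sqrt(alpha * np.log(t + 1.0) / np.maximum(plays, 1.0))
    mask = ucb > 0.5
    np.fill_diagonal(mask, True)          # self-comparisons are ignored
    return [i for i in range(K) if mask[i].all()]
```

Once an arm's UCB against some opponent drops below $1/2$, it can no longer be the Condorcet winner with high probability and leaves the candidate set.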
The SelfSparring reduction (Sui et al., 2017) maps the MDB problem into parallel stochastic bandit subproblems, employing, e.g., Thompson Sampling; when arms are correlated, it integrates utility modeling via a Gaussian process prior. This leverages smoothness or prior structure to share information across arms, dramatically reducing sampling complexity in high-dimensional or continuous-action spaces.
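A minimal sketch of the SelfSparring idea for independent arms, using Beta-Bernoulli Thompson sampling. As a simplification, this takes the top-$m$ arms of one posterior draw rather than reproducing the paper's exact sampling scheme, and the helper names are hypothetical:

```python
import numpy as np

def self_sparring_round(wins, losses, m, rng):
    """One round of a SelfSparring-style reduction: draw a Thompson
    sample per arm from its Beta posterior and duel the top-m arms."""
    theta = rng.beta(wins + 1.0, losses + 1.0)   # one sample per arm
    return list(np.argsort(theta)[-m:])          # top-m sampled arms

def update_from_winner(wins, losses, subset, winner):
    """Credit the duel winner and debit every other arm in the subset."""
    for a in subset:
        if a == winner:
            wins[a] += 1.0
        else:
            losses[a] += 1.0
```

Arms that keep winning duels concentrate their posteriors near high win rates and are sampled ever more often, which is the exploitation half of the trade-off.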
3. Regret Guarantees and Theoretical Analyses
For the stochastic case with full pairwise feedback, MultiRUCB achieves the first finite-time expected regret bound for MDB, with scaling constants that decrease as $m$ increases (Du et al., 2022). The core analysis decomposes "mistake" rounds into cases based on the current candidate set and exploits concentration inequalities (Chernoff-Hoeffding) for pairwise estimates. The regret coefficients scale inversely with the number $m$ of dueling arms, as does the $\log T / \Delta_j^2$ term for each suboptimal arm $j$, where $\Delta_j = p_{a^* j} - 1/2$.
SelfSparring achieves asymptotically optimal regret under the "Approximate Linearity" assumption, which generalizes the linear link-function setting (Sui et al., 2017). For dependent-arm settings, regret scales with the kernelized information gain, as in standard GP-bandit theory, though formal finite-time bounds in this regime remain largely open.
In adversarial MDB, the MiDEX algorithm extends the EXP3 framework with a contest design that simulates effective dueling between two arms using winner feedback from $m$-wise subsets (Gajane, 2024). Its expected regret is upper bounded by $O((K \log K)^{1/3} T^{2/3})$, matching a lower bound of $\Omega(K^{1/3} T^{2/3})$ up to logarithmic factors, rendering MiDEX near-minimax-optimal for the adversarial setting.
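The exponential-weights-with-winner-feedback idea can be sketched as below. This is a simplified EXP3-style loop, not the MiDEX algorithm itself, whose contest design and unbiased preference estimators are more involved; `eta`, `gamma`, and the estimator are illustrative:

```python
import numpy as np

def exp3_winner_feedback(K, m, T, observe_winner, eta=0.05, gamma=0.1,
                         seed=0):
    """Exponential weights driven only by winner feedback on m-wise
    subsets. The importance-weighted estimator credits the observed
    winner with reward 1/p[winner]; weights are renormalized each round
    to keep them numerically bounded."""
    rng = np.random.default_rng(seed)
    w = np.ones(K)
    for _ in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / K   # explore-mixed dist.
        subset = rng.choice(K, size=m, replace=False, p=p)
        winner = observe_winner(subset)
        est = np.zeros(K)
        est[winner] = 1.0 / p[winner]               # importance weighting
        w *= np.exp(eta * est)
        w /= w.max()                                # keep weights bounded
    return w / w.sum()
```

With a fixed "always-wins" arm, the returned distribution concentrates on it, e.g. `exp3_winner_feedback(5, 2, 1000, lambda S: int(min(S)))` concentrates on arm 0.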
4. Empirical Evaluation and Practical Design Considerations
Comprehensive synthetic and real-data experiments have established the superiority of modern MDB algorithms relative to dueling-bandit baselines. For example, MultiRUCB achieves the lowest observed cumulative regret and variance across a range of synthetic utility distributions, outperforming MDB, MultiSparring, and SelfSparring (Du et al., 2022). In ranker-evaluation scenarios such as MSLR-WEB30K and large Yahoo datasets, MDB strategies reduce cumulative regret by orders of magnitude compared to relative UCB, MergeRUCB, or RMED1 (Brost et al., 2016). Crucially, the advantage of MDB grows rapidly with the number of arms $K$ and the enabled subset size $m$.
Numerical results also highlight algorithmic robustness to multileaving "distortion" (deviation between the true and empirically derived preference probabilities under large $m$) and confirm that practical MDB deployments can scale to hundreds of arms per round. The dual-width UCB parameters and the choice of multileaving mechanism are pivotal for balancing exploration and computational load, with the Summed-Output-Score Multileaving (SOSM) approach demonstrating high accuracy and low distortion.
5. Extensions: Dependent Arms and Functional Inference
When arms are dependent, e.g., parametrized or continuous objects, MDB models integrated with Gaussian process priors have proven effective (Sui et al., 2017). Here, full or partial pairwise observations are incorporated via Bayesian posterior updates, and selection strategies (GP-Thompson sampling, GP-UCB-inspired methods) optimize regret with respect to the unknown arm utility function. The SelfSparring reduction generalizes naturally to this setting: at each round, Thompson samples are drawn from the GP posterior, their maximum locations determine $S_t$, and pairwise or winner feedback drives the posterior updates.
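A hedged sketch of GP-based Thompson subset selection over a 1-D arm space; the RBF kernel, its hyperparameters, and the helper name are illustrative assumptions rather than the paper's exact model:

```python
import numpy as np

def gp_thompson_subset(x_obs, y_obs, x_grid, m, length_scale=0.2,
                       noise=1e-4, seed=0):
    """Draw m Thompson samples from a GP posterior over candidate arms
    on x_grid; each sample's argmax contributes one candidate arm, so
    the returned subset has at most m distinct arms."""
    rng = np.random.default_rng(seed)

    def rbf(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2
                      / length_scale ** 2)

    K_oo = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_go = rbf(x_grid, x_obs)
    mu = K_go @ np.linalg.solve(K_oo, y_obs)               # posterior mean
    cov = rbf(x_grid, x_grid) - K_go @ np.linalg.solve(K_oo, K_go.T)
    cov += 1e-9 * np.eye(len(x_grid))                      # jitter for PSD
    samples = rng.multivariate_normal(mu, cov, size=m)
    return sorted({int(np.argmax(s)) for s in samples})
```

Because samples from a smooth GP posterior agree near well-observed high-utility regions, the selected subsets shrink toward the utility maximizer as observations accumulate.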
The resulting empirical and (in some cases) theoretical improvements are especially pronounced as the degree of functional smoothness (embodied in the GP kernel) increases or when arm pools are large and structured.
6. Adversarial Multi-Dueling and Lower Bounds
Adversarial MDB introduces additional challenges, as the preference matrix may change per round, requiring algorithms to be robust against dynamic, possibly worst-case preferences (Gajane, 2024). The MiDEX algorithm addresses this via an exponential-weight update over arms and a winner-feedback reduction to two-arm dueling. It leverages a pairwise-subset choice model to reconstruct unbiased estimates of pairwise preferences, proving that even with only winner feedback among $m$-wise subsets, the $T^{2/3}$ regret scaling is optimal.
A critical theoretical insight is that adversarial MDB cannot improve upon the dueling ($m = 2$) bound; lower-bound reductions show both settings are equivalent in terms of minimax regret scaling.
7. Practical Impact and Open Directions
The MDB paradigm underpins the design of scalable and efficient online ranker evaluation platforms, enabling simultaneous assessment of many candidate algorithms or configurations (Brost et al., 2016). Theoretical advances confirm that regret rates scale favorably with $m$, with practical algorithms such as MultiRUCB and SelfSparring demonstrating concrete improvements in both regret and computational burden in large-scale deployments (Du et al., 2022, Sui et al., 2017).
Several open problems remain, including finite-time analyses for kernelized or dependent-arm models, robust multileaving methods minimizing distortion, and extensions to structured or non-stationary winner criteria (e.g., Borda or Copeland winners in the absence of a Condorcet winner) (Brost et al., 2016). The link between MDB and general winner-feedback bandits continues to attract attention, particularly for adversarial or non-stochastic settings (Gajane, 2024).