Dueling-Bandit Selection: Sparse and Efficient Methods
- Dueling-bandit selection is the process of identifying optimal items through noisy pairwise duels, emphasizing relative feedback over direct rewards.
- Algorithms employ winner criteria such as Borda, Copeland, and Condorcet, with methods like SECS exploiting sparsity to efficiently eliminate suboptimal choices.
- Theoretical guarantees and empirical evaluations demonstrate reduced sample complexity and robust performance in both synthetic simulations and real-world ranking tasks.
Dueling-bandit selection is the problem of sequentially identifying optimal or near-optimal items from a finite set based solely on noisy pairwise comparisons, rather than absolute reward signals. This setting is fundamental in interactive learning, online ranking, and preference elicitation, where only relative (comparative) feedback is available. The field covers sample-optimal pure-exploration and regret-minimization algorithms under a variety of winner criteria, including the Borda, Copeland, and Condorcet notions, with specialized methodologies leveraging structure such as sparsity or transitivity in the underlying preference matrix.
1. Problem Formulation and Winner Criteria
Consider $n$ arms indexed by $[n] = \{1, \dots, n\}$. Each action is a pairwise "duel" between distinct arms $i$ and $j$, yielding a random outcome $Z_{i,j} \sim \mathrm{Bernoulli}(P_{i,j})$, where $P_{i,j} = \Pr(i \text{ beats } j)$ and $P_{i,j} + P_{j,i} = 1$. The collection $P = [P_{i,j}]$ is called the preference matrix; it is generally unknown and fixed, but possibly subject to regularity constraints, such as the existence of a unique Borda winner, for identification purposes (Jamieson et al., 2015).
Winner Criteria:
- Borda Winner: The unique arm maximizing the Borda score $s_i = \frac{1}{n-1} \sum_{j \neq i} P_{i,j}$. Arm $i^*$ is the Borda winner if $s_{i^*} > s_j$ for all $j \neq i^*$.
- Copeland Winner: For each arm $i$, define the Copeland score as the fraction of arms $j$ that $i$ beats (i.e., $P_{i,j} > 1/2$), normalized by $n-1$. The Copeland winner(s) maximize this score. Unlike the Condorcet winner, a Copeland winner always exists.
- Condorcet Winner: An arm $i^*$ such that $P_{i^*,j} > 1/2$ for all $j \neq i^*$. A Condorcet winner may not exist, but Borda and Copeland winners exist under mild assumptions (Zoghi et al., 2015).
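As a concrete illustration of these criteria, the following sketch computes all three winners for a hypothetical 4-arm preference matrix (the matrix values are invented for illustration). It also shows that the criteria need not agree: here the Condorcet and Copeland winner is arm 0, while arm 1 has the higher Borda score.

```python
import numpy as np

# Hypothetical preference matrix; P[i, j] = Pr(arm i beats arm j).
# Diagonal is 1/2 by convention, and P[i, j] + P[j, i] = 1.
P = np.array([
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.5, 0.9, 0.9],
    [0.3, 0.1, 0.5, 0.6],
    [0.2, 0.1, 0.4, 0.5],
])

def borda_scores(P):
    """s_i = (1/(n-1)) * sum_{j != i} P[i, j]."""
    n = P.shape[0]
    return (P.sum(axis=1) - 0.5) / (n - 1)  # subtract the diagonal P[i, i] = 1/2

def copeland_scores(P):
    """Fraction of opponents j with P[i, j] > 1/2."""
    n = P.shape[0]
    return (P > 0.5).sum(axis=1) / (n - 1)  # diagonal 0.5 never counts as a win

def condorcet_winner(P):
    """Return the arm beating every other arm head-to-head, or None."""
    n = P.shape[0]
    for i in range(n):
        if all(P[i, j] > 0.5 for j in range(n) if j != i):
            return i
    return None
```

Arm 0 beats each opponent head-to-head (every off-diagonal entry of its row exceeds $1/2$), yet arm 1's larger margins against arms 2 and 3 give it the higher Borda score, $s_1 = (0.4 + 0.9 + 0.9)/3 \approx 0.733$ versus $s_0 = 0.7$.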
The goal can be pure exploration (best-arm identification with high confidence and minimal samples) or regret minimization (minimize cumulative gap to optimality over time).
2. Sample Complexity and Lower Bounds
Borda Selection: Minimax Rates and Optimality
In the absence of special structure, achieving $\delta$-PAC selection of the Borda winner via the Borda reduction requires on the order of
$O\!\left(\sum_{i=2}^{n} \Delta_i^{-2} \log(n/\delta)\right)$
duels, where $\Delta_i = s_1 - s_i$ and arm 1 is the unique Borda winner (Jamieson et al., 2015). The lower bound matches this up to constant and lower-order logarithmic terms:
$\Omega\!\left(\sum_{i=2}^{n} \Delta_i^{-2} \log(1/\delta)\right).$
Thus, in the generic (non-structured) case, the minimax sample complexity is $\Theta\!\left(\sum_{i=2}^{n} \Delta_i^{-2}\right)$ up to logarithmic factors.
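To see where quadratic-in-$n$ scaling can come from, consider a toy evaluation of the leading $\sum_{i \ge 2} \Delta_i^{-2}$ term for a hypothetical instance in which the runner-up trails the winner through a single discriminative comparison, so its Borda gap is diluted to $c/(n-1)$; the constants $c$ and $d$ below are invented for illustration.

```python
def borda_gap_sum(n, c=0.3, d=0.2):
    """Leading sum_{i>=2} Delta_i^{-2} term for a hypothetical instance:
    the runner-up differs from the winner on one comparison of size c
    (so Delta_2 = c / (n - 1)); all other arms have constant gap d."""
    delta_2 = c / (n - 1)
    return delta_2 ** -2 + (n - 2) * d ** -2

# The single "hard" arm dominates and makes the sum grow quadratically in n.
ratio = borda_gap_sum(1000) / borda_gap_sum(100)  # close to (1000/100)^2 = 100
```

The ratio is close to $100$, i.e., a tenfold increase in $n$ costs roughly a hundredfold increase in duels for the standard Borda reduction on such instances.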
Impact of Sparsity
A notable advance is the exploitation of "sparsity" in the comparison matrix. Let $\delta_{i,j} = P_{1,j} - P_{i,j}$ for each suboptimal arm $i$ (assuming arm 1 is the unique Borda winner). If, for each $i$, the gap vector $(\delta_{i,j})_j$ is $\gamma$-approximately $k$-sparse—i.e., the sum of discriminative gaps outside the $k$ largest entries is at most $\gamma$ times the partial gap over those $k$ indices—then the leading $\Delta_i^{-2}$ term in the sample complexity can be reduced by a factor of roughly $(k/(n-1))^2$ (Jamieson et al., 2015). That is, only on the order of $\bigl(k/((n-1)\Delta_i)\bigr)^2$ duels are needed per hard arm.
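The arithmetic behind the $(k/(n-1))^2$ reduction can be checked directly: restricting attention to the $k$ discriminative comparisons avoids diluting the gap over all $n-1$ opponents. The numbers below are hypothetical, with $k = 1$ for simplicity.

```python
n = 500   # number of arms (hypothetical)
c = 0.3   # size of the single discriminative pairwise gap (hypothetical)
k = 1     # the runner-up differs from the winner on exactly k comparisons

# Full Borda gap: the lone discrepancy is averaged over all n - 1 opponents.
delta_full = c / (n - 1)
# Gap seen by the sparse test on the discriminative comparison: undiluted.
delta_partial = c

full_cost = delta_full ** -2       # ~ duels to resolve the full Borda gap
sparse_cost = delta_partial ** -2  # ~ duels to resolve the partial gap
improvement = full_cost / sparse_cost  # equals ((n - 1) / k) ** 2 for k = 1
```

For $n = 500$ this is an improvement factor of $499^2 \approx 2.5 \times 10^5$ on the hard arm, which is exactly the $(k/(n-1))^2$ reduction of the leading term described above.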
3. Algorithmic Approaches: SECS and Successive Elimination
The SECS Algorithm
Successive Elimination with Comparison Sparsity (SECS) augments classical successive elimination with a sparse test. At each round, the algorithm either:
- Applies standard Borda elimination: if the empirical Borda gap for an arm $i$ exceeds its confidence radius, eliminate $i$.
- Applies a sparse partial-gap test: evaluate empirical pairwise gaps on the $k$ most discriminative comparison arms for $i$. If the partial gap exceeds a threshold, eliminate $i$.
Pseudocode Outline [algorithmic core from (Jamieson et al., 2015)]:
- Input: sparsity level $k$, confidence $\delta$, (optionally) a time-gate $T_0$
- Maintain an active set $A \leftarrow [n]$; at each time step $t$:
  - For each $i \in A$, sample a duel against a uniformly random opponent $j \neq i$.
  - For $i \in A$, eliminate $i$ if there exists $i' \in A$ such that:
    - (a) $t > T_0$ and the partial empirical gap between $i'$ and $i$ over the $k$ maximal-discrepancy arms exceeds its confidence threshold, or
    - (b) the full empirical gap in Borda estimates is large enough: $\hat{s}_{i'} - \hat{s}_i$ exceeds the combined confidence radii.
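The outline above can be turned into a minimal runnable sketch. The confidence radii below are simplified Hoeffding-style bounds, not the exact thresholds of Jamieson et al. (2015); the sparse test is only sound when the sparsity assumption on $P$ holds, and the function and its defaults are illustrative.

```python
import numpy as np

def secs(P, k=1, delta=0.1, max_rounds=100_000, seed=0):
    """Sketch of SECS: successive elimination with a sparse partial-gap test.

    P[i, j] = Pr(arm i beats arm j). Returns the surviving arm index.
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    active = list(range(n))
    wins = np.zeros((n, n))    # wins[i, j]: duels i initiated against j and won
    counts = np.zeros((n, n))  # counts[i, j]: duels i initiated against j
    for t in range(1, max_rounds + 1):
        for i in active:       # one duel per active arm vs. a random opponent
            j = int(rng.integers(n - 1))
            j += j >= i        # uniform over opponents j != i
            counts[i, j] += 1
            wins[i, j] += rng.random() < P[i, j]
        phat = wins / np.maximum(counts, 1)
        np.fill_diagonal(phat, 0.5)
        s_hat = (phat.sum(axis=1) - 0.5) / (n - 1)  # empirical Borda scores
        # Simplified per-entry Hoeffding radii (not the paper's constants).
        conf = np.sqrt(np.log(4 * n * n * (t + 1) ** 2 / delta)
                       / (2 * np.maximum(counts, 1)))
        np.fill_diagonal(conf, 0.0)
        borda_rad = conf.sum(axis=1) / (n - 1)
        keep = []
        for i in active:
            out = False
            for ip in active:
                if ip == i:
                    continue
                # (b) full Borda-gap elimination test
                if s_hat[ip] - s_hat[i] > borda_rad[ip] + borda_rad[i]:
                    out = True
                    break
                # (a) sparse partial-gap test over the k largest discrepancies;
                # only valid under the sparsity assumption on P
                others = [j for j in range(n) if j != i and j != ip]
                gap = (phat[ip, others] - phat[i, others]
                       - conf[ip, others] - conf[i, others])
                if np.sort(gap)[-k:].sum() > 0:
                    out = True
                    break
            if not out:
                keep.append(i)
        active = keep
        if len(active) <= 1:
            break
    return active[0] if active else None
```

On a small instance with a dominant arm, either test eventually eliminates the suboptimal arms; on a genuinely sparse preference matrix, test (a) fires much earlier than (b), which is the source of the sample savings.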
Theoretical Guarantees
Under the sparsity condition (with $\gamma$ bounded by a suitable constant), SECS identifies the Borda winner with high probability using, up to constants and lower-order terms, on the order of
$\sum_{i=2}^{n} \left(\frac{k}{(n-1)\Delta_i}\right)^2 \log(n/\delta)$
samples for the arms covered by the sparsity condition (Jamieson et al., 2015). For $k \ll n$ and moderate gaps, this provides roughly an $((n-1)/k)^2$ improvement in sample complexity relative to the non-sparse baseline.
If $k = n - 1$, SECS reduces to standard successive elimination (full Borda reduction), matching known minimax rates (Jamieson et al., 2015).
4. Empirical Evaluation and Practical Impact
Synthetic and Real-World Data
- Synthetic case (P1 matrix): For $n$ arms with a top-2 group distinguished only by comparisons against one other arm ($k = 1$ sparsity), SECS achieves $O(n)$ sample complexity, versus $O(n^2)$ for standard Borda reduction; the empirical log-log slopes are 2 (standard) vs. 1 (SECS).
- Real data (MSLR-WEB10k, MQ2008): In web ranking settings (arms as rankers/features), SECS with a small sparsity parameter $k$ substantially reduced the number of necessary duels compared to standard algorithms, with performance smoothly degrading to the full Borda cost as $k \to n$. In all cases with unique Borda winners, SECS reliably identified the correct arm.
Practical Guidance
- When to use sparsity-based methods: If top candidates are only separated on a small number of discriminative arms/attributes, SECS yields substantial gains.
- Parameter tuning: A small $k$ suffices for many real datasets, and conservative default thresholds are often acceptable; set the confidence level $\delta$ as in any PAC algorithm.
- Otherwise: Use the standard Borda reduction—successive elimination based on Borda scores—if no structure is apparent.
5. Broader Connections and Related Developments
Extensions to Other Winner Criteria and Problem Structures
- Copeland and Condorcet settings: Algorithms such as Copeland Confidence Bound (CCB) and Scalable Copeland Bandits (SCB) generalize to broader winner criteria and retain regret guarantees logarithmic in the horizon $T$ without requiring a Condorcet winner (Zoghi et al., 2015).
- Multi-dueling and Battling Bandits: Recent frameworks allow comparing larger subsets at each timestep and exploit richer feedback; however, the fundamental sample complexity under winner-of-subset feedback is not improved unless richer ranking information is available (Saha et al., 2018). Under pure winner-information feedback, the $\Omega\!\left((n/\epsilon^2)\log(1/\delta)\right)$ lower bound matches the ordinary dueling-bandit setting, regardless of subset size.
- Combinatorial Structures: In combinatorial pure exploration settings (e.g., best matching in a bipartite graph), Borda-type objectives admit reductions to CPE-MAB settings, and Condorcet objectives require min-max combinatorial strategies (Chen et al., 2020).
Algorithmic Summary Table
| Method | Structure Leveraged | Sample Complexity / Regret |
|---|---|---|
| Borda Reduction | None | $O\!\left(\sum_{i \ge 2} \Delta_i^{-2} \log(n/\delta)\right)$ samples |
| SECS (Sparse) | $(\gamma, k)$-approximate sparsity | $O\!\left(\sum_{i \ge 2} (k/((n-1)\Delta_i))^2 \log(n/\delta)\right)$ samples |
| CCB/SCB (Copeland) | None/minimal | $O(\log T)$ regret for CCB/SCB under mild assumptions |
| PAC-Battling Bandits | Plackett-Luce / subset feedback | Winner info: $\Theta\!\left((n/\epsilon^2)\log(1/\delta)\right)$; Top-$m$: $\Theta\!\left((n/(m\epsilon^2))\log(1/\delta)\right)$ |
6. Theoretical and Practical Implications
Sparsity-aware algorithms such as SECS highlight that structural properties of the preference matrix can be exploited for substantial reductions in sample requirements, resulting in near-optimal practical and theoretical performance in large-scale systems with moderate discriminative structure. In the absence of such structure, classical reduction to multi-armed bandit techniques yields the minimax bounds. The study of dueling-bandit selection serves as both a methodological foundation and a unifying paradigm for relative-feedback online decision-making in a wide range of applications.
7. References
- "Sparse Dueling Bandits" (Jamieson et al., 2015)
- "Copeland Dueling Bandits" (Zoghi et al., 2015)
- "PAC Battling Bandits in the Plackett-Luce Model" (Saha et al., 2018)
- "Combinatorial Pure Exploration of Dueling Bandit" (Chen et al., 2020)
These works collectively establish both tight theoretical rates and practical guidelines for sample-efficient selection in preference-based online learning frameworks.