D-Tracking Arm-Pulling Strategy
- The paper demonstrates that the D-tracking sampling rule, paired with a likelihood-ratio stopping rule, asymptotically attains the lower bound on sample complexity for fixed-confidence best-arm identification.
- The methodology uses adaptive sampling with forced exploration to track estimated optimal sampling proportions, and a likelihood-ratio stopping rule to ensure rigorous error control.
- The strategy significantly enhances sample efficiency in multi-armed bandits, providing strong theoretical guarantees and practical application across exponential family models.
The D-tracking arm-pulling strategy is the sampling rule at the core of Track-and-Stop, an asymptotically optimal algorithm for the fixed-confidence best-arm identification problem in stochastic multi-armed bandit models. The strategy allocates samples across arms by adaptively tracking optimal sampling proportions, which are derived from a lower bound on sample complexity, and it is paired with a theoretically justified likelihood ratio-based stopping rule that enforces the prescribed error probability. It has become a foundational methodology for sample-efficient arm identification in multi-armed bandits and related sequential experimental design problems (Garivier et al., 2016).
1. Problem Formulation and Objectives
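The fixed-confidence best-arm identification framework considers $K$ arms, each governed by an unknown distribution from a one-parameter exponential family with mean $\mu_a$, $a \in \{1, \dots, K\}$; write $\mu = (\mu_1, \dots, \mu_K)$ and let $a^*(\mu)$ denote the arm with the largest mean. At each time $t$, a sampling rule selects an arm $A_t$ based on past observations, generating a reward $X_t$. The algorithm terminates at a stopping time $\tau_\delta$, outputting an estimated best arm $\hat{a}_{\tau_\delta}$. Given a confidence parameter $\delta \in (0, 1)$, a strategy is called $\delta$-PAC (Probably Approximately Correct) if $\mathbb{P}_\mu(\tau_\delta < \infty) = 1$ and $\mathbb{P}_\mu(\hat{a}_{\tau_\delta} \neq a^*(\mu)) \le \delta$ for every bandit model $\mu$ with a unique best arm. The objective is to minimize sample complexity, i.e., $\mathbb{E}_\mu[\tau_\delta]$, subject to the $\delta$-PAC constraint.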
2. Minimax Sample Complexity Lower Bound
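A fundamental result is a tight lower bound on expected sample complexity, stated as a max-min problem over sampling proportions and alternative models. Let $d(x, y)$ denote the Kullback-Leibler (KL) divergence between the distributions of the exponential family with means $x$ and $y$, and $\mathrm{Alt}(\mu) = \{\lambda : a^*(\lambda) \neq a^*(\mu)\}$ the set of alternate parameter vectors whose best arm $a^*(\lambda)$ differs from $a^*(\mu)$. The characteristic time $T^*(\mu)$ is given by
$$T^*(\mu)^{-1} = \sup_{w \in \Sigma_K} \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{a=1}^{K} w_a \, d(\mu_a, \lambda_a),$$
where $\Sigma_K = \{w \in \mathbb{R}_+^K : w_1 + \dots + w_K = 1\}$. For any $\delta$-PAC algorithm,
$$\mathbb{E}_\mu[\tau_\delta] \ge T^*(\mu) \, \mathrm{kl}(\delta, 1 - \delta),$$
where $\mathrm{kl}(x, y)$ is the binary relative entropy, with $\mathrm{kl}(\delta, 1 - \delta) \sim \log(1/\delta)$ for small $\delta$ (Garivier et al., 2016).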
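To make the size of the confidence-dependent factor concrete, the following minimal sketch evaluates the binary relative entropy $\mathrm{kl}(\delta, 1 - \delta)$ and compares it with $\log(1/\delta)$. The function names are illustrative assumptions, and the characteristic time is passed in as a plain number rather than computed.

```python
import math


def binary_kl(x, y):
    """Binary relative entropy kl(x, y) between Bernoulli(x) and Bernoulli(y).

    Assumes 0 < x < 1 and 0 < y < 1.
    """
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))


def sample_complexity_lower_bound(t_star, delta):
    """Lower bound T*(mu) * kl(delta, 1 - delta) on the expected stopping time.

    t_star is the characteristic time T*(mu), supplied by the caller.
    """
    return t_star * binary_kl(delta, 1 - delta)


if __name__ == "__main__":
    # The ratio kl(delta, 1 - delta) / log(1 / delta) approaches 1 as delta shrinks.
    for delta in (1e-1, 1e-2, 1e-4, 1e-8):
        ratio = binary_kl(delta, 1 - delta) / math.log(1 / delta)
        print(f"delta={delta:g}  kl/log(1/delta)={ratio:.4f}")
```

For moderate values of $\delta$ the factor is noticeably smaller than $\log(1/\delta)$, which is why the bound is usually quoted in its asymptotic form.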
3. Optimal Sampling Proportions and Characterization
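Assuming w.l.o.g. $\mu_1 > \mu_2 \ge \dots \ge \mu_K$, the optimal allocation vector $w^*(\mu)$, which attains the supremum defining $T^*(\mu)^{-1}$ over the probability simplex $\Sigma_K$, is characterized by
$$w^*(\mu) = \arg\max_{w \in \Sigma_K} \min_{a \neq 1} \, (w_1 + w_a) \, I_{\frac{w_1}{w_1 + w_a}}(\mu_1, \mu_a),$$
and $I_\alpha(\mu_1, \mu_a) = \alpha \, d(\mu_1, \alpha \mu_1 + (1 - \alpha) \mu_a) + (1 - \alpha) \, d(\mu_a, \alpha \mu_1 + (1 - \alpha) \mu_a)$, with $d$ the KL divergence. The unique $w^*(\mu)$ equates the terms inside the minimum across all $a \neq 1$. Efficient calculation of $w^*(\mu)$ reduces to root-finding for a one-dimensional function (Garivier et al., 2016).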
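One possible way to carry out this one-dimensional root-finding is sketched below. It assumes the parameterization $x_a = w_a / w_1$, under which the equalized terms read $w_1 g_a(x_a)$ with $g_a(x) = d(\mu_1, m_a(x)) + x \, d(\mu_a, m_a(x))$ and $m_a(x) = (\mu_1 + x \mu_a)/(1 + x)$; the common level $y^*$ is then the root of a scalar function $F(y) - 1$. The names `optimal_weights`, `kl_bernoulli`, and `_bisect` are illustrative, plain bisection is used instead of a faster solver, and Bernoulli arms are the default model.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def _bisect(f, lo, hi, iters=100):
    """Root of an increasing function f on [lo, hi] by plain bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def optimal_weights(mu, kl=kl_bernoulli):
    """Return (w, value): the allocation w*(mu) and the value T*(mu)^{-1}."""
    best = max(range(len(mu)), key=lambda a: mu[a])
    others = [a for a in range(len(mu)) if a != best]
    if any(mu[a] == mu[best] for a in others):
        raise ValueError("the best arm must be unique")

    def m(a, x):
        # Weighted mean of the best arm and arm a, with relative weight x on arm a.
        return (mu[best] + x * mu[a]) / (1 + x)

    def g(a, x):
        # Increasing in x, from 0 up to kl(mu[best], mu[a]).
        return kl(mu[best], m(a, x)) + x * kl(mu[a], m(a, x))

    def x_of(a, y):
        # Inverse of g(a, .): grow a bracket by doubling, then bisect.
        hi = 1.0
        while g(a, hi) < y and hi < 1e12:
            hi *= 2
        return _bisect(lambda x: g(a, x) - y, 0.0, hi)

    def big_f(y):
        # Scalar function whose root F(y) = 1 gives the common level y*.
        total = 0.0
        for a in others:
            x = x_of(a, y)
            den = kl(mu[a], m(a, x))
            if den == 0.0:
                return math.inf
            total += kl(mu[best], m(a, x)) / den
        return total

    y_max = min(kl(mu[best], mu[a]) for a in others)
    y_star = _bisect(lambda y: big_f(y) - 1.0, 0.0, y_max)

    x = {a: x_of(a, y_star) for a in others}
    norm = 1.0 + sum(x.values())
    w = [0.0] * len(mu)
    w[best] = 1.0 / norm
    for a in others:
        w[a] = x[a] / norm
    return w, w[best] * y_star


if __name__ == "__main__":
    weights, value = optimal_weights([0.5, 0.45, 0.3])
    print("w* =", [round(v, 4) for v in weights], " T* =", round(1 / value, 1))
```

The nested bisection keeps the sketch dependency-free; a production implementation would more likely use a bracketing solver from a numerical library and cache the inner inversions.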
4. D-Tracking Sampling Rule
The D-tracking sampling rule combines two mechanisms: forced exploration and proportion tracking. Forced exploration ensures that each arm has been sampled roughly $\sqrt{t}$ times by round $t$, which guarantees convergence of the empirical means and addresses early-stage uncertainty. Whenever the forced exploration requirement is met, the algorithm tracks the estimated optimal proportions:
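- Compute empirical means $\hat{\mu}(t) = (\hat{\mu}_1(t), \dots, \hat{\mu}_K(t))$.
- Solve for $w^*(\hat{\mu}(t))$ as above.
- Pull at each round the arm maximizing $t \, w^*_a(\hat{\mu}(t)) - N_a(t)$, where $N_a(t)$ is the number of times arm $a$ has been pulled.
This scheme ensures that, almost surely, the empirical pull proportions $N_a(t)/t$ converge to $w^*_a(\mu)$ as $t \to \infty$ (Garivier et al., 2016).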
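The arm-selection step described above can be sketched as follows. The forced-exploration test used here, pulling the least-sampled arm among those with fewer than $\sqrt{t} - K/2$ pulls, is one concrete schedule and should be read as an assumption of this sketch; the function names are illustrative. To keep the example self-contained, the plug-in weights are passed in as a fixed vector, whereas the actual rule would recompute them from the empirical means at every round.

```python
import math


def d_tracking_next_arm(counts, plug_in_weights):
    """Choose the next arm from pull counts N_a(t) and plug-in weights w*(mu_hat(t))."""
    t = sum(counts)
    k = len(counts)
    # Forced exploration: arms whose count lags behind sqrt(t) - K/2 (assumed schedule).
    threshold = math.sqrt(t) - k / 2
    under_sampled = [a for a in range(k) if counts[a] < threshold]
    if under_sampled:
        return min(under_sampled, key=lambda a: counts[a])
    # Tracking: pull the arm whose count is furthest below its target t * w_a.
    return max(range(k), key=lambda a: t * plug_in_weights[a] - counts[a])


def simulate_tracking(weights, rounds):
    """Run the rule with a fixed weight vector and return the final pull counts."""
    counts = [0] * len(weights)
    for _ in range(rounds):
        counts[d_tracking_next_arm(counts, weights)] += 1
    return counts


if __name__ == "__main__":
    target = [0.5, 0.3, 0.2]
    final = simulate_tracking(target, 2000)
    print([c / 2000 for c in final])  # empirical proportions, close to the target
```

With a fixed target the empirical proportions settle near the target vector, which is the deterministic counterpart of the almost-sure convergence stated in the text.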
5. Likelihood Ratio and Stopping Rule
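The stopping rule employs a generalized likelihood ratio statistic for each pair of arms $(a, b)$:
$$Z_{a,b}(t) = \log \frac{\max_{\mu'_a \ge \mu'_b} p_{\mu'_a}(X^a_{N_a(t)}) \, p_{\mu'_b}(X^b_{N_b(t)})}{\max_{\mu'_a \le \mu'_b} p_{\mu'_a}(X^a_{N_a(t)}) \, p_{\mu'_b}(X^b_{N_b(t)})},$$
where $p_{\mu'}(X^a_{N_a(t)})$ is the likelihood of the first $N_a(t)$ samples from arm $a$ given mean $\mu'$. In exponential families, if $\hat{\mu}_a(t) \ge \hat{\mu}_b(t)$,
$$Z_{a,b}(t) = N_a(t) \, d(\hat{\mu}_a(t), \hat{\mu}_{a,b}(t)) + N_b(t) \, d(\hat{\mu}_b(t), \hat{\mu}_{a,b}(t)),$$
where $\hat{\mu}_{a,b}(t) = \frac{N_a(t) \hat{\mu}_a(t) + N_b(t) \hat{\mu}_b(t)}{N_a(t) + N_b(t)}$ is the pooled mean. The stopping time is
$$\tau_\delta = \inf\{t \in \mathbb{N} : \max_{a} \min_{b \neq a} Z_{a,b}(t) > \beta(t, \delta)\},$$
with a threshold $\beta(t, \delta)$. Upon stopping, the algorithm recommends the arm with the largest empirical mean. With a suitably chosen threshold, this rule ensures the $\delta$-PAC property holds for any sampling procedure (Garivier et al., 2016).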
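A minimal sketch of the stopping check for Bernoulli arms follows. It uses the fact that the pairwise statistic is non-positive whenever the first arm of the pair has the smaller empirical mean, so the max-min is attained at the empirical best arm and only its challengers need to be scanned. The threshold $\log(2t(K-1)/\delta)$ is an assumed concrete choice for illustration, since the text above only specifies a generic threshold; all function names are hypothetical.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def glr_statistic(counts, means, kl=kl_bernoulli):
    """Return (best, z): empirical best arm and min over challengers of Z_{best,b}.

    Assumes every arm has been pulled at least once.
    """
    best = max(range(len(means)), key=lambda a: means[a])
    z = math.inf
    for b in range(len(means)):
        if b == best:
            continue
        n_a, n_b = counts[best], counts[b]
        pooled = (n_a * means[best] + n_b * means[b]) / (n_a + n_b)
        z_ab = n_a * kl(means[best], pooled) + n_b * kl(means[b], pooled)
        z = min(z, z_ab)
    return best, z


def threshold(t, delta, k):
    """Assumed threshold beta(t, delta) = log(2 t (K - 1) / delta)."""
    return math.log(2 * t * (k - 1) / delta)


def should_stop(counts, means, delta):
    """Return (stop, recommended_arm) for the current counts and empirical means."""
    best, z = glr_statistic(counts, means)
    return z > threshold(sum(counts), delta, len(counts)), best


if __name__ == "__main__":
    print(should_stop([100, 100], [0.9, 0.1], delta=0.05))
    print(should_stop([10, 10], [0.5, 0.5], delta=0.05))
```

Because the check depends only on pull counts and empirical means, it can be attached to any sampling rule without modification.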
6. Asymptotic Optimality and Theoretical Guarantees
The Track-and-Stop algorithm, combining D-tracking sampling with the above stopping rule, achieves asymptotic sample complexity matching the lower bound: $\mathbb{E}_\mu[\tau_\delta] / \log(1/\delta) \to T^*(\mu)$ as $\delta \to 0$. The theoretical guarantees follow from the design: forced exploration ensures statistical consistency, while likelihood ratio thresholds provide tight error control. Practical variants, such as the Best-Challenger variant that considers only the empirical champion and its strongest rival, offer computational efficiency without sacrificing asymptotic performance (Garivier et al., 2016).
7. Implementation Considerations and Practical Aspects
Implementation of D-tracking involves repeated root-finding for $w^*(\hat{\mu}(t))$, efficiently accomplished using bisection or Newton methods. Forced exploration can be implemented with various sublinear schedules, and the likelihood ratio computations are numerically stable due to their reliance on empirical means and cumulative counts. Open-source Julia code for the strategy is available. Threshold tuning can yield practical speed-ups at negligible risk to PAC guarantees. The modular structure enables adaptation to settings such as best-m arm identification, adversarial bandits, and various exponential family reward models (Garivier et al., 2016).