D-Tracking Arm-Pulling Strategy

Updated 2 February 2026

The paper shows that Track-and-Stop with D-tracking asymptotically matches the instance-dependent lower bound on sample complexity for fixed-confidence best-arm identification.
The method combines adaptive sampling with forced exploration, tracks the estimated optimal sampling proportions, and stops using a generalized likelihood ratio statistic that controls the error probability.
The guarantees apply to multi-armed bandits whose arms belong to a one-parameter exponential family.

The D-tracking arm-pulling strategy is the sampling rule of Track-and-Stop, an asymptotically optimal approach to the fixed-confidence best-arm identification problem in stochastic multi-armed bandit models. It allocates samples by adaptively tracking the optimal sampling proportions that appear in the lower bound on sample complexity, and it is paired with a likelihood ratio-based stopping rule that keeps the error probability below a prescribed level. It has become a foundational methodology for sample-efficient arm identification in multi-armed bandits and related sequential experimental design problems (Garivier et al., 2016).

1. Problem Formulation and Objectives

The fixed-confidence best-arm identification framework considers $K$ arms, each governed by an unknown distribution $\nu_a$ from a one-parameter exponential family with mean $\mu_a$ . At each time $t$ , a sampling rule selects an arm $A_t$ based on past observations, generating a reward $X_t$ . The algorithm terminates at a stopping time $\tau$ , outputting an estimated best arm $\hat{a}_\tau$ . Given a confidence parameter $\delta \in (0,1)$ , a strategy is called $\delta$ -PAC (Probably Approximately Correct) if $\mathbb{P}_\mu(\tau<\infty)=1$ and $\mathbb{P}_\mu(\hat{a}_\tau \neq \arg\max_a \mu_a) \leq \delta$ . The objective is to minimize sample complexity, i.e., $\mathbb{E}_\mu[\tau]$ , subject to the $\delta$ -PAC constraint.

2. Minimax Sample Complexity Lower Bound

A fundamental result is a tight, instance-dependent lower bound on the expected sample complexity, expressed through a max-min optimization problem. Let $d(\mu, \lambda)$ denote the Kullback-Leibler (KL) divergence between the distributions of the exponential family with means $\mu$ and $\lambda$, and let $\text{Alt}(\mu)$ be the set of alternative mean vectors whose best arm differs from that of $\mu$. The characteristic time $T^*(\mu)$ is given by

$(T^*(\mu))^{-1} = \sup_{w \in \Sigma_K}\inf_{\lambda \in \text{Alt}(\mu)} \sum_{a=1}^K w_a d(\mu_a, \lambda_a)$

where $\Sigma_K = \{w \in \mathbb{R}_+^K: \sum_a w_a = 1\}$ . For any $\delta$ -PAC algorithm,

$\mathbb{E}_\mu[\tau] \geq T^*(\mu) \cdot \text{kl}(\delta, 1-\delta)$

where $\text{kl}(\delta,1-\delta) \approx \log(1/\delta)$ for small $\delta$ (Garivier et al., 2016).
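As a small numerical illustration of the bound above, the following sketch evaluates $\text{kl}(\delta, 1-\delta)$ and compares the resulting lower bound with the $\log(1/\delta)$ approximation. The function names and the characteristic time of 100 are hypothetical and chosen only for illustration.

```python
import math

def kl_binary(p, q):
    """Binary relative entropy kl(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def sample_complexity_lower_bound(t_star, delta):
    """Lower bound T*(mu) * kl(delta, 1 - delta) on the expected stopping time."""
    return t_star * kl_binary(delta, 1 - delta)

# Illustrative value only: t_star = 100.0 is a made-up characteristic time.
T_STAR = 100.0
for delta in (0.1, 0.01, 0.001):
    exact = sample_complexity_lower_bound(T_STAR, delta)
    approx = T_STAR * math.log(1 / delta)
    print(f"delta={delta}: bound={exact:.1f}, T* log(1/delta)={approx:.1f}")
```

Since $\text{kl}(\delta, 1-\delta) = (1-2\delta)\log((1-\delta)/\delta)$, the two columns get closer in relative terms as $\delta$ shrinks.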

3. Optimal Sampling Proportions and Characterization

Assuming w.l.o.g. $\mu_1 > \mu_2 \geq \cdots \geq \mu_K$, the optimal allocation vector $w^*(\mu)$, i.e. the maximizer in the definition of $T^*(\mu)$, is characterized by

$w^*(\mu) = \arg\max_{w \in \Sigma_K} \min_{a \neq 1} f_a(w), \text{ where } f_a(w) = (w_1 + w_a) I_{w_1/(w_1+w_a)}(\mu_1, \mu_a)$

and $I_\alpha(x, y) = \alpha d(x, \alpha x + (1 - \alpha) y) + (1 - \alpha) d(y, \alpha x + (1 - \alpha) y)$. At the unique maximizer $w^*$, the values $f_a(w^*)$ are equal for all $a \neq 1$. Efficient calculation of $w^*$ reduces to root-finding for a one-dimensional function $F(y)$ obtained from the first-order optimality conditions on the $f_a$ (Garivier et al., 2016).
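The computation of $w^*$ can be sketched with nested bisection. This is a minimal illustration rather than the reference implementation, and it relies on assumptions not spelled out in the text: with $g_a(x) = d(\mu_1, m) + x\, d(\mu_a, m)$ and $m = (\mu_1 + x\mu_a)/(1+x)$, it takes $F(y) = \sum_{a \neq 1} d(\mu_1, m_a)/d(\mu_a, m_a)$ evaluated at $x_a(y) = g_a^{-1}(y)$, solves $F(y) = 1$, and normalizes the $x_a$ with $x_1 = 1$. All function names are hypothetical, and Bernoulli arms are used by default.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def bisect_root(func, lo, hi, iters=80):
    """Approximate root of an increasing function on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def optimal_weights(mu, d=kl_bernoulli):
    """Return (w_star, t_star) for a mean vector mu with a unique best arm."""
    best = max(range(len(mu)), key=lambda a: mu[a])
    others = [a for a in range(len(mu)) if a != best]

    def mix(a, x):
        # Weighted mean of the best arm (weight 1) and arm a (weight x).
        return (mu[best] + x * mu[a]) / (1 + x)

    def g(a, x):
        # Assumed form: g_a(x) = d(mu_best, m) + x * d(mu_a, m), increasing in x.
        m = mix(a, x)
        return d(mu[best], m) + x * d(mu[a], m)

    def x_of(a, y):
        # Invert g_a at level y: grow a bracket by doubling, then bisect.
        hi = 1.0
        while g(a, hi) < y and hi < 1e12:
            hi *= 2.0
        return bisect_root(lambda x: g(a, x) - y, 0.0, hi)

    def big_f(y):
        # Assumed form of F(y); the tiny floor guards against division by zero.
        total = 0.0
        for a in others:
            m = mix(a, x_of(a, y))
            total += d(mu[best], m) / max(d(mu[a], m), 1e-300)
        return total

    y_max = min(d(mu[best], mu[a]) for a in others)
    y_star = bisect_root(lambda y: big_f(y) - 1.0, 0.0, y_max)
    x = [1.0 if a == best else x_of(a, y_star) for a in range(len(mu))]
    total_x = sum(x)
    w = [xa / total_x for xa in x]
    # At the optimum, f_a(w*) = w_best * y_star for every suboptimal arm a.
    return w, 1.0 / (w[best] * y_star)

if __name__ == "__main__":
    weights, t_star = optimal_weights([0.5, 0.45, 0.43, 0.4])
    print(weights, t_star)
```

Passing another divergence through the `d` argument, for example `lambda x, y: (x - y) ** 2 / 2` for unit-variance Gaussian arms, reuses the same solver. For two such arms the result should be the uniform allocation.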

4. D-Tracking Sampling Rule

The D-tracking sampling rule combines forced exploration with proportion tracking. Forced exploration ensures that each arm has been sampled on the order of $\sqrt{t}$ times by round $t$, so that the empirical means converge to the true means and the plug-in estimate $w^*(\hat{\mu})$ becomes reliable. In rounds where no arm is under-sampled, the algorithm tracks the estimated optimal proportions:

Compute the empirical means $\hat{\mu}_a$.
Solve for $w^*(\hat{\mu})$ as above.
Pull the arm maximizing $t \cdot w^*_a(\hat{\mu}) - N_a$, where $N_a$ is the number of times arm $a$ has been pulled so far.

This scheme ensures that, almost surely, the empirical sampling proportions $N_a/t$ converge to $w^*_a(\mu)$ as $t \to \infty$ (Garivier et al., 2016).
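The arm-selection step described above can be sketched as a single function. The function name is hypothetical, and the exact forced-exploration cutoff $\sqrt{t} - K/2$ is an assumption made for this sketch; any cutoff of order $\sqrt{t}$ matches the description. The plug-in proportions `w_hat` are assumed to be computed elsewhere from the empirical means.

```python
import math

def d_tracking_choice(counts, w_hat):
    """Pick the next arm from pull counts N_a and plug-in proportions w*(mu_hat)."""
    t = sum(counts)
    k = len(counts)
    # Forced exploration (assumed cutoff): arms pulled fewer than sqrt(t) - K/2 times.
    cutoff = math.sqrt(t) - k / 2
    starved = [a for a in range(k) if counts[a] < cutoff]
    if starved:
        # Pull the least-sampled arm among the under-sampled ones.
        return min(starved, key=lambda a: counts[a])
    # Tracking: pull the arm whose count lags furthest behind t * w_a.
    return max(range(k), key=lambda a: t * w_hat[a] - counts[a])

if __name__ == "__main__":
    print(d_tracking_choice([10, 10, 0], [0.4, 0.4, 0.2]))   # forced exploration
    print(d_tracking_choice([50, 30, 20], [0.3, 0.5, 0.2]))  # tracking
```

In the first call arm 2 has never been pulled and falls below the cutoff, so it is selected regardless of `w_hat`. In the second call no arm is under-sampled, and arm 1 has the largest deficit $t \cdot w_a - N_a$.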

5. Likelihood Ratio and Stopping Rule

The stopping rule employs a generalized likelihood ratio statistic for each pair $(a, b)$:

$Z_{ab}(t) = \log\frac{\sup_{\mu'_a \geq \mu'_b} L_a(t;\mu'_a)L_b(t;\mu'_b)}{\sup_{\mu'_a \leq \mu'_b} L_a(t;\mu'_a)L_b(t;\mu'_b)}$

where $L_a(t; \mu)$ is the likelihood of the first $N_a$ samples from arm $a$ given mean $\mu$. In exponential families, if $\hat{\mu}_a > \hat{\mu}_b$,

$Z_{ab}(t) = N_a d(\hat{\mu}_a, \hat{\mu}_{ab}) + N_b d(\hat{\mu}_b, \hat{\mu}_{ab})$

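where $\hat{\mu}_{ab} = (N_a \hat{\mu}_a + N_b \hat{\mu}_b)/(N_a + N_b)$ is the pooled mean, and $Z_{ab}(t) = -Z_{ba}(t)$ when $\hat{\mu}_a < \hat{\mu}_b$. The stopping time is

$\tau_\delta = \inf\left\{ t : \exists a\ \forall b \neq a,\ Z_{ab}(t) > \beta(t,\delta)\right\}$

with a threshold such as $\beta(t, \delta) = \log(2t(K-1)/\delta)$. Upon stopping, the algorithm recommends the arm with the largest empirical mean. With a suitably chosen threshold, this rule ensures the $\delta$-PAC property for any sampling procedure; the threshold above is the one analyzed for Bernoulli arms (Garivier et al., 2016).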
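The stopping check can be sketched as follows for Bernoulli arms. The function names are hypothetical. The sketch uses the fact that only the empirically best arm can have a positive statistic against every other arm, so it is enough to compare that arm with each rival. It assumes every arm has been pulled at least once.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def glr_statistic(n_a, mean_a, n_b, mean_b, d=kl_bernoulli):
    """Z_ab(t) for mean_a >= mean_b, computed with the pooled mean of both arms."""
    pooled = (n_a * mean_a + n_b * mean_b) / (n_a + n_b)
    return n_a * d(mean_a, pooled) + n_b * d(mean_b, pooled)

def should_stop(counts, means, delta, d=kl_bernoulli):
    """Return (stop, empirical_best) using the threshold log(2 t (K-1) / delta)."""
    k = len(counts)
    t = sum(counts)
    best = max(range(k), key=lambda a: means[a])
    if min(counts) == 0:
        # Not every arm has been observed yet, so the statistic is undefined.
        return False, best
    beta = math.log(2 * t * (k - 1) / delta)
    # Smallest statistic of the empirical best arm against each rival.
    z_min = min(
        glr_statistic(counts[best], means[best], counts[b], means[b], d)
        for b in range(k)
        if b != best
    )
    return z_min > beta, best

if __name__ == "__main__":
    print(should_stop([1000, 1000], [0.9, 0.1], delta=0.05))
    print(should_stop([5, 5], [0.6, 0.4], delta=0.05))
```

With 1000 pulls per arm and a large gap, the statistic far exceeds the threshold and the rule stops. With five pulls per arm and a small gap, it does not.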

6. Asymptotic Optimality and Theoretical Guarantees

The Track-and-Stop algorithm, which combines D-tracking sampling with the above stopping rule, achieves asymptotic sample complexity matching the lower bound: $\frac{\mathbb{E}_\mu[\tau]}{\log(1/\delta)} \to T^*(\mu)$ as $\delta \to 0$. These guarantees follow from the design: forced exploration makes the estimated proportions consistent, tracking drives the sampling proportions toward $w^*(\mu)$, and the likelihood ratio threshold controls the error probability. Practical variants, such as the Best-Challenger variant that considers only the empirical champion and its strongest rival, reduce the computation per round, although they are heuristics and the asymptotic optimality argument is stated for the tracking rule (Garivier et al., 2016).

7. Implementation Considerations and Practical Aspects

Implementation of D-tracking involves repeated root-finding for $w^*$, which can be done efficiently with bisection or Newton methods. Forced exploration can be implemented with various sublinear schedules, and the likelihood ratio computations are numerically stable because they depend only on empirical means and cumulative counts. Open-source Julia code for the strategy is available. Thresholds smaller than the theoretically justified ones can be used to stop earlier in practice, but such choices are supported empirically rather than by the $\delta$-PAC proof. The modular structure enables adaptation to settings such as best-m arm identification, adversarial bandits, and various exponential family reward models (Garivier et al., 2016).

References

Garivier, A. and Kaufmann, E. (2016). Optimal Best Arm Identification with Fixed Confidence.
