
D-Tracking Arm-Pulling Strategy

Updated 2 February 2026
  • The paper shows that Track-and-Stop, built on the D-tracking sampling rule, asymptotically matches the sample-complexity lower bound for fixed-confidence best-arm identification.
  • The methodology combines adaptive sampling with forced exploration, tracks estimated optimal sampling proportions, and stops using likelihood ratio statistics to control the error probability.
  • The strategy comes with asymptotic optimality guarantees for multi-armed bandits with one-parameter exponential family rewards.

The D-tracking arm-pulling strategy, also known as Track-and-Stop, is an asymptotically optimal approach for the fixed-confidence best-arm identification problem in stochastic multi-armed bandit models. The strategy efficiently balances exploration and exploitation by adaptively tracking optimal sampling proportions, guided by minimax lower bounds on sample complexity, and relies on a theoretically justified likelihood ratio-based stopping rule to ensure prescribed error probabilities. It has become a foundational methodology for sample-efficient arm identification in multi-armed bandits and related sequential experimental design problems (Garivier et al., 2016).

1. Problem Formulation and Objectives

The fixed-confidence best-arm identification framework considers $K$ arms, each governed by an unknown distribution $\nu_a$ from a one-parameter exponential family with mean $\mu_a$. At each time $t$, a sampling rule selects an arm $A_t$ based on past observations, generating a reward $X_t$. The algorithm terminates at a stopping time $\tau$, outputting an estimated best arm $\hat{a}_\tau$. Given a confidence parameter $\delta \in (0,1)$, a strategy is called $\delta$-PAC (Probably Approximately Correct) if $\mathbb{P}_\mu(\tau<\infty)=1$ and $\mathbb{P}_\mu(\hat{a}_\tau \neq \arg\max_a \mu_a) \leq \delta$. The objective is to minimize sample complexity, i.e., $\mathbb{E}_\mu[\tau]$, subject to the $\delta$-PAC constraint.

2. Minimax Sample Complexity Lower Bound

A fundamental result is a tight lower bound on expected sample complexity. Let $d(\mu, \lambda)$ denote the Kullback-Leibler (KL) divergence between the distributions of the exponential family with means $\mu$ and $\lambda$, and $\text{Alt}(\mu)$ the set of alternative parameter vectors whose best arm differs from that of $\mu$. The characteristic time $T^*(\mu)$ is given by

$$(T^*(\mu))^{-1} = \sup_{w \in \Sigma_K}\inf_{\lambda \in \text{Alt}(\mu)} \sum_{a=1}^K w_a d(\mu_a, \lambda_a)$$

where $\Sigma_K = \{w \in \mathbb{R}_+^K: \sum_a w_a = 1\}$. For any $\delta$-PAC algorithm,

$$\mathbb{E}_\mu[\tau] \geq T^*(\mu) \cdot \text{kl}(\delta, 1-\delta)$$

where $\text{kl}(x, y) = x\log(x/y) + (1-x)\log((1-x)/(1-y))$ is the binary relative entropy, so that $\text{kl}(\delta,1-\delta) \approx \log(1/\delta)$ for small $\delta$ (Garivier et al., 2016).
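To make the size of the $\text{kl}(\delta, 1-\delta)$ factor concrete, the following minimal sketch evaluates the bound numerically. The function names `binary_kl` and `sample_complexity_lower_bound` are illustrative choices, not taken from the source, and the value of $T^*(\mu)$ is assumed to be supplied by the caller.

```python
import math


def binary_kl(x: float, y: float) -> float:
    """Binary relative entropy kl(x, y) between Bernoulli(x) and Bernoulli(y).

    Assumes 0 < x < 1 and 0 < y < 1.
    """
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))


def sample_complexity_lower_bound(t_star: float, delta: float) -> float:
    """Lower bound T*(mu) * kl(delta, 1 - delta) on E[tau] for a delta-PAC rule.

    t_star is the characteristic time T*(mu), assumed to be computed elsewhere.
    """
    return t_star * binary_kl(delta, 1 - delta)


if __name__ == "__main__":
    # Compare kl(delta, 1 - delta) with log(1 / delta) as delta shrinks.
    for delta in (0.1, 0.01, 0.001):
        print(delta, binary_kl(delta, 1 - delta), math.log(1 / delta))
```

Since $\text{kl}(\delta, 1-\delta) = (1-2\delta)\log((1-\delta)/\delta)$, the factor stays slightly below $\log(1/\delta)$ and the ratio of the two tends to 1 as $\delta \to 0$.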

3. Optimal Sampling Proportions and Characterization

Assuming w.l.o.g. $\mu_1 > \mu_2 \geq \cdots \geq \mu_K$, the optimal allocation vector $w^*$, which attains the supremum in the definition of $T^*(\mu)$, is characterized by

$$w^* = \arg\max_{w \in \Sigma_K} \min_{a \neq 1} f_a(w), \quad \text{where } f_a(w) = (w_1 + w_a)\, I_{\frac{w_1}{w_1+w_a}}(\mu_1, \mu_a)$$

and $I_\alpha(x, y) = \alpha d(x, \alpha x + (1 - \alpha) y) + (1 - \alpha) d(y, \alpha x + (1 - \alpha) y)$. The unique $w^*$ equalizes the values $f_a(w^*)$ across all $a \neq 1$. Efficient calculation of $w^*$ reduces to root-finding for a one-dimensional function $F(y)$ related to the derivative of $f_a$ (Garivier et al., 2016).
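The sketch below is one way to carry out this one-dimensional reduction, under assumptions that go beyond the text above. It writes $f_a(w) = w_1 g_a(w_a/w_1)$ with $g_a(x) = d(\mu_1, m_a(x)) + x\, d(\mu_a, m_a(x))$ and $m_a(x) = (\mu_1 + x\mu_a)/(1+x)$, inverts each $g_a$ by bisection, and solves $F(y) = \sum_{a \neq 1} d(\mu_1, m_a)/d(\mu_a, m_a) = 1$. This specific form of $F$ is stated here as an assumption about the source's construction, and all function names are hypothetical. The best arm is assumed unique.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_gaussian(x, y):
    """KL divergence between unit-variance Gaussians with means x and y."""
    return 0.5 * (x - y) ** 2


def bisect(func, lo, hi, iters=100):
    """Approximate root of an increasing function on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def optimal_weights(mu, d):
    """Return (w_star, t_star) for means mu and divergence d (unique best arm assumed)."""
    K = len(mu)
    best = max(range(K), key=lambda a: mu[a])
    others = [a for a in range(K) if a != best]
    mu1 = mu[best]

    def mix(a, x):
        # Weighted mean m_a(x) of the best arm and arm a.
        return (mu1 + x * mu[a]) / (1 + x)

    def g(a, x):
        m = mix(a, x)
        return d(mu1, m) + x * d(mu[a], m)

    def x_of_y(a, y):
        # Inverse of the increasing map x -> g_a(x); bracket by doubling, then bisect.
        hi = 1.0
        while g(a, hi) < y and hi < 1e12:
            hi *= 2
        return bisect(lambda x: g(a, x) - y, 0.0, hi)

    def F(y):
        total = 0.0
        for a in others:
            m = mix(a, x_of_y(a, y))
            total += d(mu1, m) / d(mu[a], m)
        return total

    # F increases from 0 and blows up as y approaches min_a d(mu_1, mu_a).
    y_max = min(d(mu1, mu[a]) for a in others)
    y_star = bisect(lambda y: F(y) - 1.0, 0.0, y_max)

    x = [0.0] * K
    x[best] = 1.0
    for a in others:
        x[a] = x_of_y(a, y_star)
    total = sum(x)
    w_star = [v / total for v in x]
    t_star = total / y_star  # since 1 / T* = w_1 * y_star
    return w_star, t_star
```

For two unit-variance Gaussian arms with gap $\Delta$, this reduction gives $w^* = (1/2, 1/2)$ and $T^*(\mu) = 8/\Delta^2$, which is a convenient sanity check. Nested bisection is chosen here for simplicity; a bracketing solver from a numerical library would serve equally well.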

4. D-Tracking Sampling Rule

The D-tracking sampling rule combines two components: forced exploration and proportion tracking. Forced exploration ensures that each arm is sampled on the order of $\sqrt{t}$ times by round $t$, which guarantees convergence of the empirical means and addresses early-stage uncertainty. When no arm is under-sampled in this sense, the algorithm tracks the estimated optimal proportions:

  • Compute empirical means $\hat{\mu}_a$.
  • Solve for $w^*(\hat{\mu})$ as above.
  • Pull at each round the arm maximizing $t \cdot w^*_a(\hat{\mu}) - N_a$, where $N_a$ is the number of times arm $a$ has been pulled.

This scheme ensures that, almost surely, the empirical sampling proportions $N_a/t$ converge to $w^*_a(\mu)$ as $t \to \infty$ (Garivier et al., 2016).
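The arm-selection step can be sketched as follows. The forced-exploration threshold $\sqrt{t} - K/2$ used here is an assumption (the text above only requires a count of order $\sqrt{t}$), and the function name `d_tracking_choose` is hypothetical. The plug-in proportions `w_hat` are assumed to be computed separately from the current empirical means.

```python
import math


def d_tracking_choose(counts, w_hat):
    """Choose the next arm from pull counts N_a and plug-in proportions w*(mu_hat).

    counts: list of pull counts N_a so far (their sum is the current round t).
    w_hat:  list of estimated optimal proportions, summing to 1.
    """
    K = len(counts)
    t = sum(counts)

    # Forced exploration: arms whose count is below the assumed sqrt(t) - K/2 threshold.
    threshold = math.sqrt(t) - K / 2
    under_sampled = [a for a in range(K) if counts[a] < threshold]
    if under_sampled:
        return min(under_sampled, key=lambda a: counts[a])

    # Direct tracking: the arm whose count lags its target t * w_a the most.
    return max(range(K), key=lambda a: t * w_hat[a] - counts[a])
```

For example, with counts `[50, 30, 20]` and proportions `[0.4, 0.4, 0.2]`, no arm is under-sampled and the deficits are $-10$, $10$ and $0$, so arm index 1 is pulled.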

5. Likelihood Ratio and Stopping Rule

The stopping rule employs a generalized likelihood ratio statistic for each pair $(a, b)$:

$$Z_{ab}(t) = \log\frac{\sup_{\mu'_a \geq \mu'_b} L_a(t;\mu'_a)L_b(t;\mu'_b)}{\sup_{\mu'_a \leq \mu'_b} L_a(t;\mu'_a)L_b(t;\mu'_b)}$$

where $L_a(t; \mu)$ is the likelihood of the first $N_a$ samples from arm $a$ given mean $\mu$. In exponential families, if $\hat{\mu}_a > \hat{\mu}_b$,

$$Z_{ab}(t) = N_a d(\hat{\mu}_a, \hat{\mu}_{ab}) + N_b d(\hat{\mu}_b, \hat{\mu}_{ab})$$

where $\hat{\mu}_{ab} = (N_a \hat{\mu}_a + N_b \hat{\mu}_b)/(N_a + N_b)$ is the pooled mean. The stopping time is

$$\tau = \inf\left\{ t : \exists a\ \forall b \neq a,\ Z_{ab}(t) > \beta(t,\delta)\right\}$$

with a threshold $\beta(t, \delta) = \log(2t(K-1)/\delta)$. Upon stopping, the algorithm recommends the arm with the largest empirical mean. This rule ensures the $\delta$-PAC property holds for any sampling procedure (Garivier et al., 2016).
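A minimal sketch of this stopping check for Bernoulli rewards is given below. Because $Z_{ab}(t)$ can only be positive when $\hat{\mu}_a > \hat{\mu}_b$, it is enough to test the empirical best arm against every rival. The function names are hypothetical, and every arm is assumed to have been pulled at least once.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def glr_statistic(mean_a, n_a, mean_b, n_b, d=kl_bernoulli):
    """Z_ab(t) in the case mean_a > mean_b, using the pooled mean of the two arms."""
    pooled = (n_a * mean_a + n_b * mean_b) / (n_a + n_b)
    return n_a * d(mean_a, pooled) + n_b * d(mean_b, pooled)


def should_stop(means, counts, delta, d=kl_bernoulli):
    """Return (stop, best): whether to stop, and the empirical best arm."""
    K = len(means)
    t = sum(counts)
    best = max(range(K), key=lambda a: means[a])

    # Smallest statistic of the empirical best arm against its rivals.
    z = min(
        glr_statistic(means[best], counts[best], means[b], counts[b], d)
        for b in range(K)
        if b != best
    )
    beta = math.log(2 * t * (K - 1) / delta)
    return z > beta, best
```

With empirical means `[0.9, 0.1]` after 100 pulls each and $\delta = 0.05$, the statistic is roughly 73.6 against a threshold of $\log(8000) \approx 8.99$, so the rule stops and recommends arm index 0.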

6. Asymptotic Optimality and Theoretical Guarantees

The Track-and-Stop algorithm, combining D-tracking sampling with the above stopping rule, achieves asymptotic sample complexity matching the lower bound: $\frac{\mathbb{E}_\mu[\tau]}{\log(1/\delta)} \to T^*(\mu)$ as $\delta \to 0$. The guarantees follow from the design: forced exploration ensures that the empirical means, and hence the plug-in proportions $w^*(\hat{\mu})$, are consistent, while the likelihood ratio threshold controls the error probability. Practical variants, such as the Best-Challenger variant that considers only the empirical champion and its strongest rival, offer computational efficiency as heuristics, without the same formal optimality guarantee (Garivier et al., 2016).
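To show how the sampling and stopping rules fit together, here is a simplified end-to-end sketch restricted to two Bernoulli arms. It is an illustration under stated assumptions, not the source's implementation: the optimal proportions are found by a ternary search over $w_1$ (valid because the two-arm objective is concave in $w$), each arm is pulled once to initialize, and the forced-exploration threshold $\sqrt{t} - K/2$ is assumed. All function names are hypothetical.

```python
import math
import random


def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def two_arm_weights(m0, m1, iters=60):
    """Maximize w0*d(m0, pooled) + w1*d(m1, pooled) over w0 by ternary search."""
    if m0 == m1:
        return [0.5, 0.5]  # proportions are undefined under ties; fall back to uniform

    def value(w0):
        pooled = w0 * m0 + (1 - w0) * m1
        return w0 * kl_bernoulli(m0, pooled) + (1 - w0) * kl_bernoulli(m1, pooled)

    lo, hi = 0.0, 1.0
    for _ in range(iters):
        left = lo + (hi - lo) / 3
        right = hi - (hi - lo) / 3
        if value(left) < value(right):
            lo = left
        else:
            hi = right
    w0 = 0.5 * (lo + hi)
    return [w0, 1 - w0]


def track_and_stop_two_arms(true_means, delta, rng, max_steps=100_000):
    """Return (tau, recommended_arm, stopped) for a two-armed Bernoulli bandit."""
    counts = [0, 0]
    sums = [0.0, 0.0]
    best = 0
    for t in range(1, max_steps + 1):
        n = t - 1  # number of observations available when choosing the arm
        if 0 in counts:
            arm = counts.index(0)  # initialization: pull each arm once
        else:
            means = [sums[a] / counts[a] for a in range(2)]
            under = [a for a in range(2) if counts[a] < math.sqrt(n) - 1]
            if under:
                arm = min(under, key=lambda a: counts[a])  # forced exploration
            else:
                w = two_arm_weights(means[0], means[1])
                arm = max(range(2), key=lambda a: n * w[a] - counts[a])  # tracking

        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        if 0 in counts:
            continue

        means = [sums[a] / counts[a] for a in range(2)]
        best = 0 if means[0] >= means[1] else 1
        pooled = (sums[0] + sums[1]) / t
        z = counts[0] * kl_bernoulli(means[0], pooled) + counts[1] * kl_bernoulli(means[1], pooled)
        if z > math.log(2 * t / delta):  # beta(t, delta) with K = 2
            return t, best, True
    return max_steps, best, False
```

A typical call is `track_and_stop_two_arms([0.9, 0.1], 0.05, random.Random(0))`. Averaging the returned stopping time over many seeds and comparing it with $T^*(\mu)\log(1/\delta)$ gives a rough empirical check of the asymptotic claim, keeping in mind that the match is only expected as $\delta \to 0$.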

7. Implementation Considerations and Practical Aspects

Implementation of D-tracking involves repeated root-finding for $w^*$, efficiently accomplished using bisection or Newton methods. Forced exploration can be implemented with various sublinear schedules, and the likelihood ratio computations are numerically stable due to their reliance on empirical means and cumulative counts. Open-source Julia code for the strategy is available. Threshold tuning can yield practical speed-ups at negligible risk to PAC guarantees. The modular structure enables adaptation to settings such as best-m arm identification, adversarial bandits, and various exponential family reward models (Garivier et al., 2016).

