Two-State Bernoulli Bandits

Updated 25 December 2025
  • Two-state Bernoulli bandits are a decision-theoretic model with two arms generating binary rewards with unknown success probabilities, exemplifying the exploration–exploitation tradeoff.
  • The framework employs Bayesian inference, Beta priors, and dynamic programming to update beliefs and optimize cumulative rewards over finite or infinite horizons.
  • Analyses cover regret bounds across different gap regimes and explore extensions like streaming and dynamic bandits to enhance both theoretical insights and practical applications.

A two-state Bernoulli bandit is a canonical stochastic decision problem consisting of two arms (actions), each generating i.i.d. Bernoulli rewards with unknown parameters. At each round, a decision-maker selects one arm to pull, observes a binary reward, and aims to optimize a cumulative objective—such as total expected reward or minimal regret—over a finite or infinite horizon. This setting is used extensively to formalize and study fundamental exploration–exploitation tradeoffs in sequential learning, with direct relevance to statistics, decision theory, reinforcement learning, and information theory.

1. Mathematical Formulation and Bayesian Principles

A two-state Bernoulli bandit specifies two arms $i = 1, 2$, with unknown success probabilities $p_i \in (0,1)$. Pulling arm $i$ at stage $t$ yields reward $Z_t \in \{0,1\}$ distributed as $\mathrm{Bernoulli}(p_i)$. The Bayesian approach imposes independent conjugate Beta priors $p_i \sim \mathrm{Beta}(\alpha_i, \beta_i)$ for each arm; the system's state can be represented as $(\alpha_1, \beta_1; \alpha_2, \beta_2)$.

The objective over a horizon $n$ with discount factor $0 < \delta \leq 1$ is to maximize

$$E^\pi \Big[\sum_{t=1}^n \delta^{t-1} Z_t \Big]$$

where $\pi$ is a sequential allocation policy based on observed data. Posterior updates for arm $i$ after pulling it and observing $x \in \{0,1\}$ are $(\alpha_i, \beta_i) \mapsto (\alpha_i + x, \beta_i + 1 - x)$, while the other arm's parameters remain unchanged.
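
As a concrete illustration of the conjugate update (a minimal sketch; the helper names are this example's, not from the cited papers):

```python
# Beta-Bernoulli conjugate update for a single arm.
def update(alpha, beta, x):
    """Observe a binary reward x; successes increment alpha, failures beta."""
    return alpha + x, beta + 1 - x

def posterior_mean(alpha, beta):
    return alpha / (alpha + beta)

# Start from the uniform Beta(1, 1) prior; observe a success, then a failure.
a, b = 1, 1
a, b = update(a, b, 1)   # Beta(2, 1)
a, b = update(a, b, 0)   # Beta(2, 2)
print(posterior_mean(a, b))  # 0.5
```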

The value function $V(\alpha_1, \beta_1; \alpha_2, \beta_2)$ gives the maximal expected discounted payoff from state $(\alpha_1, \beta_1; \alpha_2, \beta_2)$ and satisfies the recursion $V(\alpha_1, \beta_1; \alpha_2, \beta_2) = \max_{i=1,2} Q_i(\alpha_1, \beta_1; \alpha_2, \beta_2)$ with action-values

$$Q_1 = \frac{\alpha_1}{\alpha_1+\beta_1} + \delta \left[ \frac{\alpha_1}{\alpha_1+\beta_1} V(\alpha_1+1, \beta_1; \alpha_2, \beta_2) + \frac{\beta_1}{\alpha_1+\beta_1} V(\alpha_1, \beta_1+1; \alpha_2, \beta_2) \right]$$

and a symmetric formula for $Q_2$ (Yu, 2011; Jacko, 2019).

2. Structure of Optimal Policies and Index Rules

The infinite-horizon discounted problem admits a remarkable structure: the optimal policy is an index rule based on the Gittins index $g(\alpha, \beta; \delta)$ of each arm. This index is the unique solution $x$ to

$$x = \frac{\alpha}{\alpha+\beta} + \delta \left[ \frac{\alpha}{\alpha+\beta} g(\alpha+1, \beta; \delta) + \frac{\beta}{\alpha+\beta} g(\alpha, \beta+1; \delta) \right]$$

At each epoch, the arm with the higher index is pulled. This result is a consequence of two monotonicity theorems:

  • Monotonicity in prior mean: At fixed prior weight $T = \alpha + \beta$, a higher prior mean $\mu = \alpha/(\alpha+\beta)$ makes the arm more attractive; $V$ increases with $\mu$.
  • Monotonicity in prior weight: At fixed mean, greater prior weight $T$ (i.e., more data and less uncertainty) makes the arm less attractive; $V$ decreases with $T$ (Yu, 2011).
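
The index has no closed form, but it can be approximated numerically. The sketch below uses Whittle's retirement-value calibration rather than the recursion above: bisect on a retirement reward $M$ until pulling and retiring are indifferent, truncating the lookahead at a fixed depth. Function names, depth, and tolerances are choices of this illustration, not of the cited papers.

```python
def gittins_index(alpha, beta, delta=0.9, depth=40, tol=1e-5):
    """Approximate Gittins index of a Beta(alpha, beta) Bernoulli arm via
    bisection on Whittle's retirement value M; the recursion is truncated
    at `depth` future pulls."""
    def value(M):
        memo = {}
        def v(a, b, d):
            key = (a, b, d)
            if key not in memo:
                mu = a / (a + b)
                if d == 0:
                    # At the truncation depth, compare retiring with
                    # pulling this arm forever at its current mean.
                    memo[key] = max(M, mu / (1 - delta))
                else:
                    cont = mu + delta * (mu * v(a + 1, b, d - 1)
                                         + (1 - mu) * v(a, b + 1, d - 1))
                    memo[key] = max(M, cont)
            return memo[key]
        return v(alpha, beta, depth)

    lo, hi = 0.0, 1.0 / (1 - delta)
    while hi - lo > tol:
        M = (lo + hi) / 2
        if value(M) > M + 1e-12:   # continuing strictly beats retiring
            lo = M
        else:
            hi = M
    return (1 - delta) * lo        # index on the per-step reward scale
```

The returned index exceeds the posterior mean (the exploration bonus), and the bonus shrinks as prior weight grows, matching the monotonicity theorems above.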

In the finite-horizon setting, exact dynamic programming (DP) yields the Bayes-optimal policy via backward induction on the joint state $(\alpha_1, \beta_1; \alpha_2, \beta_2)$. The computational complexity is $O(T^4)$, where $T$ is the horizon, yet for two arms this is feasible for large $T$ on modern hardware (e.g., $T \approx 1{,}440$ offline and $T \approx 4{,}440$ online in seconds) (Jacko, 2019).
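
The backward induction itself fits in a few lines with memoization (an illustrative sketch, not the BinaryBandit implementation; undiscounted, i.e., the finite-horizon objective with $\delta = 1$):

```python
from functools import lru_cache

def bayes_optimal_value(a1, b1, a2, b2, horizon):
    """Expected total reward of the Bayes-optimal policy over `horizon`
    remaining pulls, by backward induction on the joint Beta state.
    Memoization visits each reachable state once (the O(T^4) count)."""
    @lru_cache(maxsize=None)
    def V(a1, b1, a2, b2, t):
        if t == 0:
            return 0.0
        m1, m2 = a1 / (a1 + b1), a2 / (a2 + b2)
        # Q-value of each arm: immediate posterior-mean reward plus the
        # value of the updated state under each binary outcome.
        q1 = m1 * (1 + V(a1 + 1, b1, a2, b2, t - 1)) \
             + (1 - m1) * V(a1, b1 + 1, a2, b2, t - 1)
        q2 = m2 * (1 + V(a1, b1, a2 + 1, b2, t - 1)) \
             + (1 - m2) * V(a1, b1, a2, b2 + 1, t - 1)
        return max(q1, q2)
    return V(a1, b1, a2, b2, horizon)
```

For example, from uniform priors and horizon 2 the optimal value is $13/12$, strictly better than the myopic $2 \times 0.5 = 1$, because the first pull's outcome informs the second.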

3. Regret Analysis and Asymptotics

In the symmetric Bernoulli bandit ($p_1 + p_2 = 1$), minimax regret analysis has been associated with the solution of a linear heat equation. The regret $R_T$ and pseudoregret $\bar{R}_T$ over horizon $T$ obey sharp asymptotics determined by the gap $\Delta = |p_1 - p_2|$:

  • Small-gap regime ($\Delta \ll T^{-1/2}$):

$$R_T^* \sim (1/\sqrt{\pi}) \sqrt{T} \approx 0.564 \sqrt{T}$$

Pseudoregret: $\bar{R}_T^* \sim \Delta T$

  • Medium-gap regime ($\Delta = \gamma / \sqrt{T}$):

$$R_T^* \sim c(\gamma)\sqrt{T}$$

with an explicit function $c(\gamma)$ (Kobzar et al., 2022).

  • Large-gap regime ($\Delta \gg T^{-1/2}$):

$$R_T^* \sim 1/\Delta, \quad \bar{R}_T^* \sim 1/\Delta$$

until $O(1)$ regret saturation at fixed $\Delta$.

Non-asymptotic upper and lower bounds originate from viewing the DP value recursion as a finite-difference approximation to the heat equation; the discretization error is $O(1 + \Delta^2 T)$. This approach yields explicit leading-order regret rates across all regimes (Kobzar et al., 2022).
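
The small-gap constant and the regime boundary are easy to check numerically (pure arithmetic; no claim beyond the asymptotics quoted above):

```python
import math

# Small-gap minimax constant: R_T* ~ sqrt(T) / sqrt(pi) ~ 0.564 sqrt(T).
c_small = 1 / math.sqrt(math.pi)
print(round(c_small, 3))  # 0.564

# The regimes meet near Delta ~ T**(-1/2): there the large-gap rate
# 1/Delta and the sqrt(T)-scale rates agree up to constants.
T = 10_000
delta_star = T ** -0.5
print(1 / delta_star, math.sqrt(T))  # both equal 100 up to rounding
```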

4. Algorithmic Approaches and Benchmarks

Bayes-optimal DP remains the gold standard for moderate horizons, though various heuristic and index-based algorithms are widely analyzed:

  • Gittins Index (infinite discounted horizon): Optimal, as discussed above.
  • Whittle Index (finite horizon): Used as an approximation; requires horizon-dependent truncation.
  • Thompson Sampling: At each time, sample $(\theta_1, \theta_2)$ from the Beta posteriors and play $\arg\max_i \theta_i$. Empirically strong but lacks matching regret guarantees in finite horizons.
  • Optimistic UCB-style algorithms: Compute $\mathrm{UCB}_i(t) = \hat{\theta}_i + \sqrt{\alpha \ln(t+1) / n_i}$ with $\alpha > 0$. The classical choice $\alpha = 2$ is substantially suboptimal, and even tuned $\alpha$ leaves a gap.
  • Hybrid heuristics: e.g., BLFF+BM and BLFF+0.18-UCB achieve within $10$–$20\%$ of Bayes-optimal DP (Jacko, 2019).
  • OFUGLB (Optimistic Frequentist Upper-bound for Generalized Linear Bandits): Constructs a likelihood-ratio confidence sequence for each arm and pulls the arm with the highest upper confidence bound on success probability. With high probability,

$$R_T = O\left(K \sqrt{T \log (S T/K)} + K^2 \log T\right)$$

for two-state Bernoulli bandits, matching the optimal UCB rates up to lower-order terms and avoiding polynomial dependence on $S$ (the constraint on the logistic parameter) (Lee et al., 2024).

For moderate $T$, exact DP achieves near-constant regret; UCB with standard $\alpha$ can be $7$–$12\times$ worse than DP, while tuned UCB still incurs $\sim 2\times$ higher regret (Jacko, 2019).
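
For concreteness, the Thompson-sampling and UCB rules from the list above can be sketched as follows (a minimal illustration; the horizon, seed, and priors are arbitrary choices of this example):

```python
import math
import random

def thompson_pull(state):
    """state = [(alpha_1, beta_1), (alpha_2, beta_2)]: draw one sample per
    posterior and play the argmax."""
    draws = [random.betavariate(a, b) for a, b in state]
    return max(range(len(state)), key=lambda i: draws[i])

def ucb_pull(counts, sums, t, alpha=2.0):
    """Classical UCB; alpha = 2 is the textbook choice the text flags as
    substantially suboptimal in this problem."""
    for i in range(len(counts)):
        if counts[i] == 0:
            return i  # pull each arm once before using the bound
    return max(range(len(counts)),
               key=lambda i: sums[i] / counts[i]
               + math.sqrt(alpha * math.log(t + 1) / counts[i]))

def run_thompson(p, T, seed=0):
    """Total reward of Thompson sampling over T rounds on true means p."""
    random.seed(seed)
    state, total = [(1, 1), (1, 1)], 0
    for _ in range(T):
        i = thompson_pull(state)
        x = int(random.random() < p[i])
        a, b = state[i]
        state[i] = (a + x, b + 1 - x)
        total += x
    return total
```

With a clear gap (e.g., true means 0.7 and 0.3), the sampler concentrates on the better arm quickly, so total reward over 1,000 rounds lands well above the roughly 500 a uniform player would collect.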

5. Exploration–Exploitation Dilemma and Information-Theoretic Policies

The Bayesian formalism yields a rigorous understanding of exploration–exploitation. Monotonicity theorems imply that higher prior mean is inherently more attractive (exploitation), but at equal mean, lower prior weight (greater uncertainty) confers more value (exploration incentive) (Yu, 2011). This quantifies the exploration bonus analytically.

Information-directed sampling (IDS) policies formalize an explicit trade-off between one-step regret and information gain (reduction in posterior entropy). For the symmetric two-state Bernoulli bandit, the IDS policy coincides with the myopic posterior-mean-maximizing rule and achieves bounded cumulative regret. In more challenging settings (e.g., one fair coin and one biased coin), IDS achieves $\Theta(\log T)$ regret, matching the Lai–Robbins lower bound (Hirling et al., 23 Dec 2025). The IDS framework introduces a tuning parameter $\alpha$ to interpolate between exploitation and exploration:

$$\Psi_\alpha(\pi, b) = \frac{\Delta(\pi, b)^{1/\alpha}}{I_\pi(b)^{1/\alpha - 1}}$$

where $\Delta(\pi, b)$ is the expected regret and $I_\pi(b)$ the expected information gain.
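
The generalized ratio can be written down directly once per-action expected regret $\Delta$ and information gain $I$ are available; the sketch below treats them as given inputs (the numbers in the usage line are hypothetical):

```python
def ids_ratio(delta, info, alpha=2.0):
    """Psi_alpha = delta**(1/alpha) / info**(1/alpha - 1). Per the formula
    in the text, alpha = 1/2 recovers the classic regret-squared-over-
    information ratio delta**2 / info."""
    return delta ** (1 / alpha) / info ** (1 / alpha - 1)

def ids_action(deltas, infos, alpha=2.0):
    """Play the action minimizing the information ratio."""
    return min(range(len(deltas)),
               key=lambda i: ids_ratio(deltas[i], infos[i], alpha))

# Hypothetical one-step quantities for two arms: arm 0 has lower regret,
# arm 1 is more informative; alpha trades the two off.
print(ids_action([0.05, 0.20], [0.10, 0.60], alpha=0.5))  # -> 0
```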

6. Generalizations and Variants

Several extensions modify the canonical model:

  • Streaming (Online) Bernoulli Bandit: Each bandit (arm) is encountered exactly once in a stream and, if skipped, cannot be revisited. Threshold-based "skip or stay" policies emerge as nearly optimal, with per-pull expected loss decaying polynomially in the pool size $N$ (not $\sqrt{K}$ as in revisitable MABs). The classical trade-off disappears: exploration is conducted by skipping rather than by repeated sampling (Roy et al., 2017).
  • Dynamic Bernoulli Bandits: Each arm's reward distribution evolves as a two-state Markov chain between high and low success probabilities. Adaptive Forgetting Factor (AFF) algorithms (AFF-$d$-Greedy, AFF-UCB, AFF-TS) discount old observations using a learnable parameter, improving performance over classic algorithms in environments with changing means. Empirically, AFF-based Thompson sampling achieves the best simulated regret under both slow and fast switching (Lu et al., 2017).
  • Frequentist vs. Bayesian Optimality: Bayes-optimal DP is optimal only with respect to the chosen prior; it is not minimax-optimal for fixed parameters, and heuristic rules may outperform DP for some configurations (Jacko, 2019). The Gittins policy does not ensure complete learning in all settings; finite-horizon formulations circumvent this issue.
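
The forgetting idea behind the dynamic-bandit algorithms can be illustrated with a fixed discount on the Beta pseudo-counts; the AFF methods of Lu et al. (2017) learn this factor adaptively, whereas the sketch below fixes it:

```python
def discounted_update(alpha, beta, x, gamma=0.95):
    """Geometrically discount old pseudo-counts before adding the new binary
    observation; the effective sample size is capped near 1 / (1 - gamma),
    so the posterior can track a drifting success probability."""
    return gamma * alpha + x, gamma * beta + (1 - x)

# A stream whose true success rate switches from high to low mid-way:
a, b = 1.0, 1.0
for x in [1] * 50 + [0] * 50:
    a, b = discounted_update(a, b, x)
print(a / (a + b))  # well below 0.5: the old successes have been forgotten
```

With `gamma = 1` this reduces to the standard conjugate update, recovering the stationary model.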

7. Empirical, Computational, and Practical Considerations

Modern implementations can solve the two-state Bernoulli bandit optimally via DP for horizons up to thousands in practical time and memory (e.g., BinaryBandit package in Julia) (Jacko, 2019). Empirical benchmarks confirm that many heuristics under-explore or suffer 2–10× increased regret compared to DP. Efficient index computation (Gittins, Whittle) reduces dimensionality from the full joint state to per-arm subproblems, but nontrivial dynamic programming is still required; closed-form indices are unavailable (Yu, 2011).

Classic myths—such as DP intractability, universal optimality of UCB, or inevitable logarithmic regret growth—are explicitly addressed and refuted in the recent literature. Optimal, near-optimal, and robust algorithmic options now exist across stochastic, adversarial, and dynamic two-state Bernoulli bandit scenarios (Jacko, 2019; Hirling et al., 23 Dec 2025; Lee et al., 2024).


Key References:

  • "Structural Properties of Bayesian Bandits with Exponential Family Distributions" (Yu, 2011)
  • "The Finite-Horizon Two-Armed Bandit Problem with Binary Responses" (Jacko, 2019)
  • "A PDE-Based Analysis of the Symmetric Two-Armed Bernoulli Bandit" (Kobzar et al., 2022)
  • "Online Multi-Armed Bandit" (Roy et al., 2017)
  • "On Adaptive Estimation for Dynamic Bernoulli Bandits" (Lu et al., 2017)
  • "Information-directed sampling for bandits: a primer" (Hirling et al., 23 Dec 2025)
  • "A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits" (Lee et al., 2024)
