Two-State Bernoulli Bandits

Updated 25 December 2025
  • Two-state Bernoulli bandits are a decision-theoretic model with two arms generating binary rewards with unknown success probabilities, exemplifying the exploration–exploitation tradeoff.
  • The framework employs Bayesian inference, Beta priors, and dynamic programming to update beliefs and optimize cumulative rewards over finite or infinite horizons.
  • Analyses cover regret bounds across different gap regimes and explore extensions like streaming and dynamic bandits to enhance both theoretical insights and practical applications.

A two-state Bernoulli bandit is a canonical stochastic decision problem consisting of two arms (actions), each generating i.i.d. Bernoulli rewards with unknown parameters. At each round, a decision-maker selects one arm to pull, observes a binary reward, and aims to optimize a cumulative objective—such as total expected reward or minimal regret—over a finite or infinite horizon. This setting is used extensively to formalize and study fundamental exploration–exploitation tradeoffs in sequential learning, with direct relevance to statistics, decision theory, reinforcement learning, and information theory.

1. Mathematical Formulation and Bayesian Principles

A two-state Bernoulli bandit specifies two arms $i = 1, 2$, with unknown success probabilities $p_i \in (0,1)$. Pulling arm $i$ at stage $t$ yields reward $Z_t \in \{0,1\}$ distributed as $\mathrm{Bernoulli}(p_i)$. The Bayesian approach imposes independent conjugate Beta priors $p_i \sim \mathrm{Beta}(\alpha_i, \beta_i)$ for each arm; the system's state can be represented as $(\alpha_1, \beta_1; \alpha_2, \beta_2)$.

The objective over a horizon $n$ with discount factor $0 < \delta \leq 1$ is to maximize

$$E^\pi \Big[\sum_{t=1}^n \delta^{t-1} Z_t \Big]$$

where $\pi$ is a sequential allocation policy based on observed data. Posterior updates for arm $i$ after pulling it and observing $x \in \{0,1\}$ are $(\alpha_i, \beta_i) \mapsto (\alpha_i + x, \beta_i + 1 - x)$, while the other arm's parameters remain unchanged.
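
As a concrete illustration of the conjugate update (a minimal sketch; the helper names are this example's, not from the cited papers):

```python
# Beta-Bernoulli conjugate update for a single arm.
def update(alpha, beta, x):
    """Observe a binary reward x; successes increment alpha, failures beta."""
    return alpha + x, beta + 1 - x

def posterior_mean(alpha, beta):
    return alpha / (alpha + beta)

# Start from the uniform Beta(1, 1) prior; observe a success, then a failure.
a, b = 1, 1
a, b = update(a, b, 1)   # Beta(2, 1)
a, b = update(a, b, 0)   # Beta(2, 2)
print(posterior_mean(a, b))  # 0.5
```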

The value function $V(\alpha_1, \beta_1; \alpha_2, \beta_2)$ gives the maximal expected discounted payoff from state $(\alpha_1, \beta_1; \alpha_2, \beta_2)$ and satisfies the recursion $V(\alpha_1, \beta_1; \alpha_2, \beta_2) = \max_{i=1,2} Q_i(\alpha_1, \beta_1; \alpha_2, \beta_2)$ with action-values

$$Q_1 = \frac{\alpha_1}{\alpha_1+\beta_1} + \delta \left[ \frac{\alpha_1}{\alpha_1+\beta_1} V(\alpha_1+1, \beta_1; \alpha_2, \beta_2) + \frac{\beta_1}{\alpha_1+\beta_1} V(\alpha_1, \beta_1+1; \alpha_2, \beta_2) \right]$$

and a symmetric formula for $Q_2$ (Yu, 2011; Jacko, 2019).

2. Structure of Optimal Policies and Index Rules

The infinite-horizon discounted problem admits a remarkable structure: the optimal policy is an index rule based on the Gittins index $g(\alpha, \beta; \delta)$ of each arm. This index is the unique solution $x$ to

$$x = \frac{\alpha}{\alpha+\beta} + \delta \left[ \frac{\alpha}{\alpha+\beta} g(\alpha+1, \beta; \delta) + \frac{\beta}{\alpha+\beta} g(\alpha, \beta+1; \delta) \right]$$

At each epoch, the arm with the higher index is pulled. This result is a consequence of two monotonicity theorems:

  • Monotonicity in prior mean: At fixed prior weight $T = \alpha + \beta$, a higher prior mean $\mu = \alpha/(\alpha+\beta)$ makes the arm more attractive; $V$ increases with $\mu$.
  • Monotonicity in prior weight: At fixed mean, greater prior weight $T$ (i.e., more data and less uncertainty) makes the arm less attractive; $V$ decreases with $T$ (Yu, 2011).
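
The index has no closed form, but it can be approximated numerically. The sketch below uses Whittle's retirement-value calibration rather than the recursion above: bisect on a retirement reward $M$ until pulling and retiring are indifferent, truncating the lookahead at a fixed depth. Function names, depth, and tolerances are choices of this illustration, not of the cited papers.

```python
def gittins_index(alpha, beta, delta=0.9, depth=40, tol=1e-5):
    """Approximate Gittins index of a Beta(alpha, beta) Bernoulli arm via
    bisection on Whittle's retirement value M; the recursion is truncated
    at `depth` future pulls."""
    def value(M):
        memo = {}
        def v(a, b, d):
            key = (a, b, d)
            if key not in memo:
                mu = a / (a + b)
                if d == 0:
                    # At the truncation depth, compare retiring with
                    # pulling this arm forever at its current mean.
                    memo[key] = max(M, mu / (1 - delta))
                else:
                    cont = mu + delta * (mu * v(a + 1, b, d - 1)
                                         + (1 - mu) * v(a, b + 1, d - 1))
                    memo[key] = max(M, cont)
            return memo[key]
        return v(alpha, beta, depth)

    lo, hi = 0.0, 1.0 / (1 - delta)
    while hi - lo > tol:
        M = (lo + hi) / 2
        if value(M) > M + 1e-12:   # continuing strictly beats retiring
            lo = M
        else:
            hi = M
    return (1 - delta) * lo        # index on the per-step reward scale
```

The returned index exceeds the posterior mean (the exploration bonus), and the bonus shrinks as prior weight grows, matching the monotonicity theorems above.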

In the finite-horizon setting, exact dynamic programming (DP) yields the Bayes-optimal policy via backward induction on the joint state $(\alpha_1, \beta_1; \alpha_2, \beta_2)$. The computational complexity is $O(T^4)$, where $T$ is the horizon, yet for two arms this is feasible for large $T$ on modern hardware (e.g., $T \approx 1{,}440$ offline and $T \approx 4{,}440$ online in seconds) (Jacko, 2019).
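
The backward induction itself fits in a few lines with memoization (an illustrative sketch, not the BinaryBandit implementation; undiscounted, i.e., the finite-horizon objective with $\delta = 1$):

```python
from functools import lru_cache

def bayes_optimal_value(a1, b1, a2, b2, horizon):
    """Expected total reward of the Bayes-optimal policy over `horizon`
    remaining pulls, by backward induction on the joint Beta state.
    Memoization visits each reachable state once (the O(T^4) count)."""
    @lru_cache(maxsize=None)
    def V(a1, b1, a2, b2, t):
        if t == 0:
            return 0.0
        m1, m2 = a1 / (a1 + b1), a2 / (a2 + b2)
        # Q-value of each arm: immediate posterior-mean reward plus the
        # value of the updated state under each binary outcome.
        q1 = m1 * (1 + V(a1 + 1, b1, a2, b2, t - 1)) \
             + (1 - m1) * V(a1, b1 + 1, a2, b2, t - 1)
        q2 = m2 * (1 + V(a1, b1, a2 + 1, b2, t - 1)) \
             + (1 - m2) * V(a1, b1, a2, b2 + 1, t - 1)
        return max(q1, q2)
    return V(a1, b1, a2, b2, horizon)
```

For example, from uniform priors and horizon 2 the optimal value is $13/12$, strictly better than the myopic $2 \times 0.5 = 1$, because the first pull's outcome informs the second.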

3. Regret Analysis and Asymptotics

In the symmetric Bernoulli bandit ($p_1 + p_2 = 1$), minimax regret analysis has been associated with the solution of a linear heat equation. The regret $R_T$ and pseudoregret $\bar{R}_T$ over horizon $T$ obey sharp asymptotics determined by the gap $\Delta = |p_1 - p_2|$:

  • Small-gap regime ($\Delta \ll T^{-1/2}$):

$$R_T^* \sim (1/\sqrt{\pi}) \sqrt{T} \approx 0.564 \sqrt{T}$$

Pseudoregret: $\bar{R}_T^* \sim \Delta T$

  • Medium-gap regime ($\Delta = \gamma / \sqrt{T}$):

$$R_T^* \sim c(\gamma)\sqrt{T}$$

with an explicit function $c(\gamma)$ (Kobzar et al., 2022).

  • Large-gap regime ($\Delta \gg T^{-1/2}$):

$$R_T^* \sim 1/\Delta, \quad \bar{R}_T^* \sim 1/\Delta$$

until $O(1)$ regret saturation at fixed $\Delta$.

Non-asymptotic upper and lower bounds originate from viewing the DP value recursion as a finite-difference approximation to the heat equation; the discretization error is $O(1 + \Delta^2 T)$. This approach yields explicit leading-order regret rates across all regimes (Kobzar et al., 2022).
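
The small-gap constant and the regime boundary are easy to check numerically (pure arithmetic; no claim beyond the asymptotics quoted above):

```python
import math

# Small-gap minimax constant: R_T* ~ sqrt(T) / sqrt(pi) ~ 0.564 sqrt(T).
c_small = 1 / math.sqrt(math.pi)
print(round(c_small, 3))  # 0.564

# The regimes meet near Delta ~ T**(-1/2): there the large-gap rate
# 1/Delta and the sqrt(T)-scale rates agree up to constants.
T = 10_000
delta_star = T ** -0.5
print(1 / delta_star, math.sqrt(T))  # both equal 100 up to rounding
```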

4. Algorithmic Approaches and Benchmarks

Bayes-optimal DP remains the gold standard for moderate horizons, though various heuristic and index-based algorithms are widely analyzed:

  • Gittins Index (infinite discounted horizon): Optimal, as discussed above.
  • Whittle Index (finite horizon): Used as an approximation; requires horizon-dependent truncation.
  • Thompson Sampling: At each time, sample $(\theta_1, \theta_2)$ from the Beta posteriors and play $\arg\max_i \theta_i$. Empirically strong but lacks matching regret guarantees in finite horizons.
  • Optimistic UCB-style algorithms: Compute $\mathrm{UCB}_i(t) = \hat{\theta}_i + \sqrt{\alpha \ln(t+1) / n_i}$ with $\alpha > 0$. The classical choice $\alpha = 2$ is substantially suboptimal, and even tuned $\alpha$ leaves a gap.
  • Hybrid heuristics: e.g., BLFF+BM and BLFF+0.18-UCB achieve within $10$–$20\%$ of Bayes-optimal DP (Jacko, 2019).
  • OFUGLB (Optimistic Frequentist Upper-bound for Generalized Linear Bandits): Constructs a likelihood-ratio confidence sequence for each arm and pulls the arm with the highest upper confidence bound on success probability. With high probability,

$$R_T = O\left(K \sqrt{T \log (S T/K)} + K^2 \log T\right)$$

for two-state Bernoulli bandits, matching the optimal UCB rates up to lower-order terms and avoiding polynomial dependence on $S$ (the constraint on the logistic parameter) (Lee et al., 2024).

For moderate $T$, exact DP achieves near-constant regret; UCB with standard $\alpha$ can be $7$–$12\times$ worse than DP, while tuned UCB still incurs $\sim 2\times$ higher regret (Jacko, 2019).
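
For concreteness, the Thompson-sampling and UCB rules from the list above can be sketched as follows (a minimal illustration; the horizon, seed, and priors are arbitrary choices of this example):

```python
import math
import random

def thompson_pull(state):
    """state = [(alpha_1, beta_1), (alpha_2, beta_2)]: draw one sample per
    posterior and play the argmax."""
    draws = [random.betavariate(a, b) for a, b in state]
    return max(range(len(state)), key=lambda i: draws[i])

def ucb_pull(counts, sums, t, alpha=2.0):
    """Classical UCB; alpha = 2 is the textbook choice the text flags as
    substantially suboptimal in this problem."""
    for i in range(len(counts)):
        if counts[i] == 0:
            return i  # pull each arm once before using the bound
    return max(range(len(counts)),
               key=lambda i: sums[i] / counts[i]
               + math.sqrt(alpha * math.log(t + 1) / counts[i]))

def run_thompson(p, T, seed=0):
    """Total reward of Thompson sampling over T rounds on true means p."""
    random.seed(seed)
    state, total = [(1, 1), (1, 1)], 0
    for _ in range(T):
        i = thompson_pull(state)
        x = int(random.random() < p[i])
        a, b = state[i]
        state[i] = (a + x, b + 1 - x)
        total += x
    return total
```

With a clear gap (e.g., true means 0.7 and 0.3), the sampler concentrates on the better arm quickly, so total reward over 1,000 rounds lands well above the roughly 500 a uniform player would collect.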

5. Exploration–Exploitation Dilemma and Information-Theoretic Policies

The Bayesian formalism yields a rigorous understanding of exploration–exploitation. Monotonicity theorems imply that higher prior mean is inherently more attractive (exploitation), but at equal mean, lower prior weight (greater uncertainty) confers more value (exploration incentive) (Yu, 2011). This quantifies the exploration bonus analytically.

Information-directed sampling (IDS) policies formalize an explicit trade-off between one-step regret and information gain (reduction in posterior entropy). For the symmetric two-state Bernoulli bandit, the IDS policy coincides with the myopic posterior-mean-maximizing rule and achieves bounded cumulative regret. In more challenging settings (e.g., one fair coin and one biased coin), IDS achieves $\Theta(\log T)$ regret, matching the Lai–Robbins lower bound (Hirling et al., 23 Dec 2025). The IDS framework introduces a tuning parameter $\alpha$ to interpolate between exploitation and exploration:

$$\Psi_\alpha(\pi, b) = \frac{\Delta(\pi, b)^{1/\alpha}}{I_\pi(b)^{1/\alpha - 1}}$$

where $\Delta(\pi, b)$ is the expected regret and $I_\pi(b)$ the expected information gain.
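
The generalized ratio can be written down directly once per-action expected regret $\Delta$ and information gain $I$ are available; the sketch below treats them as given inputs (the numbers in the usage line are hypothetical):

```python
def ids_ratio(delta, info, alpha=2.0):
    """Psi_alpha = delta**(1/alpha) / info**(1/alpha - 1). Per the formula
    in the text, alpha = 1/2 recovers the classic regret-squared-over-
    information ratio delta**2 / info."""
    return delta ** (1 / alpha) / info ** (1 / alpha - 1)

def ids_action(deltas, infos, alpha=2.0):
    """Play the action minimizing the information ratio."""
    return min(range(len(deltas)),
               key=lambda i: ids_ratio(deltas[i], infos[i], alpha))

# Hypothetical one-step quantities for two arms: arm 0 has lower regret,
# arm 1 is more informative; alpha trades the two off.
print(ids_action([0.05, 0.20], [0.10, 0.60], alpha=0.5))  # -> 0
```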

6. Generalizations and Variants

Several extensions modify the canonical model:

  • Streaming (Online) Bernoulli Bandit: Each bandit (arm) is encountered exactly once in a stream and, if skipped, cannot be revisited. Threshold-based "skip or stay" policies emerge as nearly optimal, with per-pull expected loss decaying polynomially in the pool size $N$ (not $\sqrt{K}$ as in revisitable MABs). The classical trade-off disappears: exploration is conducted by skipping rather than by repeated sampling (Roy et al., 2017).
  • Dynamic Bernoulli Bandits: Each arm's reward distribution evolves as a two-state Markov chain between high and low success probabilities. Adaptive Forgetting Factor (AFF) algorithms (AFF-$d$-Greedy, AFF-UCB, AFF-TS) discount old observations using a learnable parameter, improving performance over classic algorithms in environments with changing means. Empirically, AFF-based Thompson sampling achieves the best simulated regret under both slow and fast switching (Lu et al., 2017).
  • Frequentist vs. Bayesian Optimality: Bayes-optimal DP is optimal only with respect to the chosen prior; it is not minimax-optimal for fixed parameters, and heuristic rules may outperform DP for some configurations (Jacko, 2019). The Gittins policy does not ensure complete learning in all settings; finite-horizon formulations circumvent this issue.
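
The forgetting idea behind the dynamic-bandit algorithms can be illustrated with a fixed discount on the Beta pseudo-counts; the AFF methods of Lu et al. (2017) learn this factor adaptively, whereas the sketch below fixes it:

```python
def discounted_update(alpha, beta, x, gamma=0.95):
    """Geometrically discount old pseudo-counts before adding the new binary
    observation; the effective sample size is capped near 1 / (1 - gamma),
    so the posterior can track a drifting success probability."""
    return gamma * alpha + x, gamma * beta + (1 - x)

# A stream whose true success rate switches from high to low mid-way:
a, b = 1.0, 1.0
for x in [1] * 50 + [0] * 50:
    a, b = discounted_update(a, b, x)
print(a / (a + b))  # well below 0.5: the old successes have been forgotten
```

With `gamma = 1` this reduces to the standard conjugate update, recovering the stationary model.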

7. Empirical, Computational, and Practical Considerations

Modern implementations can solve the two-state Bernoulli bandit optimally via DP for horizons up to thousands in practical time and memory (e.g., BinaryBandit package in Julia) (Jacko, 2019). Empirical benchmarks confirm that many heuristics under-explore or suffer 2–10× increased regret compared to DP. Efficient index computation (Gittins, Whittle) reduces dimensionality from the full joint state to per-arm subproblems, but nontrivial dynamic programming is still required; closed-form indices are unavailable (Yu, 2011).

Classic myths—such as DP intractability, universal optimality of UCB, or inevitable logarithmic regret growth—are explicitly addressed and refuted in the recent literature. Optimal, near-optimal, and robust algorithmic options now exist across stochastic, adversarial, and dynamic two-state Bernoulli bandit scenarios (Jacko, 2019; Hirling et al., 23 Dec 2025; Lee et al., 2024).


Key References:

  • "Structural Properties of Bayesian Bandits with Exponential Family Distributions" (Yu, 2011)
  • "The Finite-Horizon Two-Armed Bandit Problem with Binary Responses" (Jacko, 2019)
  • "A PDE-Based Analysis of the Symmetric Two-Armed Bernoulli Bandit" (Kobzar et al., 2022)
  • "Online Multi-Armed Bandit" (Roy et al., 2017)
  • "On Adaptive Estimation for Dynamic Bernoulli Bandits" (Lu et al., 2017)
  • "Information-directed sampling for bandits: a primer" (Hirling et al., 23 Dec 2025)
  • "A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits" (Lee et al., 2024)
