Information-Theoretic Bayesian Regret Bound
- The information-theoretic Bayesian regret framework quantifies the trade-off between information acquisition and decision quality in sequential learning problems.
- It leverages entropy and mutual information to derive tight, often minimax optimal, regret bounds across settings such as bandits, reinforcement learning, and partial monitoring.
- The approach informs modern algorithm design by unifying and sharpening performance limits through information measures, guiding trade-offs in exploration and exploitation.
An information-theoretic Bayesian regret bound characterizes the fundamental trade-off between information acquisition and decision quality in sequential decision-making problems under uncertainty, using information quantities such as entropy and mutual information. This approach yields tight, often minimax optimal, regret bounds for a wide range of settings—including stochastic bandits, partial monitoring, contextual bandits, reinforcement learning, and nonstationary/adaptive environments. The framework unifies and extends classical results and underlies the design of modern algorithms exploiting the exploration–information–regret interplay.
1. Fundamental Definitions and General Information-Ratio Bound
Let $A^\star$ denote the Bayes-optimal action under the environment prior, and let $A_t$ and $\mathcal{H}_t$ denote the action chosen and the history up to round $t$. The Bayesian (expected) cumulative regret of a policy $\pi$ over $T$ rounds, with respect to a prior $\nu$ on the environment, is
$$\mathrm{BR}_T(\pi) = \mathbb{E}\left[\sum_{t=1}^{T} \big(\mu(A^\star) - \mu(A_t)\big)\right],$$
where $\mu(a)$ is the mean reward of action $a$ and the expectation is over the prior, the observations, and the policy's randomization.
Taking the supremum over priors yields the minimax, or worst-prior, Bayesian regret.
The canonical information-theoretic regret analysis, due to Russo and Van Roy, introduces the information ratio
$$\Gamma_t = \frac{\big(\mathbb{E}_t\left[\mu(A^\star) - \mu(A_t)\right]\big)^2}{I_t\big(A^\star;\, (A_t, Y_t)\big)},$$
where $\mathbb{E}_t[\mu(A^\star) - \mu(A_t)]$ is the conditional expected instantaneous regret given the history, and $I_t(A^\star; (A_t, Y_t))$ is the mutual information gained about the optimal action (or parameter) from the current action–feedback pair $(A_t, Y_t)$.
Summing over rounds and applying the Cauchy–Schwarz inequality together with the chain rule of mutual information yields the archetypal bound
$$\mathrm{BR}_T \le \sqrt{\bar\Gamma\, H(A^\star)\, T},$$
where $H(A^\star)$ is the Shannon entropy of the optimal action under the prior and $\bar\Gamma = \max_{t \le T} \Gamma_t$. This result underpins a host of regret bounds in bandits, RL, and their generalizations (Dong et al., 2018, Lattimore et al., 2019, Neu et al., 2022).
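For concreteness, the one-step quantities in the information ratio can be computed exactly by enumeration in a tiny discrete model. The following is a minimal sketch, assuming a hypothetical two-environment, two-arm Bernoulli bandit with a uniform prior and Thompson sampling at the first round; all numbers are illustrative.

```python
import numpy as np

# Environments: each row gives the Bernoulli means of the two arms (illustrative).
envs = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
prior = np.array([0.5, 0.5])          # prior over environments

# Optimal arm in each environment and the induced law of A*.
a_star = envs.argmax(axis=1)
p_astar = np.array([prior[a_star == a].sum() for a in range(2)])

# Thompson sampling plays arm a with probability P(A* = a).
p_action = p_astar.copy()

# Expected one-step regret: E[mu(A*) - mu(A_t)] (action drawn independently
# of the environment, given the history).
mu_star = (prior * envs.max(axis=1)).sum()
mu_play = (prior[:, None] * envs * p_action[None, :]).sum()
regret = mu_star - mu_play

def entropy(p):
    p = p[p > 1e-12]
    return -(p * np.log(p)).sum()          # nats

# Mutual information I(A*; (A, Y)) = H(A*) - sum_{a,y} p(a,y) H(A* | a, y).
H_prior = entropy(p_astar)
H_post = 0.0
for a in range(2):
    for y in (0.0, 1.0):
        lik = np.where(y == 1.0, envs[:, a], 1 - envs[:, a])
        joint = prior * p_action[a] * lik  # joint weight of each env with (a, y)
        p_ay = joint.sum()
        post_env = joint / p_ay
        post_astar = np.array([post_env[a_star == b].sum() for b in range(2)])
        H_post += p_ay * entropy(post_astar)
info_gain = H_prior - H_post
print(f"regret={regret:.4f}  info={info_gain:.4f}  ratio={regret**2/info_gain:.4f}")
```

The computed ratio respects the classical Russo–Van Roy guarantee that Thompson sampling's information ratio is at most $K/2$ (here $1$, with information in nats).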
2. Extensions: Rate-Distortion and Chaining
For large or continuous action spaces, the prior entropy may be infinite. The information-theoretic analysis is then refined via rate-distortion theory (Dong et al., 2018, Gouverneur et al., 4 Feb 2025) or chaining (Gouverneur et al., 2024).
- Compression/statistic-based bound: For an $\varepsilon$-partitioning of actions or parameters into at most $N(\varepsilon)$ cells, with corresponding coarsened optimal action $\tilde A^\star$,
$$\mathrm{BR}_T \le \sqrt{\bar\Gamma_\varepsilon\, H(\tilde A^\star)\, T} + \varepsilon T,$$
where $\bar\Gamma_\varepsilon$ is the worst-case compressed information ratio and $H(\tilde A^\star) \le \log N(\varepsilon)$. The entropy of $\tilde A^\star$ is thus controlled by the metric entropy, i.e., the covering number of the action/parameter set at scale $\varepsilon$ (Gouverneur et al., 4 Feb 2025).
- Chaining: For linear bandits with a compact action set $\mathcal{A} \subset \mathbb{R}^d$, a chaining argument telescopes the regret over a hierarchy of covers: letting $N(\mathcal{A}, \varepsilon)$ denote the covering number of $\mathcal{A}$ at scale $\varepsilon$, the bound takes the form
$$\mathrm{BR}_T \lesssim \sum_{k \ge 1} 2^{-k} \sqrt{\bar\Gamma\, T \log N\big(\mathcal{A}, 2^{-k}\big)},$$
which recovers the tight $\Theta(d\sqrt{T})$ minimax regret for linear bandits up to logarithmic factors (Gouverneur et al., 2024).
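The discretization-based bound above leaves a free resolution parameter to optimize. A minimal numerical sketch, assuming a $d$-dimensional action set with covering number $N(\varepsilon) \sim (1/\varepsilon)^d$ and illustrative values of $T$, $d$, and $\bar\Gamma$:

```python
import numpy as np

# Evaluate the discretization bound
#   BR_T <= sqrt(Gamma_bar * log N(eps) * T) + eps * T
# over a grid of scales eps, and pick the scale that minimizes it.

def discretized_bound(T, d, gamma_bar, eps):
    log_cover = d * np.log(1.0 / eps)          # metric entropy at scale eps
    return np.sqrt(gamma_bar * log_cover * T) + eps * T

T, d, gamma_bar = 10_000, 5, 2.0               # illustrative values
eps_grid = np.logspace(-6, -0.5, 200)
bounds = discretized_bound(T, d, gamma_bar, eps_grid)
best = eps_grid[bounds.argmin()]
print(f"best eps ~ {best:.2e}, bound ~ {bounds.min():.1f}")
```

The interior minimum reflects the bias–information trade-off: coarser partitions lose $\varepsilon T$ to discretization error, while finer ones inflate the entropy term.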
3. Generalizations: Beyond Bandits (Partial Monitoring, RL, Contextual Bandits)
The information-theoretic framework has been extended beyond basic bandit settings:
- Partial Monitoring: Utilizing a generalized information-variance lemma and a minimax swap theorem, minimax regret in partial monitoring games is reduced to bounding Bayesian regret via information or Bregman-divergence criteria. Consequences include explicit constants for adversarial bandits, cops-and-robbers, and both “easy” and “hard” partial monitoring regimes (Lattimore et al., 2019).
- Reinforcement Learning: For MDPs, information-theoretic bounds relate the minimum Bayesian regret (the gap between Bayes-optimal and learned performance) to the sum over time of divergences between the true and simulated environment/posterior, yielding bounds of the form
$$\mathrm{MBR}_T \le \sum_{t=1}^{T} \sqrt{\tfrac{1}{2}\, \mathbb{E}\big[D_{\mathrm{KL}}\big(P_t \,\|\, \hat P_t\big)\big]},$$
and analogously with the Wasserstein distance $\mathbb{W}_1(P_t, \hat P_t)$ in place of the KL term under Lipschitz rewards. Specializations recover the classical Russo–Van Roy bounds for Bayesian MAB and for RL with partial/structured feedback (Gouverneur et al., 2022, Hao et al., 2022).
- Contextual and Nonstationary Bandits: The relevant information term becomes the mutual information about the latent optimal policy or action process, and the regret rate adapts to the entropy (or entropy rate) of that process. For instance, in nonstationary environments the per-round Bayesian regret is controlled by the entropy rate of the optimal-action sequence:
$$\frac{\mathrm{BR}_T}{T} \le \sqrt{\bar\Gamma \cdot \frac{H(A_1^\star, \ldots, A_T^\star)}{T}}.$$
Hence environments with a low entropy rate (infrequent switches or compressible structure) incur low regret (Min et al., 2023).
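The entropy-rate dependence can be made concrete under a simple switching model. A minimal sketch, assuming (hypothetically) that the optimal arm resamples uniformly among the other $K-1$ arms with probability $p$ each round, so the entropy rate of $(A_t^\star)$ is $H_b(p) + p\log(K-1)$ nats:

```python
import numpy as np

# Entropy rate of the switching optimal-action process (nats per round),
# and the induced per-round regret bound sqrt(Gamma_bar * entropy_rate).

def entropy_rate(p, K):
    hb = -p * np.log(p) - (1 - p) * np.log(1 - p)   # binary entropy H_b(p)
    return hb + p * np.log(K - 1)

def per_round_bound(p, K, gamma_bar):
    return np.sqrt(gamma_bar * entropy_rate(p, K))

K, gamma_bar = 10, 5.0        # gamma_bar = K/2, the Thompson-sampling ratio bound
rare = per_round_bound(0.001, K, gamma_bar)
frequent = per_round_bound(0.1, K, gamma_bar)
print(f"switch prob 0.001: {rare:.3f}   switch prob 0.1: {frequent:.3f}")
```

As the switching probability shrinks, the entropy rate and hence the per-round bound vanish, matching the qualitative claim above.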
4. Refinements and Problem-Specific Analyses
Information-theoretic Bayesian regret bounds often admit problem-specific sharpening:
- Small-loss (“first-order”) regret: By replacing Shannon entropy with coordinate entropy and introducing scale-sensitive information ratios, one can obtain first-order bounds scaling as $\tilde O(\sqrt{L_T^\star})$, where $L_T^\star$ is the cumulative loss of the best action, in both stochastic and (semi-)bandit settings (Bubeck et al., 2019).
- Gaussian Process Bandits: For Bayesian optimization of a GP-drawn function, the regret is governed by the maximal information gain $\gamma_T$ between the queries and the GP:
$$\mathrm{BR}_T \le O\big(\sqrt{T\, \gamma_T}\big),$$
with $\gamma_T$ scaling with the kernel's spectral decay. For Matérn-$\nu$ kernels, $\gamma_T = \tilde O\big(T^{\frac{d}{2\nu + d}}\big)$, and for the squared-exponential kernel, $\gamma_T = O\big((\log T)^{d+1}\big)$; this yields nearly minimax rates up to logarithmic factors (Vakili et al., 2020, Iwazaki, 2 Jun 2025).
- Collaborative or Hierarchical Problems: In multi-agent linear bandits with hierarchical priors, the regret lower and upper bounds can be precisely characterized as functions of the number of agents $m$, the feature dimension $d$, the horizon $T$, and the degree of heterogeneity across agents, with matching rates in the low-heterogeneity regime (Huang et al., 19 Jun 2025).
- Logarithmic Loss/Prediction: For logarithmic loss, the minimax regret is matched above and below by bounds expressed via the Shtarkov sum (information-theoretic complexity), and the upper bound is achieved by a truncated Bayesian algorithm with a covering-number complexity term (Wu et al., 2022).
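The maximal information gain $\gamma_T$ in the GP-bandit bound above can be computed directly from the kernel matrix: the information a query set $X$ reveals about the GP is $\tfrac{1}{2}\log\det(I + \sigma^{-2} K(X,X))$. A minimal sketch for a squared-exponential kernel on a 1-D grid, with illustrative lengthscale and noise values:

```python
import numpy as np

def se_kernel(x, lengthscale=0.2):
    # Squared-exponential (RBF) kernel matrix on a 1-D point set.
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def info_gain(x, noise_var=0.1):
    # I(y; f) = 0.5 * logdet(I + K / noise_var) for Gaussian observations.
    K = se_kernel(x)
    _, logdet = np.linalg.slogdet(np.eye(len(x)) + K / noise_var)
    return 0.5 * logdet

gains = {T: info_gain(np.linspace(0.0, 1.0, T)) for T in (10, 40, 160)}
for T, g in gains.items():
    print(T, round(g, 2))
```

The gain grows with $T$ but far slower than linearly, which is exactly the sublinearity that makes $\sqrt{T\gamma_T}$ a nontrivial regret bound.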
5. Regret–Information Trade-off, Minimax Duality, and Lower Bounds
Recent work rigorously quantifies the fundamental regret–information trade-off using Fano-type inequalities. For hypothesis spaces or policies with $\Delta$-packing number $M$ and accumulated mutual information $I_T$, the minimax Bayesian regret is lower bounded by a quantity of the form
$$\mathrm{BR}_T \;\gtrsim\; T \Delta \left(1 - \frac{I_T + \log 2}{\log M}\right).$$
Matching upper bounds are obtained by appropriately constrained Thompson sampling. Prior-entropy-dependent bounds, as well as bounds depending on accumulated information, establish that bits of information are “spent” in exchange for units of regret, and both minimax upper and lower bounds, within logarithmic factors, can be expressed in this currency (Shufaro et al., 2024).
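The Fano-type trade-off is easy to evaluate numerically. A minimal sketch with illustrative numbers (horizon, gap, and packing size are all hypothetical):

```python
import numpy as np

# Fano-type lower bound: with a Delta-packing of M hypotheses and I_T nats of
# accumulated information,
#   BR_T >= T * Delta * (1 - (I_T + log 2) / log M),   clipped at zero.

def fano_lower_bound(T, delta, M, info_nats):
    return max(0.0, T * delta * (1 - (info_nats + np.log(2)) / np.log(M)))

T, delta, M = 1000, 0.1, 64
for info in (0.0, 2.0, np.log(M)):
    print(f"I_T = {info:.2f} nats  ->  regret lower bound {fano_lower_bound(T, delta, M, info):.1f}")
```

With no accumulated information the bound is near its maximum $T\Delta$; once roughly $\log M$ nats have been acquired, the bound collapses to zero, capturing the "information spent for regret" exchange.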
The information-theoretic minimax theorem established in the partial monitoring literature links worst-case (minimax) regret to the supremum of Bayesian regret over all priors. Consequently, information-theoretic analysis not only yields Bayesian rates but directly translates into tight worst-case/adversarial minimax regret bounds in general settings (Lattimore et al., 2019).
6. Implications and Open Problems
The information-theoretic approach to Bayesian regret has unified distinct regimes (stochastic, adversarial, structured, nonstationary, partial monitoring) and brought algorithm-agnostic performance limits into sharp focus. It explicates how the environment's intrinsic uncertainty (entropy, rate-distortion, information gain) controls necessary regret and computation for learning.
Open directions include:
- Precise lower bounds and tight constants in complex or high-dimensional models,
- Extending to non-Bayesian (frequentist) guarantees under model misspecification or misspecified priors,
- Practical computation or estimation of relevant information quantities (e.g., entropy rate, rate-distortion function) in large-scale or nonparametric settings,
- Algorithmic strategies for actively trading off regret and information acquisition in domains such as active learning, online decision-making, and bandit-structured reinforcement learning.
7. Table: Representative Problem Classes and Information-Theoretic Regret Bounds
| Problem type | Regret Bound Structure | Key Information Quantity |
|---|---|---|
| Stochastic $K$-armed bandit | $\sqrt{\tfrac{1}{2} K\, H(A^\star)\, T}$ | Entropy $H(A^\star)$ |
| Linear bandits | $\sqrt{\tfrac{1}{2} d\, H(A^\star)\, T}$ or $\tilde O(d\sqrt{T})$ (chaining) | Covering/metric entropy |
| Nonstationary bandits | $\sqrt{\bar\Gamma \cdot \text{entropy rate}}$ per period | Entropy rate of $(A_t^\star)$ |
| Partial monitoring | $\tilde O(\sqrt{T})$, $\tilde O(T^{2/3})$, etc. (regime-dependent) | Diameter/Bregman divergence, entropy |
| Bayesian RL (MDPs) | Sum of per-step divergence terms | KL or Wasserstein divergence per step |
| GP bandits (Bayesian opt) | $O(\sqrt{T\, \gamma_T})$ | Max information gain $\gamma_T$ |
All bounds and information quantities are problem-specific but share a common information-theoretic skeleton, with the central theme: the achievable regret is fundamentally limited by the information needed to resolve ambiguity about optimal actions, policies, or parameters, as measured by entropy, mutual information, or their generalizations (Dong et al., 2018, Lattimore et al., 2019, Gouverneur et al., 2022, Gouverneur et al., 4 Feb 2025, Gouverneur et al., 2024, Neu et al., 2022, Shufaro et al., 2024).