Markov–Bandit Framework
- Markov–Bandit Framework is a unified paradigm that combines hidden Markov processes with bandit arm selection for sequential decision-making in dynamic environments.
- It addresses the challenge of balancing exploration and exploitation under partial feedback and uncertainty about state dynamics and reward distributions.
- Algorithmic approaches leverage spectral methods, UCB indices, and mirror descent to achieve theoretical regret bounds and practical performance.
The Markov–Bandit Framework is a unified paradigm for sequential decision-making in non-stationary and structured stochastic environments, characterized by the interplay between Markovian state evolution (often unobserved or partially observed) and bandit-style arm selection. This framework encompasses a broad class of problems where the agent observes only partial feedback about a system whose underlying dynamics follow a Markov process, and must balance exploration and exploitation without full knowledge of the dynamics or reward distributions. Models under this framework arise in regime-switching bandits, Markovian reward bandits, restless bandits, expert selection in MDPs, multi-objective control, and more. Theoretical and algorithmic advances draw on tools from hidden Markov models, spectral estimation, online learning, index policies, and regret analysis.
1. Model Variants and Formal Definitions
Several archetypal Markov–Bandit models appear in the literature, each distinguished by how Markovian structure interacts with the bandit learning protocol:
- Regime-Switching Bandits: An unobservable, finite-state Markov chain modulates the reward distributions of arms. At time $t$ the agent chooses arm $a_t$ and receives reward $r_t \sim \nu_{a_t, s_t}$, where the $\nu_{a,s}$ are arm- and state-dependent distributions. The agent does not observe the state $s_t$ (Zhou et al., 2020).
- Rested Markovian Bandits: Each arm is a finite-state Markov chain, which transitions only when played, and gives state-dependent rewards. The player observes state and reward only for the chosen arm (Tekin et al., 2010).
- Restless Bandits: All arms' states evolve at every time step, regardless of which arm is played, with either hidden (not directly observable) or observable state transitions, as in the Feedback MAB (0711.3861), UoI scheduling (Chen et al., 2021), and playout recommendation RMABs (Meshram et al., 2017).
- Expert Selection Bandits in MDPs: The agent selects from a set of fixed expert policies to run in episodes on an MDP. Each expert induces different Markov chain dynamics over states, and episode rewards are averaged (Rubies-Royo et al., 2020, Mazumdar et al., 2017).
- Aggregate / Distorted Feedback MDPs: The agent selects policies for an episodic MDP, but observes only aggregate (trajectory-level) bandit feedback, often under adversarial loss or model distortion (Cohen et al., 2021).
- Multi-objective and Constrained MDPs (Markov–Bandit games): The agent faces multiple objective functions or constraints, often modeled as a zero-sum game where one player chooses a constraint (“bandit”) and the other chooses actions (“MDP”) (Gattami et al., 2019).
Common to all is the synthesis of bandit regret minimization with Markovian or POMDP structure, necessitating learning both model parameters and optimal decision strategies.
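As a concrete instance of the interaction loops above, the following minimal sketch simulates a regime-switching bandit: a hidden two-state Markov chain modulates arm means while the learner observes only rewards. The transition matrix, means, and noise level are illustrative values, not taken from any cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative regime-switching bandit: a hidden 2-state Markov chain
# modulates each arm's mean reward (all numbers are made up).
P = np.array([[0.95, 0.05],      # transition matrix of the hidden chain
              [0.10, 0.90]])
means = np.array([[1.0, 0.2],    # means[arm, state]: arm 0 best in state 0,
                  [0.3, 0.9]])   # arm 1 best in state 1

def step(state, arm):
    """Draw a reward for `arm` in the current hidden `state`, then
    advance the hidden chain one step. The learner sees only the reward."""
    reward = rng.normal(means[arm, state], 0.1)
    next_state = rng.choice(2, p=P[state])
    return reward, next_state

state, total = 0, 0.0
for t in range(1000):
    arm = rng.integers(2)        # placeholder policy: uniform random play
    r, state = step(state, arm)
    total += r
```

Because the state is never revealed, any learner facing this environment must infer the regime from the action–reward stream alone, which is exactly what makes the problem a POMDP.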
2. Learning Challenges and Exploitation–Exploration Structure
The principal challenge in Markov–Bandit problems is the agent’s ignorance of key model components:
- Hidden State: The Markov process is often unobserved, making inference a POMDP. The agent must maintain and update a posterior belief over regimes based only on action–reward history (Zhou et al., 2020).
- Parameter Uncertainty: Transition matrices, reward kernels, or emission parameters are unknown and must be estimated online for belief updates and policy optimization.
- Partial/Delayed Feedback: The agent may only observe cumulative or aggregate rewards, not individual rewards per state-action, or only upon playing certain arms (Cohen et al., 2021, 0711.3861).
- Structure Exploitation: In high-dimensional or structured action spaces, side-information or problem structure (e.g., known support sets or geometry) is critical to avoid exponential sample complexity (Yemini et al., 2019).
Learning methods must address the dual estimation–control nature: constructing parameter estimates from exploration data, updating beliefs/policies, and optimizing arm selection or actions in light of uncertainty.
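For the hidden-state challenge, the core primitive is a Bayes filter over regimes. The sketch below shows one filtering step under assumed-known Gaussian emissions and transition matrix (hypothetical parameters); in practice these quantities would themselves be estimated, e.g., by the spectral methods of Section 3.

```python
import numpy as np

# Hypothetical two-regime belief filter: after pulling arm `arm` and
# seeing `reward`, condition the posterior over the hidden state on the
# reward, then propagate it through the (assumed known) transitions P.
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])
means = np.array([[1.0, 0.2],
                  [0.3, 0.9]])   # means[arm, state], illustrative values
sigma = 0.1

def gauss_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def belief_update(b, arm, reward):
    """One step of the Bayes filter on the action–reward history."""
    likelihood = gauss_pdf(reward, means[arm], sigma)  # per-state likelihood
    posterior = b * likelihood
    posterior /= posterior.sum()      # condition on the observed reward
    return posterior @ P              # predict the next hidden state

b = np.array([0.5, 0.5])
b = belief_update(b, arm=0, reward=0.95)  # reward near arm 0's state-0 mean
```

A reward close to arm 0's state-0 mean concentrates the belief on state 0, after which the transition step re-injects a little uncertainty; this belief vector is the sufficient statistic on which belief-POMDP policies act.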
3. Algorithmic Techniques: Spectral Methods, Bandit Indices, and Mirror Descent
Algorithmic solutions in the Markov–Bandit framework leverage a variety of advanced estimation and control techniques:
- Spectral Method-of-Moments: Used for learning parameters in hidden Markov models. Action–reward pairs are embedded as observations in an HMM; empirical moments yield estimates of transition and emission matrices via tensor decomposition. These methods achieve finite-sample error rates for mean and transition estimates (Zhou et al., 2020).
- UCB and KL-UCB for Markovian Rewards: Sample-mean or Kullback–Leibler upper confidence index policies are adapted to cope with Markovian dependence, using concentration inequalities for Markov chains, with logarithmic regret under proper mixing and gap conditions (Tekin et al., 2010, Roy et al., 2020).
- Whittle Index and Approximate Policies: For restless or partially observable bandits, index policies are derived via Lagrangian relaxation, Bellman equations, and verification of indexability. For certain classes, provable optimality or $2$-approximation guarantees are achievable (Chen et al., 2021, 0711.3861, Meshram et al., 2017).
- Episodic UCB for Expert Selection: In the context of switching among expert policies on MDPs, UCB-style index algorithms with confidence bounds and episodic evaluation deliver logarithmic regret in the number of episodes (Rubies-Royo et al., 2020, Mazumdar et al., 2017).
- Distorted Linear Bandits and Mirror Descent: In online MDPs with adversarially distorted or aggregate bandit feedback, the occupancy measure optimization is reduced to a “distorted linear bandit,” solved using online mirror descent with self-concordant barrier regularization and increasing learning rates (Cohen et al., 2021).
- Bayesian Hypothesis Testing and Thompson Sampling: For environments of unknown structure—CB versus MDP—Bayesian evidence integration and Thompson Sampling over hypotheses and model parameters alternates between bandit and MDP strategies (Zhang et al., 2022).
This diversity of approaches illustrates the flexibility and technical richness of the Markov–Bandit framework.
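To make the index-policy idea concrete, here is a minimal sample-mean UCB loop. For readability it draws i.i.d. Gaussian rewards with hypothetical means; the Markovian versions of Tekin et al. (2010) use the same index structure with confidence widths tied to the chains' mixing times, which are not modeled here.

```python
import numpy as np

def run_ucb(reward_fn, n_arms, horizon):
    """Sample-mean UCB1 sketch: play each arm once, then pull the arm
    with the highest mean-plus-confidence-bonus index."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    choices = []
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                   # initialization round
        else:
            idx = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(idx))
        r = reward_fn(arm)
        counts[arm] += 1
        sums[arm] += r
        choices.append(arm)
    return choices

rng = np.random.default_rng(1)
# Stand-in i.i.d. Gaussian arms (means 0.3 and 0.7); a Markovian variant
# would replace reward_fn with a draw from a per-arm chain.
choices = run_ucb(lambda a: rng.normal((0.3, 0.7)[a], 0.1),
                  n_arms=2, horizon=500)
```

The confidence bonus shrinks as an arm accumulates pulls, so the suboptimal arm is sampled only often enough to keep its index below the leader's, which is the mechanism behind the logarithmic regret bounds above.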
4. Regret Bounds and Theoretical Guarantees
Tight regret guarantees are achieved under varied technical conditions depending on the model:
| Model/Algorithm | Regret/Guarantee | Notes |
|---|---|---|
| Regime-Switching Bandits (SEEU) (Zhou et al., 2020) | $\tilde{O}(T^{2/3})$ | Belief-POMDP; spectral estimation; optimism-in-bounds |
| Rested Markov Bandits (UCB) (Tekin et al., 2010) | Logarithmic in $T$ | Requires lower bound on Markov chain gap |
| TV-KL-UCB for Markov & i.i.d. (Roy et al., 2020) | Logarithmic in $T$ | Adaptive; optimal constants in both regimes |
| Restless/Feedback Bandits (0711.3861) | $2$-approximation | Balanced dual LP; indexability; Lyapunov method |
| Expert Selection in MDPs (Rubies-Royo et al., 2020, Mazumdar et al., 2017) | Logarithmic in $K$ (episodes) | Mixing corrections; episodic mean estimation |
| Online MDPs (DLB + Mirror Descent) (Cohen et al., 2021) | $\tilde{O}(\sqrt{T})$ | Polynomial horizon/state-space factors; adversarial losses |
| UCRL-style for hidden Markov bandit (Yemini et al., 2019) | Logarithmic in $T$ | Exploits linear reward structure, side information |
| Markov–Bandit TS (CB/MDP unknown) (Zhang et al., 2022) | CB-optimal or MDP-optimal rate | Automatically recovers best rate once environment class identified |
These results are supported by explicit decompositions of exploration and exploitation phases, the use of high-probability confidence sets, and fine-grained control of belief and parameter estimation errors.
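The bookkeeping behind these bounds is pseudo-regret: the cumulative gap between the best fixed arm's mean and the means of the arms actually played. A minimal illustration with hypothetical arm means and an arbitrary action sequence:

```python
import numpy as np

# Pseudo-regret: cumulative mean-gap of the arms played versus the best
# fixed arm (illustrative means and action sequence, not from any paper).
mu = np.array([0.3, 0.7])          # arm means; arm 1 is optimal
pulls = [0, 1, 1, 0, 1, 1, 1, 1]   # hypothetical action sequence
regret = np.cumsum(mu.max() - mu[pulls])
# Two pulls of the suboptimal arm, gap 0.4 each -> final regret 0.8.
```

Regret curves of this form (cumulative, flattening once the optimal arm dominates) are what the logarithmic and sublinear guarantees in the table constrain.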
5. Practical and Structural Extensions
Key extensions and structural insights arising from the literature include:
- High-dimensional Action/State Spaces: Algorithms that exploit problem structure (e.g., action set convexity, local support sets) avoid dependency on the combinatorial size of the space (Yemini et al., 2019).
- Aggregate/Semi-bandit Feedback: Aggregate reward feedback after a trajectory (e.g., in online MDPs) can be handled by convex relaxation and bandit linear optimization over occupancy measures (Cohen et al., 2021).
- Restless and Indexable Bandits: Threshold and index policies provide tractable controller design even for PSPACE-hard restless bandit problems, under monotonicity and separability (0711.3861).
- Corrupted Contexts and State Evolution: Hybrid schemes that dynamically arbitrate between context-based bandits and Markovian state tracking are robust to context corruption or hidden states (Galozy et al., 2020).
- Constrained and Multi-objective Control: Markov–Bandit games formalize zero-sum interactions between agent and “bandit” opponent over constraints, with convergent Q-learning methods (Gattami et al., 2019).
- Hybrid and Adaptive Settings: Bayesian hypothesis testing and adaptive schemes interpolate between CB and MDP learning when environmental structure is unknown (Zhang et al., 2022).
These principles permit Markov–Bandit algorithms to scale gracefully, support structured policy classes, and adapt to feedback modalities encountered in real-world sequential decision settings.
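For the aggregate-feedback extension, the workhorse update is online mirror descent. Cohen et al. (2021) run OMD over occupancy measures with a self-concordant barrier; the sketch below uses the simpler entropic regularizer over the probability simplex (i.e., exponentiated gradient) purely to convey the update rule, with made-up loss vectors.

```python
import numpy as np

def omd_entropic(losses, eta=0.1):
    """Online mirror descent with the entropic regularizer over the
    simplex: multiplicative update by exp(-eta * loss), then renormalize."""
    d = losses.shape[1]
    x = np.full(d, 1.0 / d)
    iterates = []
    for g in losses:                  # g: loss (gradient) vector this round
        x = x * np.exp(-eta * g)      # mirror-descent step in dual space
        x /= x.sum()                  # Bregman projection onto the simplex
        iterates.append(x.copy())
    return np.array(iterates)

# Coordinate 0 consistently suffers the higher loss, so the iterate
# should shift its mass toward coordinate 1 over 100 rounds.
L = np.tile(np.array([1.0, 0.0]), (100, 1))
xs = omd_entropic(L)
```

Replacing the two-coordinate simplex with the polytope of occupancy measures, and the entropic regularizer with a self-concordant barrier, recovers the structure of the distorted-linear-bandit algorithm described above.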
6. Applications and Empirical Performance
Realized applications span recommender systems, proactive vision tasks, scheduling under staleness penalties, resource allocation, and expert selection in control environments. Notable empirical findings include:
- Regime-Switching Bandits (Zhou et al., 2020): Empirical regret matches theoretical prediction, outperforming baselines in Markovian non-stationarity.
- Expert Selection in Atari (Rubies-Royo et al., 2020): Episodic UCB quickly matches the performance of the best Q-network even under occlusions.
- Restless Bandit Index Policies (Chen et al., 2021, Meshram et al., 2017): Whittle index policies achieve near-optimal (<1% regret) and outperform myopic/round-robin policies under realistic process noise.
- Hybrid Markov–Bandit–MDP Learning (Zhang et al., 2022): Adaptive methods efficiently differentiate between CB and MDP structure, achieving best-of-both-worlds regret depending on the underlying environment.
Many of these methods are robust to parameter settings, scale with dimension, and adapt to changing environments.
7. Outlook: Open Questions and Research Directions
Active areas for expansion include:
- Generalization to nonstationary or restless environments with richer side constraints
- Partial observability with correlated or non-Markov latent processes
- Extension to multi-agent, multi-objective, or adversarial settings
- Connections to black-box expert selection, contextual bandits with staleness, and meta-learning
The Markov–Bandit paradigm remains central for the study of learning and control in dynamic, uncertain environments with structure intermediate between traditional bandit problems and general MDPs. The synthesis of HMM estimation, bandit theory, convex optimization, and Bayesian adaptation makes it a fertile domain for both theoretical advances and practical algorithm design.