Discrete-Time Markov Decision Processes
- Discrete-time MDPs are mathematical models defined by states, actions, transition probabilities, rewards, and a discount factor to optimize sequential decisions in stochastic environments.
- They employ dynamic programming techniques like value iteration and policy iteration, along with linear and convex programming methods, to derive optimal policies.
- Applications span reinforcement learning, control, operations research, and advanced methods such as tensor decompositions to manage high-dimensional state spaces.
A discrete-time Markov Decision Process (MDP) provides the foundational mathematical model for optimal sequential decision-making in stochastic environments. An MDP encodes the dynamics, rewards, and constraints of a system where an agent, at each time step, observes the state, selects an action, receives a reward, and the system transitions probabilistically to a new state. The Markov property ensures that the transition probabilities and rewards depend only on the current state and action, not the full history. MDPs are used extensively for planning, control, and reinforcement learning across operations research, engineering, computer science, and economics.
1. Mathematical Formulation and Components
A finite discrete-time MDP is characterized by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ where:
- $\mathcal{S}$ is the (possibly high-dimensional) state space, typically finite, countable, or a Borel subset of a Polish space (Dufour et al., 2019, Wang et al., 2017).
- $\mathcal{A}$ is the finite set of actions, with $\mathcal{A}(s)$ the actions available in state $s$.
- $P(s' \mid s, a)$ is the Markovian transition kernel, giving the conditional probability of next state $s'$ when action $a$ is taken in state $s$.
- $r(s, a)$ (or $r(s, a, s')$) is the immediate reward function, representing the measurable reward incurred by taking action $a$ in state $s$ (and possibly transitioning to $s'$).
- $\gamma \in [0, 1)$ is a discount factor for infinite-horizon formulations ($\gamma = 1$ is admissible for finite-horizon problems).
A policy $\pi$ maps states (and possibly time) to probability distributions over actions. The standard objective is to maximize the expected cumulative (possibly discounted) reward

$$V^{\pi}(s) = \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \;\middle|\; s_0 = s\right].$$

The Bellman optimality equation for the infinite-horizon discounted setting is

$$V^{*}(s) = \max_{a \in \mathcal{A}(s)} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right].$$

An optimal policy $\pi^{*}$ achieves $V^{\pi^{*}} = V^{*}$ (Wang et al., 2019, Wang et al., 2017).
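As a concrete illustration of the Bellman optimality equation, the following sketch iterates the Bellman operator to its fixed point on a hypothetical two-state, two-action MDP (all transition probabilities and rewards below are invented for the example):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all numbers are invented for illustration.
# P[a, s, s2] = probability of moving from state s to s2 under action a.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
r = np.array([[1.0, 0.0],   # r[s, a]: rewards in state 0 for actions 0, 1
              [0.0, 2.0]])  # rewards in state 1 for actions 0, 1
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality operator until
# the value function stops changing (the operator is a gamma-contraction).
V = np.zeros(2)
while True:
    Q = r + gamma * np.einsum('asp,p->sa', P, V)  # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

pi_star = Q.argmax(axis=1)  # a policy greedy w.r.t. V* is optimal
```

At convergence, $V$ satisfies the optimality equation up to the stopping tolerance, and the greedy policy extracted from the final $Q$ is optimal for this toy model.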
2. Model Variants: Constraints, Rewards, and Structure
Discrete-time MDPs admit numerous extensions:
- Total reward criteria: Expected sum of rewards over finite/infinite horizon (Dufour et al., 2019).
- Risk/variance constraints: Mean-variance optimization substitutes the standard (linear) criterion with objectives involving both the mean and variance of rewards, requiring transformation into an equivalent MDP with modified rewards and discounting (Xia, 2017).
- Continuous state/action: MDPs over Borel spaces require measure-theoretic formalism and often invoke convex analytical methods for analysis and solution (Dufour et al., 2019).
- Multi-dimensional/factored state spaces: High-dimensional MDPs, e.g., product-of-grids, exhibit exponential growth in state space; tensor decompositions or low-dimensional representations address tractability (Kuinchtner et al., 2021, Wang et al., 2017).
- Partial observability and unknown dynamics: If $P$ and/or $r$ are unknown, learning algorithms (reinforcement learning) and Bayesian approaches are employed (Tossou et al., 2019).
- Elaboration-tolerant representations: Higher-level languages (e.g., pBC+) may be compiled into an MDP for efficient solution (Wang et al., 2019).
3. Solution Methods and Computational Aspects
Canonical solution techniques for discrete-time MDPs include:
- Dynamic Programming
- Value Iteration: Iteratively applies the Bellman update until convergence.
- Policy Iteration: Alternates policy evaluation and greedy improvement.
- Linear Programming: For finite-state/action MDPs under total expected reward (ETR) and constraints (Dufour et al., 2019).
- Convex Programming and Occupation Measures: For MDPs with general state spaces and constraints, infinite-dimensional convex programming using occupation measures yields policies with optimality equivalence under continuity-compactness and Slater-type conditions (Dufour et al., 2019).
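The dynamic-programming alternatives above can be contrasted in code. The sketch below implements policy iteration on an invented two-state MDP: exact policy evaluation via a linear solve, alternated with greedy improvement, terminating when the policy is stable (all numbers are illustrative assumptions):

```python
import numpy as np

# Illustrative MDP: P[a, s, s2] transition probabilities, r[s, a] rewards.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
n_states = 2

pi = np.zeros(n_states, dtype=int)  # arbitrary initial deterministic policy
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
    P_pi = P[pi, np.arange(n_states), :]   # (S, S) transitions under pi
    r_pi = r[np.arange(n_states), pi]      # (S,) rewards under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Greedy improvement step.
    Q = r + gamma * np.einsum('asp,p->sa', P, V)
    pi_new = Q.argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break                              # policy stable => optimal
    pi = pi_new
```

Unlike value iteration, each outer loop here pays for an exact $O(|\mathcal{S}|^3)$ linear solve but typically converges in very few policy changes.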
Complexity and tractability are influenced by:
- State/action space size: Tabular dynamic programming scales as $O(|\mathcal{S}|^2 |\mathcal{A}|)$ per sweep or worse, necessitating compact representations for large structured problems.
- Structure exploitation: Tensor decompositions (e.g., CANDECOMP/PARAFAC, CP) compress the transition dynamics, yielding memory and runtime savings by replacing dense $O(|\mathcal{S}|^2 |\mathcal{A}|)$ storage with storage linear in the decomposition rank $R$, as well as accelerating Bellman updates via low-rank contractions (Kuinchtner et al., 2021).
- Approximate and sample-based optimization: When exact DP is intractable, policy-gradient (REINFORCE), actor-critic, Monte Carlo Tree Search, and other sample-efficient RL algorithms are used, with regret guarantees in the online/unknown case (Ferrara et al., 2025, Tossou et al., 2019).
- Low-dimensional embeddings: Alternating deep neural networks (ADNNs) or other dimensionality reduction techniques may construct sufficient statistics or compressed state spaces, maintaining optimality guarantees under conditional independence assumptions (Wang et al., 2017).
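To make the low-rank contraction point concrete, the sketch below (sizes and rank are invented; this is a generic rank-$R$ factorization, not the exact CP scheme of Kuinchtner et al.) builds a row-stochastic transition matrix in factored form and performs the Bellman expectation step without ever materializing the dense kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
S, R = 1000, 5  # invented sizes: 1000 states, rank-5 factors for one action

# Construct a rank-R row-stochastic transition matrix P = U @ W.T directly
# in factored form: scale U so that every row of U @ W.T sums to 1.
U = rng.random((S, R))
W = rng.random((S, R))
U /= (U @ W.sum(axis=0))[:, None]   # row sums of U @ W.T are now exactly 1

V = rng.random(S)

# Dense expectation step: O(S^2) work and O(S^2) memory.
P_dense = U @ W.T
expect_dense = P_dense @ V

# Factored contraction: O(S * R) work, never materializing P.
expect_lowrank = U @ (W.T @ V)

assert np.allclose(expect_dense, expect_lowrank)
```

The two contractions agree exactly; the factored path touches $2SR$ numbers instead of $S^2$, which is the source of the memory and runtime savings claimed for tensor-compressed MDPs.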
4. Theoretical Properties and Policy Structure
MDPs possess several key structural and theoretical properties:
- Markov Property: Transition probabilities and rewards are determined by the present state-action pair $(s_t, a_t)$, not the full trajectory, enforcing memorylessness.
- Policy Regularity: Under standard conditions, optimal (or $\varepsilon$-optimal) stationary deterministic or randomized policies suffice. For mean-variance-constrained problems, optimality may be achieved within the set of stationary deterministic policies; randomization yields no advantage (Xia, 2017).
- Sufficiency and State Compression: If there exists a feature map $\phi$ such that the reward $r_t$ and the next compressed state $\phi(s_{t+1})$ are conditionally independent of the full state $s_t$ given $(\phi(s_t), a_t)$, then the MDP admits an equivalent low-dimensional representation, and optimal policies can be lifted from the compressed to the original space (Wang et al., 2017).
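A minimal numerical sketch of state compression, using an invented four-state chain with a single action in which $\phi(s) = \lfloor s/2 \rfloor$ is a sufficient statistic (states within a $\phi$-class share rewards and class-level dynamics):

```python
import numpy as np

# Toy 4-state, single-action chain (numbers invented): states {0,1} and {2,3}
# have identical rewards and identical transition mass onto each phi-class.
P = np.array([[0.35, 0.35, 0.15, 0.15],
              [0.35, 0.35, 0.15, 0.15],
              [0.10, 0.10, 0.40, 0.40],
              [0.10, 0.10, 0.40, 0.40]])
r = np.array([1.0, 1.0, 0.0, 0.0])
gamma = 0.9

# Full-model evaluation (one action, so this is exact policy evaluation).
V = np.linalg.solve(np.eye(4) - gamma * P, r)

# Compressed model over phi-classes {0,1} -> 0 and {2,3} -> 1.
P_c = np.array([[0.7, 0.3],
                [0.2, 0.8]])
r_c = np.array([1.0, 0.0])
V_c = np.linalg.solve(np.eye(2) - gamma * P_c, r_c)

# Values are constant on phi-classes and agree with the compressed model.
assert np.isclose(V[0], V[1]) and np.isclose(V[2], V[3])
assert np.allclose(V_c, V[[0, 2]])
```

Because the sufficiency condition holds, solving the two-state compressed chain recovers the exact values of the four-state original, which is precisely the lifting property described above.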
A table summarizing MDP problem classes and solution approaches:
| MDP Variant | Solution Approach | Key Reference |
|---|---|---|
| Unconstrained, tabular | Value/Policy Iteration | (Wang et al., 2019) |
| Multi-dimensional | CP decomposition, tensor DP | (Kuinchtner et al., 2021) |
| Mean-variance | Equivalent DP via reward lift | (Xia, 2017) |
| Large state, unknown | ADNN, Q-learning, RL | (Wang et al., 2017) |
| Constrained, Borel | Convex programming, occupation | (Dufour et al., 2019) |
5. Applications and Empirical Studies
MDPs model a broad spectrum of stochastic sequential control problems:
- Satellite collision avoidance: MDPs with continuous states, discrete actions, and finite horizon guide early collision avoidance maneuvers, optimizing a trade-off between fuel and risk. Policy-gradient REINFORCE methods directly optimize expected costs under domain-specific dynamics (Ferrara et al., 2025).
- Fair resource allocation: Finite-horizon MDPs encode online allocation with fairness objectives. Regularized, model-based heuristics (SAFFE, SAFFE-D) utilize analytic and concentration-based guarantees to achieve near-optimal Nash social welfare (Hassanzadeh et al., 2023).
- Healthcare interventions: High-dimensional temporal state MDPs are compressed via learned neural feature maps, allowing practical policy optimization and interpretation in mobile health trials (Wang et al., 2017).
- Undiscounted, unknown MDPs: Bayesian optimistic RL (e.g., BUCRL) achieves minimax-optimal regret rates by constructing confidence sets via Beta/Binomial quantiles and solving optimistic MDPs in extended value iteration (Tossou et al., 2019).
- Logic-based planning: Probabilistic action language pBC+ enables declarative, elaboration-tolerant modeling, with translation to MDPs for efficient policy computation (Wang et al., 2019).
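The policy-gradient approach mentioned in the applications above can be sketched at its simplest: a one-state, two-action problem (all reward values invented, and much simpler than the satellite domain) where REINFORCE nudges softmax policy parameters along the score-function gradient estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-state problem with two actions of unequal mean reward.
true_means = np.array([1.0, 2.0])   # invented expected rewards
theta = np.zeros(2)                 # softmax policy parameters
alpha = 0.1                         # step size

for _ in range(2000):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    a = rng.choice(2, p=probs)
    reward = true_means[a] + rng.normal(0.0, 0.1)
    # Score function for a softmax policy: grad log pi(a) = one_hot(a) - probs.
    grad_log = -probs
    grad_log[a] += 1.0
    theta += alpha * reward * grad_log  # REINFORCE update (no baseline)

probs = np.exp(theta - theta.max())
probs /= probs.sum()
```

After training, the policy concentrates on the higher-reward action; in practice a baseline or actor-critic variant would be used to reduce the variance of this gradient estimate.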
6. Advanced Topics and Recent Developments
Recent research directions include:
- Tensor algebra in MDPs: CP, tensor-train, and Tucker decompositions enhance scalability, particularly for structured or multidimensional state spaces. Empirical results indicate up to 90% memory savings and up to 80% runtime improvements compared to tabular methods for large GridWorlds when the decomposition rank $R$ is small relative to the state-space size, with practical upper bounds on $R$ obtained by storing the transitions of each state-action pair as rank-one components (Kuinchtner et al., 2021).
- Convex duality and occupation measures: Infinite-dimensional linear or convex programs over signed measure variables extend the reach of MDP theory into general Borel spaces, supporting relaxed continuity, unboundedness, and even signed reward/constraint functions. Under Slater-type regularity, optimal stationary (randomized) policies can be recovered (Dufour et al., 2019).
- Policy structure for constrained problems: For mean-variance optimization, constraint decoupling reduces the feasible policy space to products over states, supporting efficient policy iteration. Deterministic policies suffice for minimal variance, as the variance difference is linear over mixing coefficients (Xia, 2017).
- Sample-efficient learning and regret bounds: Optimistic RL algorithms, including Bayesian approaches (BUCRL), achieve the minimax regret $\tilde{O}(\sqrt{DSAT})$, where $D$ is the diameter, $S$ the state count, $A$ the action count, and $T$ the time horizon. Sharper deviation bounds (e.g., KL, Beta/Binomial quantiles) improve confidence intervals and optimize regret constants (Tossou et al., 2019).
- Practical integration: Systems such as pbcplus2mdp provide automated compilation from high-level logic-based domain modeling (pBC+) to explicit MDPs, leveraging established solvers for policy computation (Wang et al., 2019).
7. Limitations and Open Challenges
Despite their analytical tractability and expressive power, discrete-time MDPs are subject to several limitations:
- State and action “curse of dimensionality” impedes classical DP in large or continuous domains, mitigated yet not eliminated by factorization, compression, or function approximation (Kuinchtner et al., 2021, Wang et al., 2017).
- Efficient GPU acceleration for tensor-based solvers is limited by the prevalence of small inner products and memory bandwidth/communication overhead (Kuinchtner et al., 2021).
- Extending compact solutions to broad dynamic languages and richer stochastic planning domains (e.g., RDDL-format probabilistic planning) remains an area of active research (Kuinchtner et al., 2021, Wang et al., 2019).
- Bayesian RL with high-confidence frequentist guarantees is technically challenging, especially regarding tightness of quantile confidence sets and their computational cost (Tossou et al., 2019).
- Further exploration of nonadditive and risk-sensitive criteria (beyond mean-variance and standard reward maximization) necessitates new analytic and algorithmic frameworks (Xia, 2017).
Discrete-time MDPs remain an indispensable tool in the study and solution of sequential stochastic optimization problems, with active advances in computational, theoretical, and application-specific methodologies reflected in current research literature.