Regularized Markov Decision Processes
- Regularized MDPs are Markov Decision Processes enhanced with a penalty term to enforce smoothness, sparsity, and robustness in policy learning.
- They leverage penalties such as entropy bonuses, KL divergences, Lₚ/L₁ norms, or visitation regularizers to improve sample efficiency and manage uncertainty in reinforcement learning.
- Algorithmic schemes, including regularized policy iteration and actor–critic methods, offer convergence guarantees and tractable surrogates for robust dynamic programming.
A Regularized Markov Decision Process (MDP) augments the standard MDP objective with a penalty that encourages desirable policy or value-function properties, such as smoothness, sparsity, robustness to misspecification, or improved sample efficiency. The regularization term can take various forms—entropy bonuses, relative entropy (KL) penalties, Lₚ or L₁-norm penalties, or temporal and visitation-frequency regularizers—and these play a central role in modern reinforcement learning, robust control, and approximate dynamic programming. Recent research demonstrates that regularization is intricately linked to the theory and practice of robust MDPs: certain robust min-max formulations are exactly equivalent to regularized MDPs with specific penalties. Regularization systematically unifies classical techniques in convex optimization, robust control, mirror descent, and deep RL.
1. Formal Framework and Bellman Operators
The regularized MDP augments the usual discounted-return objective with a convex penalty functional over policies:

$$V^\pi_\Omega(s) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t \big( r(s_t, a_t) - \Omega(\pi(\cdot \mid s_t)) \big) \,\Big|\, s_0 = s \Big],$$

where $\Omega$ is typically strongly convex on the action simplex (e.g., negative entropy, squared norm, or other Bregman generators) (Geist et al., 2019).
The regularized Bellman evaluation and optimality operators are, for value $V$ and policy $\pi$:

$$[T^\Omega_\pi V](s) = \langle \pi(\cdot \mid s), Q_V(s, \cdot) \rangle - \Omega(\pi(\cdot \mid s)), \qquad Q_V(s,a) = r(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}[V(s')],$$

with the maximum characterized via convex conjugacy:

$$[T^\Omega_* V](s) = \max_{\pi_s \in \Delta_A} \big\{ \langle \pi_s, Q_V(s,\cdot) \rangle - \Omega(\pi_s) \big\} = \Omega^*\big(Q_V(s,\cdot)\big).$$

The greedy policy at $V$ is given by the gradient mapping $\nabla \Omega^*(Q_V(s,\cdot))$ (Geist et al., 2019, Li et al., 2019).
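For the concrete case of the (scaled) negative-entropy regularizer, the conjugate $\Omega^*$ is the log-sum-exp and the gradient map is the softmax, so one application of the regularized optimality operator can be sketched as follows (the tabular array shapes and the temperature `tau` are illustrative choices):

```python
import numpy as np

def soft_bellman_optimality(V, r, P, gamma=0.9, tau=0.1):
    """One application of the entropy-regularized optimality operator.

    For Omega(pi) = tau * sum(pi * log pi), the conjugate is
    Omega*(q) = tau * logsumexp(q / tau) and the greedy policy is the
    softmax gradient map grad Omega*(q).
    Shapes: r is (S, A) rewards, P is (S, A, S) transitions, V is (S,).
    """
    Q = r + gamma * P @ V                          # Q_V(s, a) = r + gamma * E[V(s')]
    m = Q.max(axis=1, keepdims=True)               # shift for numerical stability
    V_new = (m + tau * np.log(np.exp((Q - m) / tau).sum(axis=1, keepdims=True))).squeeze(1)
    policy = np.exp((Q - m) / tau)
    policy /= policy.sum(axis=1, keepdims=True)    # softmax greedy policy
    return V_new, policy
```

Iterating this operator converges to the soft-optimal value function; it remains a γ-contraction in the sup norm, just like the unregularized operator.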
2. Equivalence between Robustness and Regularization
Robust MDPs formalize environment uncertainty via max-min objectives over adversarial reward and/or transition perturbations, defined by uncertainty sets such as global or rectangular Lₚ-balls. Foundational results establish:
- Reward-uncertainty equivalence: A robust MDP with fixed transitions and Lₚ-ball reward uncertainty is exactly equivalent to a regularized MDP whose policy regularizer is determined by the dual norm (Derman et al., 2021, Derman et al., 2023, Gadot et al., 2023):

$$\min_{\|r'\|_p \le \alpha} \langle d_\pi, r + r' \rangle = \langle d_\pi, r \rangle - \alpha \|d_\pi\|_q,$$

where $d_\pi$ is the policy occupancy measure and $q$ is the Hölder conjugate of $p$ (i.e., $1/p + 1/q = 1$) (Gadot et al., 2023).
- Transition-uncertainty equivalence: For transition uncertainty, the equivalence introduces a value-dependent (i.e., non-stationary) regularization term: for s-rectangular ball uncertainty of radius $\beta_s$, the penalty scales with $\gamma \beta_s \|V\|$, so the regularizer depends on the current value function. This leads to the "twice-regularized" MDP (R²-MDP) class (Derman et al., 2021, Derman et al., 2023).
- This unification implies that robust policy iteration, regularized value iteration, and policy-gradient schemes can share sample and computational complexity, with regularization yielding tractable surrogates for robust control under structural uncertainty (Kumar et al., 2022, Derman et al., 2021, Derman et al., 2023).
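The reward-uncertainty equivalence above can be verified numerically for $p = 2$ (whose Hölder conjugate is again $q = 2$), since Hölder's inequality is tight at an explicitly computable worst-case perturbation. The occupancy measure, reward vector, and radius below are illustrative toy values:

```python
import numpy as np

# Numerical check of the reward-uncertainty equivalence for p = 2.
rng = np.random.default_rng(1)
n = 6                                    # number of state-action pairs (toy)
d = rng.random(n); d /= d.sum()          # occupancy measure d_pi
r = rng.random(n)                        # nominal reward vector
alpha = 0.3                              # uncertainty radius

# Regularized return: <d, r> minus the dual-norm penalty alpha * ||d||_2.
reg_return = d @ r - alpha * np.linalg.norm(d)

# Robust return: min over the ball ||r'||_2 <= alpha of <d, r + r'>.
# Hoelder's inequality is tight at r' = -alpha * d / ||d||_2.
robust_return = d @ (r - alpha * d / np.linalg.norm(d))
assert np.isclose(robust_return, reg_return)

# No feasible perturbation can push the return below the regularized value.
for _ in range(100):
    rp = rng.standard_normal(n)
    rp *= alpha / np.linalg.norm(rp)     # project onto the ball boundary
    assert d @ (r + rp) >= reg_return - 1e-12
```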
3. Algorithmic Schemes and Convergence Guarantees
Modified Policy Iteration and Mirror Descent
The reg-MPI scheme alternates (a) a regularized greedy step, solving the convex problem $\max_{\pi_s} \langle \pi_s, Q(s,\cdot) \rangle - \Omega(\pi_s)$ at each state, and (b) partial regularized evaluation, applying the policy-evaluation operator $T^\Omega_\pi$ a fixed number of times (Geist et al., 2019, Neu et al., 2017). Under strong convexity of $\Omega$, all necessary operators are γ-contractions. General error-propagation bounds are established for both greedy-step and evaluation-step inaccuracies, yielding geometric convergence rates matching those of unregularized approximate MPI (Geist et al., 2019).
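A minimal tabular sketch of reg-MPI with the negative-entropy regularizer, where the greedy step reduces to a softmax (conventions as above; the function name, shapes, and hyperparameters are illustrative):

```python
import numpy as np

def reg_mpi(r, P, gamma=0.9, tau=0.1, m=5, outer=200):
    """Sketch of reg-MPI with the negative-entropy regularizer (tabular).

    Alternates (a) the regularized greedy step, which for entropy is a
    softmax over Q-values, and (b) m sweeps of partial regularized
    evaluation with T^Omega_pi. m = 1 recovers regularized value
    iteration; m -> infinity recovers regularized policy iteration.
    """
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(outer):
        Q = r + gamma * P @ V                      # Q_V(s, a)
        pi = np.exp((Q - Q.max(axis=1, keepdims=True)) / tau)
        pi /= pi.sum(axis=1, keepdims=True)        # (a) softmax greedy step
        for _ in range(m):                         # (b) partial evaluation
            Q = r + gamma * P @ V
            ent = -(pi * np.log(pi + 1e-12)).sum(axis=1)
            V = (pi * Q).sum(axis=1) + tau * ent   # [T^Omega_pi V](s)
    return V, pi
```

At convergence the returned value satisfies the soft Bellman optimality equation, which provides a direct correctness check.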
Policy Gradients and Actor–Critic for Regularized/RMDP
For smooth regularizers and differentiable occupancy or policy class, policy-gradient and actor–critic implementations become viable:
- For frequency-regularized robust MDPs with an occupancy-measure (visitation) norm regularizer, an actor–critic-style policy gradient is globally convergent under smoothness of the regularized return (Gadot et al., 2023).
- For general regularizers, off-policy actor–critic and entropy-regularized Q-learning with function approximation enjoy provable stationary-point and performance guarantees under two-timescale algorithms and strong-convexity conditions (Xi et al., 2024, Li et al., 2019).
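As a minimal illustration of entropy-regularized policy gradient, the one-state (bandit) case admits an exact gradient and a closed-form soft-optimal policy, softmax(r/τ), for the iterates to converge to. The rewards, temperature, and step size below are illustrative choices, not values from the cited works:

```python
import numpy as np

# Exact policy gradient for an entropy-regularized one-state MDP (a bandit):
# maximize J(theta) = <pi, r> + tau * H(pi) with pi = softmax(theta).
# The soft-optimal policy is pi* = softmax(r / tau).
r = np.array([1.0, 0.5, 0.2])
tau = 0.2
theta = np.zeros(3)
for _ in range(5000):
    p = np.exp(theta - theta.max()); p /= p.sum()   # current policy
    adv = r - tau * np.log(p)                       # soft "advantage" terms
    grad = p * (adv - p @ adv)                      # dJ/dtheta via softmax Jacobian
    theta += 0.5 * grad                             # gradient ascent step
pi_star = np.exp(r / tau); pi_star /= pi_star.sum()
```

The gradient formula follows from the softmax Jacobian ∂p_i/∂θ_j = p_i(δ_ij − p_j); the constant terms cancel because the Jacobian rows sum to zero.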
Temporal, Spatial, and Reference Policy Regularization
- Temporal regularization, which mixes forward and backward Markov transitions in Bellman backups, incurs only bounded bias while providing significant variance reduction without impairing the Bellman contraction, making it useful for high-dimensional or noisy environments (Thodoroff et al., 2018).
- Bayesian regularization via explicit priors, KL penalties toward a reference policy, or entropy bonuses systematically shrinks learned policies towards prior knowledge, reducing the generalization error due to model-estimation noise (Gupta et al., 2022).
4. Special Regularizers: Sparsity, Policy Structure, and Robustness
Sparse and Threshold Policies
- Regularization via Tsallis, polynomial, or trigonometric generators offers precise control of sparsity, multi-modality, and support size in optimal policies:

$$\pi^*(\cdot \mid s) = \nabla \Omega^*\big(Q(s, \cdot)\big),$$

with $\Omega$ designed so that the resulting policy at each state has support controlled by KKT-derived thresholding, interpolating between deterministic, sparse, and fully-supported policies (Li et al., 2019).
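A well-known instance of such KKT thresholding is sparsemax, the Tsallis-2 case $\Omega(p) = \tfrac{1}{2}\|p\|^2$: the greedy map is the Euclidean projection onto the simplex, which assigns exactly zero probability to low-scoring actions. A sketch of this special case (not the exact regularizer family of Li et al., 2019):

```python
import numpy as np

def sparsemax(q):
    """Greedy policy for the Tsallis-2 regularizer Omega(p) = 0.5 * ||p||^2.

    The conjugate gradient map is the Euclidean projection of q onto the
    simplex; the KKT conditions yield the threshold rule
    p_i = max(q_i - lam, 0), so the support is data-dependent and sparse.
    """
    z = np.sort(q)[::-1]                  # scores in decreasing order
    k = np.arange(1, len(q) + 1)
    cum = np.cumsum(z)
    support = z * k > cum - 1             # actions kept by the KKT threshold
    k_star = k[support][-1]               # largest k still in the support
    lam = (cum[k_star - 1] - 1) / k_star  # normalizing threshold
    return np.maximum(q - lam, 0.0)
```

For example, `sparsemax(np.array([1.0, 0.9, -1.0]))` keeps two actions with probabilities 0.55 and 0.45 and drops the third entirely, whereas a softmax would spread mass over all three.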
Frequency-Regularization and Robustness
- For global Lₚ reward-uncertainty, policy regularization by occupancy-measure norms, i.e., adding the penalty $-\alpha \|d_\pi\|_q$ to the return, produces less conservative, globally coupled robustness relative to rectangular (per-state-action) uncertainty. This directly reduces the conservatism of robust policies and provides a theoretical justification for frequency regularization as in robust inverse RL (Gadot et al., 2023).
Policy-Regularized Distributionally Robust MDPs
- Policy regularization combined with adversarial transition uncertainty (e.g., via reference-KL or TV-ball under linear approximation) enables provably efficient online robust RL algorithms (e.g., DR-RPO) with theoretical regret bounds matching value-based methods in d-rectangular linear MDPs (Gu et al., 16 Oct 2025).
5. Applications: Sample Efficiency, Generalization, and Computational Complexity
- Regularized approximate linear programming (e.g., L₁-norm RALP/RALPc) is critical for feature selection and generalization: it prevents overfitting when the candidate basis is large, with homotopy routines yielding efficient solution paths (Petrik et al., 2010).
- In entropy-regularized and mirror-descent-based RL (TRPO, DPP, RAVI-UCB), regularization enables mirror-prox and dual-averaging analysis, offering convergence guarantees and sample complexity in tabular and linear settings (Geist et al., 2019, Moulin et al., 2023, Neu et al., 2017).
- For continuous or high-cardinality state-action spaces, dimension-free learning is achieved by integrating entropy regularization with multilevel Monte Carlo: the soft Bellman operator admits Monte-Carlo-based algorithms whose complexity is independent of the state-action space cardinality, with polynomial sample complexity in unbiased MLMC settings (Meunier et al., 27 Mar 2025).
6. Quantitative Error and Robustness Bounds
Sharp value and policy error bounds are derived for regularized MDP operators under both exact maximization and function approximation:
- The regularization error in value is bounded as:

$$\|V^*_\Omega - V^*\|_\infty \le \frac{\Omega_{\max}}{1 - \gamma},$$

for a regularizer bounded in magnitude by $\Omega_{\max}$ (e.g., $\Omega_{\max} = \tau \log |A|$ for the entropy regularizer with temperature $\tau$) (Li et al., 2019).
- For robust regularized settings (twice-regularized R²-MDPs), the contraction and monotonicity properties of the corresponding Bellman operators guarantee unique optimizers and geometric convergence, provided uncertainty radii are small enough (Derman et al., 2023, Derman et al., 2021).
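The value-error bound can be checked numerically for the entropy regularizer, where $\Omega_{\max} = \tau \log |A|$: running hard and soft value iteration on the same MDP, the gap between the two fixed points stays within $\tau \log |A| / (1 - \gamma)$. The random toy MDP below is illustrative:

```python
import numpy as np

# Numerical check of the value-error bound for the entropy regularizer:
# the entropy bonus lies in [0, tau * log A], so the gap between the
# soft- and hard-optimal values is at most tau * log(A) / (1 - gamma).
rng = np.random.default_rng(0)
S, A, gamma, tau = 5, 4, 0.9, 0.05
r = rng.random((S, A))
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)

V, Vs = np.zeros(S), np.zeros(S)
for _ in range(2000):
    Q = r + gamma * P @ V
    V = Q.max(axis=1)                      # hard Bellman backup
    Qs = r + gamma * P @ Vs
    m = Qs.max(axis=1)
    Vs = m + tau * np.log(np.exp((Qs - m[:, None]) / tau).sum(axis=1))  # soft backup

bound = tau * np.log(A) / (1 - gamma)
gap = np.max(np.abs(Vs - V))
assert 0.0 <= gap <= bound
```

The soft value dominates the hard value pointwise here because log-sum-exp upper-bounds the max, so the gap is nonnegative and the bound is tight up to the discounted entropy actually collected.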
7. Practical Trade-offs, Limitations, and Extensions
Regularized MDPs draw on the rich convex-duality machinery of Legendre–Fenchel transforms, Bregman divergences, and mirror descent. They enable smooth policy evolution, reduce the need for inner min-max optimization (as in robust MDPs with rectangular uncertainty), and extend to large-scale settings via function approximation and scalable RL algorithms. Limitations include the need to bound the size of value-dependent regularizer terms to preserve contraction, and the challenge of extending the theory to nonconvex or highly nonlinear function-approximation settings. The twice-regularized framework unifies policy regularization, value regularization, and robust control, providing tractable, scalable, and sample-efficient algorithms for robust and generalizable RL (Derman et al., 2023, Derman et al., 2021, Gu et al., 16 Oct 2025).
For comprehensive treatments and algorithmic schemes, see "A Theory of Regularized Markov Decision Processes" (Geist et al., 2019), "Twice regularized MDPs and the equivalence between robustness and regularization" (Derman et al., 2021), "Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization" (Gadot et al., 2023), and related works (Derman et al., 2023, Gu et al., 16 Oct 2025, Kumar et al., 2022, Li et al., 2019, Thodoroff et al., 2018).