
Regularized Markov Decision Processes

Updated 28 January 2026
  • Regularized MDPs are Markov Decision Processes enhanced with a penalty term to enforce smoothness, sparsity, and robustness in policy learning.
  • They leverage penalties such as entropy bonuses, KL divergences, Lₚ/L₁ norms, or visitation-frequency regularizers to improve sample efficiency and manage uncertainty in reinforcement learning.
  • Algorithmic schemes, including regularized policy iteration and actor–critic methods, offer convergence guarantees and tractable surrogates for robust dynamic programming.

A Regularized Markov Decision Process (MDP) augments the standard MDP objective with a penalty that encourages desirable policy or value-function properties, such as smoothness, sparsity, robustness to misspecification, or improved sample efficiency. The regularization term can take various forms (entropy bonuses, relative-entropy (KL) penalties, Lₚ or L₁ norm penalties, or temporal and visitation-frequency regularizers), and these play a central role in modern reinforcement learning, robust control, and approximate dynamic programming. Recent research demonstrates that regularization is intricately linked to the theory and practice of robust MDPs: certain robust min-max formulations are exactly equivalent to regularized MDPs with specific penalties. Regularization thus systematically unifies classical techniques from convex optimization, robust control, mirror descent, and deep RL.

1. Formal Framework and Bellman Operators

The regularized MDP augments the usual discounted-return objective with a convex penalty functional over policies:

$$J_{\Omega}(\pi) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t \big( r(s_t,a_t) - \Omega(\pi(\cdot|s_t)) \big) \right]$$

where $\Omega(\cdot)$ is typically strongly convex on the action simplex (e.g., negative entropy, squared $\ell_2$ norm, or other Bregman generators) (Geist et al., 2019).

The regularized Bellman evaluation and optimality operators are, for value $v$ and policy $\pi$:

$$[T_{\pi,\Omega}v](s) = \sum_{a}\pi(a|s)\,[r(s,a) + \gamma\,\mathbb{E}_{s'}v(s')] - \Omega(\pi(\cdot|s))$$

$$[T_{*,\Omega}v](s) = \max_{\pi(\cdot|s)\in\Delta_A} \left\{ \sum_{a}\pi(a|s)\,[r(s,a) + \gamma\,\mathbb{E}_{s'}v(s')] - \Omega(\pi(\cdot|s)) \right\}$$

with the maximum characterized via convex conjugacy:

$$[T_{*,\Omega}v](s) = \Omega^*(q_v(s,\cdot)), \quad \text{where } q_v(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}v(s')$$

The greedy policy at $v$ is given by the gradient mapping $\nabla \Omega^*(q_v(s,\cdot))$ (Geist et al., 2019, Li et al., 2019).
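As an illustration, for the scaled negative-entropy regularizer $\Omega(\pi) = \tau \sum_a \pi(a)\log\pi(a)$ the conjugate is a scaled log-sum-exp and the greedy map is a softmax. A minimal tabular sketch (array shapes and the temperature $\tau$ are illustrative assumptions, not taken from the cited papers):

```python
import numpy as np

def soft_bellman_optimality(v, R, P, gamma=0.9, tau=0.1):
    """One application of T_{*,Omega} with Omega = tau * negative entropy.

    R: (S, A) rewards; P: (S, A, S) transition kernel; v: (S,) values.
    The conjugate of the scaled negative entropy is tau * logsumexp(q / tau),
    and the greedy policy is its gradient, softmax(q / tau).
    """
    q = R + gamma * P @ v                      # q_v(s, a) = r + gamma * E[v(s')]
    # Omega*(q_v(s, .)) computed stably per state
    m = q.max(axis=1, keepdims=True)
    new_v = (m + tau * np.log(np.exp((q - m) / tau).sum(axis=1, keepdims=True))).squeeze(1)
    pi = np.exp((q - new_v[:, None]) / tau)    # greedy policy = grad Omega*(q_v)
    return new_v, pi
```

As $\tau \to 0$ the log-sum-exp collapses to the ordinary max and the greedy policy becomes deterministic, recovering the unregularized operator.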

2. Equivalence between Robustness and Regularization

Robust MDPs formalize environment uncertainty via max-min objectives over adversarial reward and/or transition perturbations, defined by sets such as global or rectangular $L_p$-balls. Foundational results establish:

$$\max_\pi \min_{r:\|r-r_0\|_p\leq \alpha} \mathbb{E}_\pi \sum_t \gamma^t r(s_t,a_t) = \max_\pi \mathbb{E}_\pi \sum_t \gamma^t r_0(s_t,a_t) - \alpha\|d^\pi\|_q$$

where $d^\pi$ is the policy occupancy measure and $q$ is the Hölder conjugate of $p$ (Gadot et al., 2023).
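In the tabular case this equivalence can be evaluated directly: compute the discounted occupancy measure $d^\pi$ in closed form, then the regularized objective $\langle d^\pi, r_0\rangle - \alpha\|d^\pi\|_q$. A hedged sketch (shapes, the initial distribution `mu0`, and the default $p=2$ are illustrative assumptions):

```python
import numpy as np

def occupancy_measure(pi, P, mu0, gamma=0.9):
    """Discounted state-action occupancy d^pi(s, a) = sum_t gamma^t Pr(s_t=s, a_t=a).

    Solves the linear system d_s = mu0 + gamma * P_pi^T d_s for the state
    occupancy, then splits it across actions with pi(a|s).
    pi: (S, A), P: (S, A, S), mu0: (S,) initial state distribution.
    """
    S, _ = pi.shape
    P_pi = np.einsum('sa,sat->st', pi, P)           # state-to-state kernel under pi
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)
    return d_s[:, None] * pi                        # d(s, a) = d_s(s) * pi(a|s)

def reward_robust_return(pi, P, mu0, r0, alpha, p=2.0, gamma=0.9):
    """Right-hand side of the equivalence: <d^pi, r0> - alpha * ||d^pi||_q."""
    d = occupancy_measure(pi, P, mu0, gamma)
    q = p / (p - 1.0)                               # Hölder conjugate of p
    return (d * r0).sum() - alpha * np.linalg.norm(d.ravel(), ord=q)
```

A quick sanity check is that the occupancy measure sums to $1/(1-\gamma)$ whenever `mu0` is a probability distribution.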

  • Transition-uncertainty equivalence: for transition uncertainty, the equivalence introduces a value-dependent (i.e., non-stationary) regularization term:

$$\max_\pi \min_{P \in \mathcal{P}} \mathbb{E}^{\pi,P}\left[\cdots\right] = \max_\pi \mathbb{E}^{\pi}\left[\cdots\right] - \text{(value-dependent regularizer)}$$

leading to the "twice-regularized" MDP (R²-MDP) class (Derman et al., 2021, Derman et al., 2023).

3. Algorithmic Schemes and Convergence Guarantees

Modified Policy Iteration and Mirror Descent

The reg-MPI scheme alternates (a) a regularized greedy step, solving a convex problem, and (b) partial regularized evaluation using the policy evaluation operator (Geist et al., 2019, Neu et al., 2017):

$$\text{(i) } \pi_{k+1} = \arg\max_{\pi}\, \langle \pi, q_{v_k} \rangle - \Omega(\pi) \qquad \text{(ii) } v_{k+1} = (T_{\pi_{k+1},\Omega})^m v_k$$

Under strong convexity, all the relevant operators are $\gamma$-contractions. General error-propagation bounds are established for inaccuracies in both the greedy and evaluation steps, yielding $O(1/(1-\gamma))$ error-amplification factors (Geist et al., 2019).
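For the entropy regularizer the greedy step (i) has a closed-form softmax solution, which makes the scheme easy to sketch in the tabular case (the temperature, iteration counts, and array shapes below are illustrative assumptions):

```python
import numpy as np

def reg_mpi(R, P, gamma=0.9, tau=0.1, m=5, iters=200):
    """Entropy-regularized modified policy iteration (tabular sketch).

    Alternates (i) the regularized greedy step, which has the closed-form
    softmax solution for the entropy regularizer, and (ii) m applications
    of the regularized evaluation operator T_{pi, Omega}.
    R: (S, A) rewards; P: (S, A, S) transition kernel.
    """
    v = np.zeros(R.shape[0])
    for _ in range(iters):
        q = R + gamma * P @ v
        # (i) regularized greedy: argmax_pi <pi, q> + tau * H(pi) -> softmax(q / tau)
        z = np.exp((q - q.max(axis=1, keepdims=True)) / tau)
        pi = z / z.sum(axis=1, keepdims=True)
        ent = -(pi * np.log(pi + 1e-12)).sum(axis=1)
        # (ii) m steps of partial regularized evaluation with T_{pi, Omega}
        for _ in range(m):
            q = R + gamma * P @ v
            v = (pi * q).sum(axis=1) + tau * ent
    return v, pi
```

Setting `m=1` recovers regularized value iteration and letting `m` grow recovers regularized policy iteration, mirroring the unregularized MPI family.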

Policy Gradients and Actor–Critic for Regularized/RMDP

For smooth regularizers and a differentiable policy class (or occupancy measure), policy-gradient and actor–critic implementations become viable:

  • For frequency-regularized robust MDPs with a visitation-norm regularizer, an actor–critic-style policy gradient method is globally convergent, reaching an ε-optimal policy in O(1/ε) iterations under smoothness of the regularized return (Gadot et al., 2023).
  • For general regularizers, off-policy actor–critic and entropy-regularized Q-learning with function approximation enjoy provable stationary-point and performance guarantees under two-timescale schemes and strong-convexity conditions (Xi et al., 2024, Li et al., 2019).
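As a concrete, checkable stand-in for such methods, the entropy-regularized return of a tabular softmax policy can be computed exactly (no sampling) and ascended with a finite-difference gradient in place of an actor–critic estimate; the shapes, temperature, and step size below are illustrative assumptions, not values from the cited papers:

```python
import numpy as np

def entropy_regularized_return(theta, R, P, mu0, gamma=0.9, tau=0.1):
    """Exact J_Omega(pi_theta) for a tabular softmax policy pi_theta."""
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,sat->st', pi, P)                       # kernel under pi
    # per-state regularized reward: E_pi[r] + tau * entropy(pi)
    r_pi = (pi * R).sum(axis=1) - tau * (pi * np.log(pi)).sum(axis=1)
    v = np.linalg.solve(np.eye(len(mu0)) - gamma * P_pi, r_pi)  # regularized values
    return mu0 @ v

def ascend_step(theta, R, P, mu0, lr=0.05, eps=1e-5):
    """One finite-difference gradient-ascent step on J_Omega (a crude
    stand-in for the actor update of an actor-critic method)."""
    base = entropy_regularized_return(theta, R, P, mu0)
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        bumped = theta.copy()
        bumped[idx] += eps
        grad[idx] = (entropy_regularized_return(bumped, R, P, mu0) - base) / eps
    return theta + lr * grad
```

Because the regularized return is smooth in the softmax parameters, each small ascent step improves the objective, which is the property the cited convergence analyses exploit.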

Temporal, Spatial, and Reference Policy Regularization

  • Temporal regularization, which mixes forward and backward Markov transitions in Bellman backups, incurs bounded bias while providing significant variance reduction, without impairing the contraction of the Bellman operator; it is useful for high-dimensional or noisy environments (Thodoroff et al., 2018).
  • Bayesian regularization using explicit $L_1$ or KL priors, or entropy bonuses, systematically shrinks learned policies toward prior knowledge, reducing the generalization error due to model-estimation noise (Gupta et al., 2022).

4. Special Regularizers: Sparsity, Policy Structure, and Robustness

Sparse and Threshold Policies

  • Regularization via Tsallis, polynomial, or trigonometric functions offers direct control of sparsity, multi-modality, and support size in optimal policies:

$$J_\lambda(\pi) = \mathbb{E}^\pi \sum_{t=0}^\infty \gamma^t \left[ r(s_t,a_t) + \lambda\,\phi(\pi(a_t|s_t)) \right]$$

with $\phi$ designed so that the resulting policy at each state has support controlled by KKT-derived thresholding, interpolating between deterministic, sparse, and fully supported policies (Li et al., 2019).
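For the Tsallis ($q = 2$) case, the KKT thresholding reduces to the well-known sparsemax map, i.e., Euclidean projection onto the probability simplex. A minimal sketch:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.

    This is the greedy policy under Tsallis (q=2) regularization: the KKT
    threshold tau keeps only actions with z_a > tau, so low-scoring actions
    receive exactly zero probability (sparse support).
    """
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cssv = np.cumsum(z_sorted)
    support = 1.0 + k * z_sorted > cssv        # actions kept by the KKT conditions
    k_max = k[support][-1]                     # size of the support
    tau = (cssv[k_max - 1] - 1.0) / k_max      # KKT-derived threshold
    return np.maximum(z - tau, 0.0)
```

For widely separated scores the output is fully deterministic, while nearly tied scores keep several actions in the support, illustrating the interpolation between deterministic and stochastic policies described above.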

Frequency-Regularization and Robustness

  • For global $L_p$ reward uncertainty, policy regularization by occupancy-measure norms, i.e., $\|d^\pi\|_q$, produces less conservative, globally coupled robustness relative to rectangular (per-state-action) uncertainty sets. This directly reduces the conservatism of robust policies and provides a theoretical justification for frequency regularization as used in robust inverse RL (Gadot et al., 2023).

Policy-Regularized Distributionally Robust MDPs

  • Policy regularization combined with adversarial transition uncertainty (e.g., via reference-KL or TV-ball sets under linear approximation) enables provably efficient online robust RL algorithms (e.g., DR-RPO) with regret bounds that match value-based methods in $d$-rectangular linear MDPs (Gu et al., 16 Oct 2025).

5. Applications: Sample Efficiency, Generalization, and Computational Complexity

  • Regularized approximate linear programming (e.g., $L_1$-norm RALP/RALPc) is critical for feature selection and generalization: it prevents overfitting when the candidate basis is large, with homotopy routines yielding efficient solution paths (Petrik et al., 2010).
  • In entropy-regularized and mirror-descent-based RL (TRPO, DPP, RAVI-UCB), regularization enables mirror-prox and dual-averaging analyses, offering convergence guarantees and sample-complexity bounds in tabular and linear settings (Geist et al., 2019, Moulin et al., 2023, Neu et al., 2017).
  • For continuous or high-cardinality state-action spaces, dimension-free learning is achieved by combining entropy regularization with multilevel Monte Carlo: the soft Bellman operator admits MC-based algorithms whose complexity is independent of $|\mathcal{A}|$, with polynomial sample complexity in unbiased MLMC settings (Meunier et al., 27 Mar 2025).

6. Quantitative Error and Robustness Bounds

Sharp value and policy error bounds are derived for regularized MDP operators under both exact maximization and function approximation:

  • The regularization error in value is bounded as:

$$\|V^*_\lambda - V^*\|_\infty \le \frac{\lambda}{1-\gamma}\, \phi\!\left(\frac{1}{|\mathcal{A}|}\right)$$

for regularizer $\phi$ (Li et al., 2019).
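For the entropy regularizer $\phi(x) = -\log x$, the bound becomes the familiar $\lambda \log|\mathcal{A}|/(1-\gamma)$, and it can be checked numerically on a random tabular MDP (the sizes, temperature, and iteration counts below are illustrative assumptions):

```python
import numpy as np

def value_iteration(R, P, gamma, tau=0.0, iters=1000):
    """Tabular VI; tau > 0 uses the entropy-regularized (soft) Bellman operator."""
    v = np.zeros(R.shape[0])
    for _ in range(iters):
        q = R + gamma * P @ v
        if tau > 0:
            m = q.max(axis=1)
            v = m + tau * np.log(np.exp((q - m[:, None]) / tau).sum(axis=1))
        else:
            v = q.max(axis=1)
    return v

rng = np.random.default_rng(1)
S, A, gamma, tau = 4, 3, 0.9, 0.05
R = rng.normal(size=(S, A))
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
# gap between regularized and unregularized optimal values
gap = np.abs(value_iteration(R, P, gamma, tau) - value_iteration(R, P, gamma)).max()
bound = tau * np.log(A) / (1.0 - gamma)   # phi(1/|A|) = log|A| for entropy
# the observed gap should lie within the theoretical bound
```

The check works because each application of the soft operator exceeds the hard operator by at most $\tau \log|\mathcal{A}|$, and the geometric series of these per-step gaps yields the $1/(1-\gamma)$ factor.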

  • For robust regularized settings (twice-regularized R²-MDPs), the contraction and monotonicity properties of the corresponding Bellman operators guarantee unique optimizers and geometric convergence, provided uncertainty radii are small enough (Derman et al., 2023, Derman et al., 2021).

7. Practical Trade-offs, Limitations, and Extensions

Regularized MDPs draw on the rich convex-duality machinery of Legendre–Fenchel transforms, Bregman divergences, and mirror descent. They enable smooth policy evolution, reduce the need for inner min-max optimization (as in robust MDPs with rectangular uncertainty), and extend to large-scale settings via function approximation and scalable RL algorithms. Limitations include the need to bound the size of value-dependent regularizer terms to preserve contraction, and the challenge of extending the theory to nonconvex or highly nonlinear function-approximation settings. The twice-regularized framework unifies policy and value regularization with robust control, providing tractable, scalable, and sample-efficient algorithms for robust and generalizable RL (Derman et al., 2023, Derman et al., 2021, Gu et al., 16 Oct 2025).


For comprehensive treatments and algorithmic schemes, see "A Theory of Regularized Markov Decision Processes" (Geist et al., 2019), "Twice regularized MDPs and the equivalence between robustness and regularization" (Derman et al., 2021), "Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization" (Gadot et al., 2023), and related works (Derman et al., 2023, Gu et al., 16 Oct 2025, Kumar et al., 2022, Li et al., 2019, Thodoroff et al., 2018).
