
Continuous-Time Policy Improvement Algorithm

Updated 4 October 2025
  • Policy Improvement Algorithm (PIA) is a recursive method that optimizes control policies in continuous time by iteratively evaluating the value function and performing local improvements.
  • The algorithm leverages a weak stochastic control framework, ensuring well-posedness and convergence through deterministic analytic conditions and diffusion process modeling.
  • PIA effectively addresses continuous-time HJB equations in both infinite and finite horizon settings, offering practical insights for reinforcement learning and controlled diffusions.

The Policy Improvement Algorithm (PIA) is a recursive method for synthesizing optimal or near-optimal control policies in continuous-time stochastic control problems. Its core principle is to generate a sequence of policies by alternately evaluating the value function of the current policy and performing a pointwise local improvement. The framework developed in "On the policy improvement algorithm in continuous time" (Jacka et al., 2015) rigorously extends PIA beyond classical discrete-time, finite-state Markov decision process settings to the broad regime of continuous-time processes, controlled diffusions, and the weak formulations central to stochastic analysis. This extension delivers a robust theoretical and practical foundation for policy improvement in a continuous-time setting, establishing general deterministic analytic conditions for well-posedness and convergence, and highlighting the need for, and implications of, weak stochastic control.

1. Mathematical Framework and Characterization

Let $S$ denote the state space and $A$ a compact metric space of control actions. The controlled process $(X_t)_{t \ge 0}$ evolves with a generator $\mathcal{L}^a$ depending on $a \in A$. Given a running cost $f(x,a)$, terminal cost $g$, and domain $D \subseteq S$, the value function is defined via the weak formulation as

$$V(x) = \sup_{I \in A_x} \mathbb{E}\left[ \int_0^T f(X_t, I_t)\,dt + g(X_T)\,1_{\{T < \infty\}} \right], \tag{2.1}$$

where $T$ is the exit time from $D$, and $A_x$ denotes the set of admissible control processes (the probability space is allowed to depend on $I$).
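
To make the payoff (2.1) concrete, the sketch below estimates $V^T(x)$ for a fixed Markov policy by Monte Carlo on a toy one-dimensional controlled diffusion $dX_t = I_t\,dt + dW_t$ with $D = (-1, 1)$. The dynamics, cost functions, and time truncation are illustrative assumptions, not the paper's model.

```python
import math
import random

def payoff_estimate(policy, x0, n_paths=500, dt=0.01, t_max=2.0, seed=0):
    """Euler-Maruyama estimate of E[int_0^T f(X_t, I_t) dt + g(X_T) 1{T<inf}]
    for dX_t = I_t dt + dW_t, where T is the exit time from D = (-1, 1).
    Paths still inside D at t_max approximate the {T = infinity} event."""
    rng = random.Random(seed)
    f = lambda x, a: -(x * x + 0.1 * a * a)   # running cost (assumed)
    g = lambda x: 0.0                         # terminal cost (assumed)
    total = 0.0
    for _ in range(n_paths):
        x, t, acc = x0, 0.0, 0.0
        while abs(x) < 1.0 and t < t_max:
            a = policy(x)
            acc += f(x, a) * dt               # accumulate running cost
            x += a * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
            t += dt
        if abs(x) >= 1.0:                     # exited D: pay terminal cost
            acc += g(x)
        total += acc
    return total / n_paths

# A bang-bang Markov policy drifting toward the origin.
v_hat = payoff_estimate(lambda x: -1.0 if x > 0 else 1.0, x0=0.0)
```

Since the assumed running cost is nonpositive, the estimate is necessarily nonpositive; better policies trade running cost against the exit time.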

A Markov policy $T : S \to A$ prescribes $I_t = T(X_t)$; not all such $T$ are improvable. In the context of PIA, one restricts to those $T$ for which the corresponding payoff $V^T$ lies in a regularity class $\mathcal{C}$ of sufficiently smooth functions. The set $\mathcal{I}$ of improvable policies is the search space for PIA.

The fundamental iterative step is: given any improvable Markov policy $T \in \mathcal{I}$,

$$T'(x) \in \arg\max_{a \in A} \Big( [\mathcal{L}^a V^T](x) + f(x,a) \Big). \tag{Def. 1}$$

This maximization is pointwise in $x$; existence of maximizers and regularity of $T'$ are ensured by continuity and compactness assumptions on $A$ and smoothness of the map $(x,a) \mapsto \mathcal{L}^a \phi(x) + f(x,a)$. The improved policy $T'$ is again Markov and, under suitable conditions, improvable.

PIA then defines a sequence $(T_n)_{n \in \mathbb{N}}$, with $T_{n+1}$ the improvement of $T_n$ and $V^{(T_n)}$ the associated value function.
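
Schematically, the recursion alternates a policy-evaluation call with a pointwise Hamiltonian maximization. The sketch below fixes a finite action set and state grid and treats the evaluation step and the discrete Hamiltonian $a \mapsto \mathcal{L}^a V(x) + f(x, a)$ as problem-specific callables; all names here are illustrative placeholders, not the paper's notation.

```python
def policy_improvement(evaluate, hamiltonian, actions, states, policy, n_iter=20):
    """Run the PIA recursion: T_{n+1}(x) maximizes the Hamiltonian built
    from V^{(T_n)}. Returns the sequence of (policy, value) pairs."""
    history = []
    for _ in range(n_iter):
        value = evaluate(policy)                  # policy-evaluation step
        policy = {x: max(actions, key=lambda a: hamiltonian(x, a, value))
                  for x in states}                # pointwise improvement
        history.append((dict(policy), value))
    return history

# Toy check: one state, Hamiltonian -(a - 0.5)^2, A = {0, 0.25, ..., 1};
# the improvement step should lock onto a = 0.5 immediately.
actions = [i / 4 for i in range(5)]
hist = policy_improvement(
    evaluate=lambda pol: {0: -(pol[0] - 0.5) ** 2},
    hamiltonian=lambda x, a, V: -(a - 0.5) ** 2,
    actions=actions, states=[0], policy={0: 0.0}, n_iter=3)
final_policy, final_value = hist[-1]
```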

2. Convergence and Monotonic Improvement

Under deterministic analytic assumptions (joint continuity of the Hamiltonian, compactness of AA, domain regularity, and regularity-preserving properties of the semigroup), rigorous monotonicity and convergence of the PIA are established. The main claims are:

  • Monotonicity: For all $x \in S$ and $n \in \mathbb{N}$,

$$V^{(T_{n+1})}(x) \ge V^{(T_n)}(x).$$

  • Pointwise Convergence: Under further regularity and compactness (Assumptions (As3)–(As5)), for all $x \in S$,

$$\lim_{n \to \infty} V^{(T_n)}(x) = V(x).$$

  • Policy Convergence: If (As6)–(As8) hold, there exists a subsequence $(T_{n_k})$ converging uniformly on compacts to an optimal Markov policy $T^*$, with $V^{(T^*)} = V$.

A crucial technical requirement is the uniformly vanishing residual

$$\lim_{k \to \infty} \left[ \mathcal{L}^{T_{n_k+1}} V^{(T_{n_k})}(x) + f(x, T_{n_k+1}(x)) \right] = 0 \quad \text{uniformly in } x \in D,$$

which ensures that improvements in the policy correspond to true Hamiltonian maximization and that the limiting policy satisfies the optimality PDE.
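
In a numerical implementation, this condition suggests a natural stopping rule: track the sup-norm of the pointwise residual over the computational grid and halt once it is uniformly small. A minimal sketch (the tolerance is an assumption):

```python
def residual_supnorm(residuals):
    """Sup-norm of the pointwise Hamiltonian residuals
    L^{T_{n+1}} V^{(T_n)}(x) + f(x, T_{n+1}(x)) over a finite grid."""
    return max(abs(r) for r in residuals)

def converged(residuals, tol=1e-6):
    """True once the residual vanishes uniformly (up to tol) on the grid."""
    return residual_supnorm(residuals) < tol
```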

3. Necessity and Role of the Weak Formulation

The work demonstrates that the natural setting for continuous-time PIA is the weak stochastic control framework. In strong formulations, admissible controls and solution processes must be constructed on a fixed probability space, which is too restrictive in many continuous-time scenarios. For example, the candidate optimal control $T(x) = \operatorname{sgn}(x)$ can lead to SDEs that admit no strong solution because pathwise uniqueness fails.

The weak formulation allows the control and the solution process to be defined on varying stochastic bases, enabling the construction of optimal controls for models that lack strong solutions. This is exhibited via detailed examples in which the control set is either finite ($A = \{-1, 1\}$) or continuous ($A = [-1, 1]$) and the dynamics are

$$dX_t = a\,dV_t,$$

where the weak-solution machinery is required to correctly account for the joint law of $(X, a)$. Thus, PIA in continuous time relies fundamentally on weak solutions and the broad admissibility class they provide.
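
The bang-bang case can be simulated directly. The sketch below applies Euler–Maruyama to $dX_t = \operatorname{sgn}(X_t)\,dV_t$; the step size and the convention $\operatorname{sgn}(0) := 1$ are assumptions. Because $a^2 = 1$, the quadratic variation of $X$ matches that of $V$, a law-level property that holds however the sign at zero is broken, consistent with the fact that only the joint law of $(X, a)$, not a pathwise solution on a fixed space, is determined.

```python
import math
import random

def simulate(x0=0.0, dt=1e-3, n_steps=5000, seed=1):
    """Euler-Maruyama path of dX_t = sgn(X_t) dV_t under T(x) = sgn(x)."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(n_steps):
        a = 1.0 if x >= 0 else -1.0           # sgn(0) := 1 (assumed convention)
        x += a * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

path = simulate()
# With a^2 = 1, the realized quadratic variation is close to n_steps * dt.
qv = sum((b - a) ** 2 for a, b in zip(path, path[1:]))
```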

4. Diffusion-Type Examples and PDE Structure

The algorithm is particularly tractable in diffusion-controlled systems. Two canonical cases are provided:

  1. Discounted Infinite-Horizon Diffusions: With $D = \mathbb{R}^d$ and the generator

$$\mathcal{L}^a \varphi(x) = \tfrac{1}{2}\,\sigma(x,a)^\top H\varphi(x)\,\sigma(x,a) + p(x,a)^\top \nabla \varphi(x) - c(x,a)\,\varphi(x)$$

(with diffusion coefficient $\sigma(x,a)$, drift $p(x,a)$, Hessian $H\varphi$, and discount rate $c(x,a)$),

and under assumptions ensuring uniform ellipticity, Lipschitz dynamics, and compactness of $A$, all PIA hypotheses are satisfied and convergence follows. The iterative improvement step corresponds to pointwise maximization of the generalized Hamiltonian.

  2. Finite-Horizon Diffusions: The state space extends to $(x, t)$. Given suitable regularity (Lipschitz coefficients, bounded terminal cost), PIA remains applicable by the same analytical reasoning.

In both settings, the iterative step

$$T_{n+1}(x) \in \arg\max_{a \in A} \left\{ \mathcal{L}^a V^{(T_n)}(x) + f(x,a) \right\}$$

can be interpreted as iteratively approximating the solution of the Hamilton–Jacobi–Bellman (HJB) equation, using smoothness properties of $\mathcal{L}^a$ and $f(x,a)$.
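
For the discounted infinite-horizon case, the whole loop can be sketched with finite differences on a one-dimensional grid: policy evaluation solves the linear system $\mathcal{L}^T V + f(\cdot, T(\cdot)) = 0$ with zero boundary data, and improvement maximizes the discrete Hamiltonian pointwise. The model data here (unit diffusion, drift equal to the control, discount rate $1$, quadratic costs, $A = \{-1, 0, 1\}$, $D = (-1, 1)$) are illustrative assumptions, not the paper's example.

```python
N, h, c = 99, 2.0 / 100, 1.0                  # interior points, mesh, discount
xs = [-1.0 + (i + 1) * h for i in range(N)]   # grid on D = (-1, 1)
A = [-1.0, 0.0, 1.0]                          # finite action set (assumed)
f = lambda x, a: -x * x - 0.1 * a * a         # running cost (assumed)

def solve_tridiag(lo, dg, up, rhs):
    """Thomas algorithm for a tridiagonal linear system."""
    dg, rhs = dg[:], rhs[:]
    for i in range(1, len(dg)):
        w = lo[i] / dg[i - 1]
        dg[i] -= w * up[i - 1]
        rhs[i] -= w * rhs[i - 1]
    v = [0.0] * len(dg)
    v[-1] = rhs[-1] / dg[-1]
    for i in range(len(dg) - 2, -1, -1):
        v[i] = (rhs[i] - up[i] * v[i + 1]) / dg[i]
    return v

def evaluate(policy):
    """Solve 0.5 V'' + a V' - c V + f = 0 with V = 0 at the boundary."""
    lo = [0.0] + [0.5 / h**2 - policy[i] / (2 * h) for i in range(1, N)]
    dg = [-1.0 / h**2 - c] * N
    up = [0.5 / h**2 + policy[i] / (2 * h) for i in range(N - 1)] + [0.0]
    rhs = [-f(xs[i], policy[i]) for i in range(N)]
    return solve_tridiag(lo, dg, up, rhs)

def hamiltonian(i, a, V):
    """Discrete L^a V(x_i) + f(x_i, a), with zero boundary values."""
    vm = V[i - 1] if i > 0 else 0.0
    vp = V[i + 1] if i < N - 1 else 0.0
    return (0.5 * (vp - 2 * V[i] + vm) / h**2
            + a * (vp - vm) / (2 * h) - c * V[i] + f(xs[i], a))

policy = [0.0] * N                            # initial Markov policy T_0 = 0
values = []
for _ in range(10):                           # PIA iterations
    V = evaluate(policy)
    values.append(V)
    policy = [max(A, key=lambda a: hamiltonian(i, a, V)) for i in range(N)]
```

On this discrete scheme the payoffs improve monotonically in $n$, mirroring the monotonicity result of Section 2.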

5. Algorithmic Properties, Residuals, and Optimality

The analytic machinery ensures that the residual

$$\mathcal{L}^{T_{n+1}} V^{(T_n)}(x) + f(x, T_{n+1}(x))$$

shrinks uniformly along a convergent subsequence, and that the limiting policy $T^*$ is a pointwise maximizer:

$$T^*(x) \in \arg\max_{a \in A} \left\{ \mathcal{L}^a V(x) + f(x,a) \right\}.$$

If the process is sufficiently regular, $V$ solves the HJB PDE and $T^*$ is optimal in the sense that $V^{(T^*)} = V$.

This architecture ensures that the algorithm recovers the optimal value function and a corresponding Markov control policy, provided the initial policy and system data satisfy the regularity and compactness assumptions.

6. Broader Impact, Limitations, and Significance

The paper demonstrates that the continuous-time PIA, under deterministic analytic conditions on the data (regularity, compactness, uniform ellipticity), is

  • Well-posed: The algorithm is defined without time discretization or indirect limit arguments.
  • Convergent: The value/payoff sequence converges monotonically to the optimal value function; policies converge to an optimal Markov policy on compact subsets.
  • General: The framework accommodates controlled diffusions, potentially degenerate, over general domains and with general running/terminal costs.

The emphasis on the weak formulation generalizes the class of admissible controls, permitting solutions in scenarios where the strong formulation breaks down (notably when natural optimal policies cannot be realized via solutions to strongly posed SDEs).

A further strength is the compatibility with PDE methods: PIA can be interpreted as a dynamic programming method that produces classical solutions of HJB equations when smoothness permits, aligning with contemporary perspectives on verification theorems and viscosity-solution theory.

The extension of PIA to continuous-time, weakly-controlled, and diffusion-based problems bridges the theory of dynamic programming, stochastic process analysis, and PDE methods for control. This framework provides a robust theoretical underpinning for practical policy improvement implementations in continuous domains, supplies tools for verification by linking iterated policy improvement to convergence of residuals in HJB-type PDEs, and presents analytic conditions that can be verified using the deterministic structure of the model.

The methodology’s flexibility is evidenced by its applicability to problems outside classical frameworks (e.g., those with singularities in SDEs, boundary exit phenomena, general reward structures) and its relevance for modern computational and reinforcement learning settings in high-dimensional, continuous-time domains. The PIA as developed in (Jacka et al., 2015) forms a cornerstone for further research on continuous-time reinforcement learning and the theory of controlled diffusions.
