
Continuous-Time Policy Improvement Algorithm

Updated 4 October 2025
  • Policy Improvement Algorithm (PIA) is a recursive method that optimizes control policies in continuous time by iteratively evaluating the value function and performing local improvements.
  • The algorithm leverages a weak stochastic control framework, ensuring well-posedness and convergence through deterministic analytic conditions and diffusion process modeling.
  • PIA effectively addresses continuous-time HJB equations in both infinite and finite horizon settings, offering practical insights for reinforcement learning and controlled diffusions.

The Policy Improvement Algorithm (PIA) is a recursive method for synthesizing optimal or near-optimal control policies in continuous-time stochastic control problems. Its core principle is to generate a sequence of policies by alternately evaluating the value function of the current policy and performing a pointwise local improvement. The framework developed in "On the policy improvement algorithm in continuous time" (Jacka et al., 2015) rigorously extends PIA beyond classical discrete-time, finite-state Markov decision process settings to the broad regime of continuous-time processes, controlled diffusions, and the weak formulations central to stochastic analysis. This extension delivers a robust theoretical and practical foundation for policy improvement in a continuous-time setting, establishing general deterministic analytic conditions for well-posedness and convergence, and highlighting the need for, and implications of, weak stochastic control.

1. Mathematical Framework and Characterization

Let $S$ denote the state space and $A$ a compact metric space of control actions. The controlled process $(X_t)_{t \ge 0}$ evolves with a generator $\mathcal{L}^a$ depending on $a \in A$. Given a running cost $f(x,a)$, terminal cost $g$, and domain $D \subseteq S$, the value function is defined via the weak formulation as

$$V(x) = \sup_{I \in A_x} \mathbb{E}\left[ \int_0^T f(X_t, I_t)\,dt + g(X_T)\,1_{\{T < \infty\}} \right], \tag{2.1}$$

where $T$ is the exit time from $D$, and $A_x$ denotes the set of admissible control processes (the probability space is allowed to depend on $I$).
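
To make the payoff (2.1) concrete, the sketch below estimates $V^T(x)$ for a fixed Markov policy by Monte Carlo on a toy one-dimensional controlled diffusion $dX_t = I_t\,dt + dW_t$ with $D = (-1, 1)$. The dynamics, cost functions, and time truncation are illustrative assumptions, not the paper's model.

```python
import math
import random

def payoff_estimate(policy, x0, n_paths=500, dt=0.01, t_max=2.0, seed=0):
    """Euler-Maruyama estimate of E[int_0^T f(X_t, I_t) dt + g(X_T) 1{T<inf}]
    for dX_t = I_t dt + dW_t, where T is the exit time from D = (-1, 1).
    Paths still inside D at t_max approximate the {T = infinity} event."""
    rng = random.Random(seed)
    f = lambda x, a: -(x * x + 0.1 * a * a)   # running cost (assumed)
    g = lambda x: 0.0                         # terminal cost (assumed)
    total = 0.0
    for _ in range(n_paths):
        x, t, acc = x0, 0.0, 0.0
        while abs(x) < 1.0 and t < t_max:
            a = policy(x)
            acc += f(x, a) * dt               # accumulate running cost
            x += a * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
            t += dt
        if abs(x) >= 1.0:                     # exited D: pay terminal cost
            acc += g(x)
        total += acc
    return total / n_paths

# A bang-bang Markov policy drifting toward the origin.
v_hat = payoff_estimate(lambda x: -1.0 if x > 0 else 1.0, x0=0.0)
```

Since the assumed running cost is nonpositive, the estimate is necessarily nonpositive; better policies trade running cost against the exit time.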

A Markov policy $T : S \to A$ prescribes $I_t = T(X_t)$; not all such $T$ are improvable. In the context of PIA, one restricts to those $T$ for which the corresponding payoff $V^T$ lies in a regularity class $\mathcal{C}$ of sufficiently smooth functions. The set $\mathcal{I}$ of improvable policies is the search space for PIA.

The fundamental iterative step is: given any improvable Markov policy $T \in \mathcal{I}$,

$$T'(x) \in \arg\max_{a \in A} \Big( [\mathcal{L}^a V^T](x) + f(x,a) \Big). \tag{Def. 1}$$

This maximization is pointwise in $x$; existence of maximizers and regularity of $T'$ are ensured by continuity and compactness assumptions on $A$ and smoothness of the map $(x,a) \mapsto \mathcal{L}^a \phi(x) + f(x,a)$. The improved policy $T'$ is again Markov and, under suitable conditions, improvable.

PIA then defines a sequence $(T_n)_{n \in \mathbb{N}}$, with $T_{n+1}$ the improvement of $T_n$ and $V^{(T_n)}$ the associated value function.
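
Schematically, the recursion alternates a policy-evaluation call with a pointwise Hamiltonian maximization. The sketch below fixes a finite action set and state grid and treats the evaluation step and the discrete Hamiltonian $a \mapsto \mathcal{L}^a V(x) + f(x, a)$ as problem-specific callables; all names here are illustrative placeholders, not the paper's notation.

```python
def policy_improvement(evaluate, hamiltonian, actions, states, policy, n_iter=20):
    """Run the PIA recursion: T_{n+1}(x) maximizes the Hamiltonian built
    from V^{(T_n)}. Returns the sequence of (policy, value) pairs."""
    history = []
    for _ in range(n_iter):
        value = evaluate(policy)                  # policy-evaluation step
        policy = {x: max(actions, key=lambda a: hamiltonian(x, a, value))
                  for x in states}                # pointwise improvement
        history.append((dict(policy), value))
    return history

# Toy check: one state, Hamiltonian -(a - 0.5)^2, A = {0, 0.25, ..., 1};
# the improvement step should lock onto a = 0.5 immediately.
actions = [i / 4 for i in range(5)]
hist = policy_improvement(
    evaluate=lambda pol: {0: -(pol[0] - 0.5) ** 2},
    hamiltonian=lambda x, a, V: -(a - 0.5) ** 2,
    actions=actions, states=[0], policy={0: 0.0}, n_iter=3)
final_policy, final_value = hist[-1]
```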

2. Convergence and Monotonic Improvement

Under deterministic analytic assumptions (joint continuity of the Hamiltonian, compactness of AA, domain regularity, and regularity-preserving properties of the semigroup), rigorous monotonicity and convergence of the PIA are established. The main claims are:

  • Monotonicity: For all $x \in S$ and $n \in \mathbb{N}$,

$$V^{(T_{n+1})}(x) \ge V^{(T_n)}(x).$$

  • Pointwise Convergence: Under further regularity and compactness (Assumptions (As3)–(As5)), for all $x \in S$,

$$\lim_{n \to \infty} V^{(T_n)}(x) = V(x).$$

  • Policy Convergence: If (As6)–(As8) hold, there exists a subsequence $(T_{n_k})$ converging uniformly on compacts to an optimal Markov policy $T^*$, with $V^{(T^*)} = V$.

A crucial technical requirement is the uniformly vanishing residual

$$\lim_{k \to \infty} \left[ \mathcal{L}^{T_{n_k+1}} V^{(T_{n_k})}(x) + f(x, T_{n_k+1}(x)) \right] = 0 \quad \text{uniformly in } x \in D,$$

which ensures that improvements in the policy correspond to true Hamiltonian maximization and that the limiting policy satisfies the optimality PDE.
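
In a numerical implementation, this condition suggests a natural stopping rule: track the sup-norm of the pointwise residual over the computational grid and halt once it is uniformly small. A minimal sketch (the tolerance is an assumption):

```python
def residual_supnorm(residuals):
    """Sup-norm of the pointwise Hamiltonian residuals
    L^{T_{n+1}} V^{(T_n)}(x) + f(x, T_{n+1}(x)) over a finite grid."""
    return max(abs(r) for r in residuals)

def converged(residuals, tol=1e-6):
    """True once the residual vanishes uniformly (up to tol) on the grid."""
    return residual_supnorm(residuals) < tol
```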

3. Necessity and Role of the Weak Formulation

The work demonstrates that the natural setting for continuous-time PIA is the weak stochastic control framework. In strong formulations, admissible controls and solution processes must be constructed on a fixed probability space, which is too restrictive in many continuous-time scenarios. For example, the candidate optimal control $T(x) = \operatorname{sgn}(x)$ can lead to SDEs that admit no strong solution because pathwise uniqueness fails.

The weak formulation allows the control and the solution process to be defined on varying stochastic bases, enabling the construction of optimal controls for models that lack strong solutions. This is exhibited via detailed examples in which the control set is either finite ($A = \{-1, 1\}$) or continuous ($A = [-1, 1]$) and the dynamics are

$$dX_t = a\,dV_t,$$

where the weak-solution machinery is required to correctly account for the joint law of $(X, a)$. Thus, PIA in continuous time relies fundamentally on weak solutions and the broad admissibility class they provide.
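
The bang-bang case can be simulated directly. The sketch below applies Euler–Maruyama to $dX_t = \operatorname{sgn}(X_t)\,dV_t$; the step size and the convention $\operatorname{sgn}(0) := 1$ are assumptions. Because $a^2 = 1$, the quadratic variation of $X$ matches that of $V$, a law-level property that holds however the sign at zero is broken, consistent with the fact that only the joint law of $(X, a)$, not a pathwise solution on a fixed space, is determined.

```python
import math
import random

def simulate(x0=0.0, dt=1e-3, n_steps=5000, seed=1):
    """Euler-Maruyama path of dX_t = sgn(X_t) dV_t under T(x) = sgn(x)."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(n_steps):
        a = 1.0 if x >= 0 else -1.0           # sgn(0) := 1 (assumed convention)
        x += a * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

path = simulate()
# With a^2 = 1, the realized quadratic variation is close to n_steps * dt.
qv = sum((b - a) ** 2 for a, b in zip(path, path[1:]))
```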

4. Diffusion-Type Examples and PDE Structure

The algorithm is particularly tractable in diffusion-controlled systems. Two canonical cases are provided:

  1. Discounted Infinite-Horizon Diffusions: With $D = \mathbb{R}^d$ and the generator

$$\mathcal{L}^a \varphi(x) = \tfrac{1}{2}\,\sigma(x,a)^\top H\varphi(x)\,\sigma(x,a) + p(x,a)^\top \nabla \varphi(x) - c(x,a)\,\varphi(x)$$

(with diffusion coefficient $\sigma(x,a)$, drift $p(x,a)$, Hessian $H\varphi$, and discount rate $c(x,a)$),

and under assumptions ensuring uniform ellipticity, Lipschitz dynamics, and compactness of $A$, all PIA hypotheses are satisfied and convergence follows. The iterative improvement step corresponds to pointwise maximization of the generalized Hamiltonian.

  2. Finite-Horizon Diffusions: The state space extends to $(x, t)$. Given suitable regularity (Lipschitz coefficients, bounded terminal cost), PIA remains applicable by the same analytical reasoning.

In both settings, the iterative step

$$T_{n+1}(x) \in \arg\max_{a \in A} \left\{ \mathcal{L}^a V^{(T_n)}(x) + f(x,a) \right\}$$

can be interpreted as iteratively approximating the solution of the Hamilton–Jacobi–Bellman (HJB) equation, using smoothness properties of $\mathcal{L}^a$ and $f(x,a)$.
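
For the discounted infinite-horizon case, the whole loop can be sketched with finite differences on a one-dimensional grid: policy evaluation solves the linear system $\mathcal{L}^T V + f(\cdot, T(\cdot)) = 0$ with zero boundary data, and improvement maximizes the discrete Hamiltonian pointwise. The model data here (unit diffusion, drift equal to the control, discount rate $1$, quadratic costs, $A = \{-1, 0, 1\}$, $D = (-1, 1)$) are illustrative assumptions, not the paper's example.

```python
N, h, c = 99, 2.0 / 100, 1.0                  # interior points, mesh, discount
xs = [-1.0 + (i + 1) * h for i in range(N)]   # grid on D = (-1, 1)
A = [-1.0, 0.0, 1.0]                          # finite action set (assumed)
f = lambda x, a: -x * x - 0.1 * a * a         # running cost (assumed)

def solve_tridiag(lo, dg, up, rhs):
    """Thomas algorithm for a tridiagonal linear system."""
    dg, rhs = dg[:], rhs[:]
    for i in range(1, len(dg)):
        w = lo[i] / dg[i - 1]
        dg[i] -= w * up[i - 1]
        rhs[i] -= w * rhs[i - 1]
    v = [0.0] * len(dg)
    v[-1] = rhs[-1] / dg[-1]
    for i in range(len(dg) - 2, -1, -1):
        v[i] = (rhs[i] - up[i] * v[i + 1]) / dg[i]
    return v

def evaluate(policy):
    """Solve 0.5 V'' + a V' - c V + f = 0 with V = 0 at the boundary."""
    lo = [0.0] + [0.5 / h**2 - policy[i] / (2 * h) for i in range(1, N)]
    dg = [-1.0 / h**2 - c] * N
    up = [0.5 / h**2 + policy[i] / (2 * h) for i in range(N - 1)] + [0.0]
    rhs = [-f(xs[i], policy[i]) for i in range(N)]
    return solve_tridiag(lo, dg, up, rhs)

def hamiltonian(i, a, V):
    """Discrete L^a V(x_i) + f(x_i, a), with zero boundary values."""
    vm = V[i - 1] if i > 0 else 0.0
    vp = V[i + 1] if i < N - 1 else 0.0
    return (0.5 * (vp - 2 * V[i] + vm) / h**2
            + a * (vp - vm) / (2 * h) - c * V[i] + f(xs[i], a))

policy = [0.0] * N                            # initial Markov policy T_0 = 0
values = []
for _ in range(10):                           # PIA iterations
    V = evaluate(policy)
    values.append(V)
    policy = [max(A, key=lambda a: hamiltonian(i, a, V)) for i in range(N)]
```

On this discrete scheme the payoffs improve monotonically in $n$, mirroring the monotonicity result of Section 2.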

5. Algorithmic Properties, Residuals, and Optimality

The analytic machinery ensures that the residual

$$\mathcal{L}^{T_{n+1}} V^{(T_n)}(x) + f(x, T_{n+1}(x))$$

shrinks uniformly along a convergent subsequence, and that the limiting policy $T^*$ is a pointwise maximizer:

$$T^*(x) \in \arg\max_{a \in A} \left\{ \mathcal{L}^a V(x) + f(x,a) \right\}.$$

If the process is sufficiently regular, $V$ solves the HJB PDE and $T^*$ is optimal in the sense that $V^{(T^*)} = V$.

This architecture ensures that the algorithm recovers the optimal value function and a corresponding Markov control policy, provided the initial policy and system data satisfy the regularity and compactness assumptions.

6. Broader Impact, Limitations, and Significance

The paper demonstrates that the continuous-time PIA, under deterministic analytic conditions on the data (regularity, compactness, uniform ellipticity), is

  • Well-posed: The algorithm is defined without time discretization or indirect limit arguments.
  • Convergent: The value/payoff sequence converges monotonically to the optimal value function; policies converge to an optimal Markov policy on compact subsets.
  • General: The framework accommodates controlled diffusions, potentially degenerate, over general domains and with general running/terminal costs.

The emphasis on the weak formulation generalizes the class of admissible controls, permitting solutions in scenarios where the strong formulation breaks down (notably when natural optimal policies cannot be realized via solutions to strongly posed SDEs).

A further strength is the compatibility with PDE methods: PIA can be interpreted as a dynamic programming method that produces classical solutions of HJB equations when smoothness permits, aligning with contemporary perspectives on verification theorems and viscosity-solution theory.

The extension of PIA to continuous-time, weakly-controlled, and diffusion-based problems bridges the theory of dynamic programming, stochastic process analysis, and PDE methods for control. This framework provides a robust theoretical underpinning for practical policy improvement implementations in continuous domains, supplies tools for verification by linking iterated policy improvement to convergence of residuals in HJB-type PDEs, and presents analytic conditions that can be verified using the deterministic structure of the model.

The methodology’s flexibility is evidenced by its applicability to problems outside classical frameworks (e.g., those with singularities in SDEs, boundary exit phenomena, general reward structures) and its relevance for modern computational and reinforcement learning settings in high-dimensional, continuous-time domains. The PIA as developed in (Jacka et al., 2015) forms a cornerstone for further research on continuous-time reinforcement learning and the theory of controlled diffusions.
