Policy Gradient Adaptive Control
- Policy Gradient Adaptive Control (PGAC) is a class of adaptive control algorithms that uses policy gradients to update control policies based solely on real-time system data.
- It integrates actor-critic architectures, stochastic approximation, and hybrid formulations to address nonstationarity, evolving dynamics, and diverse control objectives.
- PGAC methods have been validated in applications such as autonomous vehicles and power systems, demonstrating convergence, stability, and safety in challenging environments.
Policy Gradient Adaptive Control (PGAC) refers to a broad class of adaptive control algorithms that use policy gradient methods to update control policies in real time, aiming to optimize closed-loop performance based solely on data collected from the system. These algorithms are applicable in finite and continuous Markov decision processes (MDPs), and are particularly suited to control tasks that involve nonstationarity, changing dynamics, or additional objectives not available during offline training. PGAC integrates actor-critic architectures, stochastic approximation, and advanced policy update rules—including direct, indirect, and hybrid formulations—and incorporates convergence, stability, and safety analyses rooted in both reinforcement learning and classical adaptive control.
1. Formal Problem Setting and General Approach
In the canonical setting, PGAC operates within an MDP (S, A, P, r, γ), where S and A are (finite or continuous) state and action spaces, P is the (possibly unknown) transition kernel, r is the reward (or negative cost) function, and γ ∈ (0, 1] is the discount factor. The goal is to adaptively tune a parameterized policy π_θ to minimize an expected cumulative cost (or maximize reward), often in the presence of unknown or time-varying system dynamics.
PGAC techniques use online samples to incrementally update both the value function approximation ("critic") and the policy parameters ("actor") via a principled policy gradient, which is typically estimated with respect to a performance objective such as mean-squared projected Bellman error (MSPBE) or infinite-horizon LQR cost in the linear case (Lehnert et al., 2015, Zhao et al., 6 May 2025, Laurent et al., 6 Jan 2026).
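In generic actor-critic notation (standard textbook forms, not specific to any one of the cited papers), the policy gradient and the coupled actor-critic updates can be written as:

```latex
% Policy-gradient theorem for the discounted objective J(\theta):
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}
    \big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \big]

% Actor-critic surrogate: replace Q^{\pi_\theta} by a critic V_w via the TD error
\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t), \qquad
w \leftarrow w + \alpha_c\, \delta_t\, \nabla_w V_w(s_t), \qquad
\theta \leftarrow \theta + \alpha_a\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

Here α_c and α_a are the critic and actor step sizes; the two-timescale setting of Section 3 corresponds to α_a decaying faster than α_c.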
2. Algorithmic Structures: Direct, Indirect, and Hybrid Forms
PGAC encompasses several algorithm variants:
- Indirect PGAC estimates system dynamics from data (e.g., via online least-squares) and computes policy gradients with respect to the estimated model. For linear systems, gradient expressions involve policy-evaluation Lyapunov equations and take the form
  ∇J(K) = 2((R + B̂ᵀP_K B̂)K − B̂ᵀP_K Â)Σ_K,
  where Â and B̂ are the estimated system matrices, P_K and Σ_K solve algebraic Lyapunov equations, and J(K) is the LQR cost (Zhao et al., 6 May 2025, Laurent et al., 6 Jan 2026).
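Under the stated stability assumption, this indirect gradient amounts to two discrete Lyapunov solves on the (estimated) model. A minimal sketch using SciPy (the matrix names are generic; this is an illustration of the formula, not code from the cited papers):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_policy_gradient(A, B, Q, R, K, Sigma0):
    """Policy gradient of the LQR cost for u = -K x (model-based sketch).

    Assumes A - B K is Schur stable. P_K is the policy-evaluation (value)
    matrix and Sigma_K the stationary state covariance, each obtained from
    a discrete algebraic Lyapunov equation.
    """
    Acl = A - B @ K
    # P_K = Acl^T P_K Acl + Q + K^T R K
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    # Sigma_K = Acl Sigma_K Acl^T + Sigma0
    Sigma = solve_discrete_lyapunov(Acl, Sigma0)
    return 2 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Sigma
```

A quick self-check: at the gain obtained from the discrete algebraic Riccati equation, the gradient vanishes.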
- Direct PGAC computes gradients using empirical covariances of observed state and control trajectories, circumventing explicit model identification. Policy parameterization can be based on direct covariance mappings; updates proceed according to a data-driven gradient with possible preconditioning (e.g., natural or Gauss-Newton gradients).
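One common model-free route to such data-driven gradients is a two-point smoothed (zeroth-order) estimate built from perturbed rollouts. A sketch under illustrative assumptions (toy scalar system, hypothetical smoothing radius and step size):

```python
import numpy as np

def rollout_cost(K, A, B, Q, R, x0, T=100):
    """Finite-horizon quadratic cost of u = -K x from x0 (LQR proxy)."""
    x, J = x0.copy(), 0.0
    for _ in range(T):
        u = -K @ x
        J += float(x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u
    return J

def zeroth_order_gradient(K, cost, n_dirs=64, radius=0.01, rng=None):
    """Two-point smoothed (zeroth-order) gradient estimate of cost(K)."""
    rng = np.random.default_rng(0) if rng is None else rng
    g = np.zeros_like(K)
    for _ in range(n_dirs):
        U = rng.standard_normal(K.shape)  # random perturbation direction
        g += (cost(K + radius * U) - cost(K - radius * U)) / (2 * radius) * U
    return g / n_dirs
```

The smoothing radius must be small enough that perturbed gains remain stabilizing, which is exactly the kind of safeguard the direct variants formalize.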
- Hybrid/Embedded PGAC variants combine learned (policy-gradient) and classical controllers, with real-time switching or mixing to preserve safety or enforce hard constraints, as seen in car-following systems with DDPG and cooperative adaptive cruise control (CACC) (Yan et al., 2021).
The actor-critic structure is typical: the critic employs fast-timescale temporal difference (TD) or value fitting, and the actor performs slower stochastic gradient ascent steps using the critic for policy evaluation (Lehnert et al., 2015, Hao et al., 2023).
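As a concrete illustration of the two-timescale structure (a toy tabular example, not drawn from the cited papers): the critic takes fast TD(0) steps while the actor takes slower policy-gradient steps driven by the TD error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9
reward = np.array([[0.0, 1.0],
                   [0.0, 1.0]])          # reward depends only on the action
theta = np.zeros((n_states, n_actions))  # actor: softmax logits
v = np.zeros(n_states)                   # critic: tabular state values
alpha_critic, alpha_actor = 0.1, 0.01    # fast vs. slow timescales

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for _ in range(20000):
    p = policy(s)
    a = rng.choice(n_actions, p=p)
    r = reward[s, a]
    s_next = int(rng.integers(n_states))        # uniform transitions
    delta = r + gamma * v[s_next] - v[s]        # TD(0) error
    v[s] += alpha_critic * delta                # critic update (fast)
    grad_log = -p                               # gradient of log-softmax...
    grad_log[a] += 1.0                          # ...for the chosen action
    theta[s] += alpha_actor * delta * grad_log  # actor update (slow)
    s = s_next
```

After training, the policy should strongly prefer the rewarded action in both states.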
3. Convergence, Stability, and Safety Guarantees
Convergence analyses for PGAC rely on well-studied stochastic approximation frameworks and, in some cases, two-timescale methods. Under standard conditions—ergodicity, smoothness, bounded features, and diminishing step sizes—the coupled actor-critic updates converge almost surely to a local stationary point of the chosen performance objective, such as minimizing MSPBE or LQR cost (Lehnert et al., 2015, Zhao et al., 6 May 2025, Laurent et al., 6 Jan 2026).
Critical stability guarantees are established for linear and switched linear systems:
- Strong Sequential Stability: If the system is persistently excited and the policy adapts slowly enough, the state remains bounded with an explicit exponential-plus-bias bound, even under unknown, switching system parameters.
- Finite-Time Tracking: In switched systems, stability and cost contraction are ensured by enforcing a minimum dwell time and controlling the step size so that each mode's optimality gap is contracted before the next switch (Laurent et al., 6 Jan 2026).
For policy-gradient based control in safety-critical or physically deployed systems, monotonic return-increase certificates are established via smoothness and concentration bounds, enabling step-size and batch-size selection that enforces non-decreasing expected return with high probability, in the spirit of conservative/monotonic adaptation laws from classical control (Papini et al., 2019).
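A generic version of the underlying smoothness argument (a sketch with simplified constants; the cited works give sharper, algorithm-specific bounds):

```latex
% If J is L-smooth and the estimate \hat g satisfies
% \|\hat g - \nabla J(\theta)\| \le \epsilon with high probability
% (controlled by the batch size), then the step \theta' = \theta + \alpha \hat g gives
J(\theta') \;\ge\; J(\theta) + \alpha \|\hat g\|^2
   - \alpha \epsilon \|\hat g\| - \tfrac{L}{2}\,\alpha^2 \|\hat g\|^2 .
% Hence any \alpha \le \tfrac{2}{L}\bigl(1 - \epsilon / \|\hat g\|\bigr)
% guarantees J(\theta') \ge J(\theta) whenever \|\hat g\| > \epsilon.
```

This is what links step-size and batch-size selection: the batch size controls ε, and the admissible step size shrinks as ε approaches the gradient magnitude.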
4. PGAC Algorithmic Elements and Implementation Workflow
A typical PGAC implementation features the following steps (component variations are domain-dependent):
- Initialization:
- Initialize the policy parameters θ, the critic (value-function) parameters, and any system model or replay buffer as appropriate.
- Data Collection and Processing:
- Apply the current control with exploration noise; observe the resulting state, action, reward/cost, and next-state samples.
- For indirect PGAC, update system identification estimates; for direct PGAC, update empirical covariances.
- Critic Update:
- TD-style update (e.g., TD(0) or TD(λ)) or value fitting; in continuous control, minimize squared Bellman error using batches sampled from the buffer.
- Actor Update:
- Apply sampled policy gradient (potentially incorporating corrections for distributional drift and importance weighting in off-policy setups).
- In the LQR case, descend the cost function via (optionally regularized) gradients—plain, natural, or Gauss-Newton.
- Safety and Adaptation:
- Adjust step sizes or batch sizes adaptively to enforce monotonic improvement or track model switches.
- For hybrid setups, implement real-time controller arbitration and safety checks with provable constraint satisfaction (Yan et al., 2021, Papini et al., 2019).
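The workflow above can be condensed into a runnable toy loop. The following indirect-PGAC sketch on a scalar linear system uses hypothetical constants (noise levels, step size, clipping threshold) chosen purely for illustration: it identifies (a, b) by least squares from excited data, then takes small, safeguarded gradient steps on the gain k.

```python
import numpy as np

rng = np.random.default_rng(1)
a_true, b_true = 0.9, 0.5        # unknown to the controller
q, r_cost = 1.0, 1.0             # LQR weights
k, x = 0.0, 1.0                  # initial (stabilizing) gain and state
regressors, targets = [], []     # identification data

for t in range(10000):
    u = -k * x + 0.1 * rng.standard_normal()     # persistent excitation
    x_next = a_true * x + b_true * u + 0.01 * rng.standard_normal()
    regressors.append([x, u]); targets.append(x_next)
    x = x_next
    if t >= 100 and t % 50 == 0:
        # system identification: least-squares fit of x_next ~ a x + b u
        (a_hat, b_hat), *_ = np.linalg.lstsq(
            np.array(regressors), np.array(targets), rcond=None)
        a_cl = a_hat - b_hat * k
        if abs(a_cl) < 1.0:                      # step only if est. stable
            P = (q + r_cost * k**2) / (1 - a_cl**2)   # scalar Lyapunov solve
            Sigma = 1.0 / (1 - a_cl**2)
            grad = 2 * ((r_cost + b_hat**2 * P) * k
                        - b_hat * P * a_hat) * Sigma
            k -= 0.01 * float(np.clip(grad, -20.0, 20.0))  # safeguarded step
```

The stability check and gradient clipping stand in for the safety/adaptation layer; in this toy run the gain approaches the Riccati-optimal value while the closed loop stays stable throughout.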
5. Applications and Empirical Results
PGAC methods have been validated on diverse control and optimization tasks:
- Classical Control: Linear quadratic regulators with online disturbance, switching, and unknown parameters (Zhao et al., 6 May 2025, Laurent et al., 6 Jan 2026).
- Nonlinear and Robotic Systems: Adapting a trained policy to secondary objectives (e.g., obstacle avoidance or task switching) while retaining performance on the primary control objective, with analyses establishing convergence in gradient norm and bounding sample complexity (Hao et al., 2023).
- Hybrid/Embedded Control: Autonomous vehicles (car-following), where DDPG and model-based CACC are combined to ensure stability, low jerk, and robust tracking under varying conditions (Yan et al., 2021).
- Power Systems: EMT-in-the-loop adaptive control of sub-synchronous oscillations in DFIG wind farms, utilizing filtered waveform features and direct policy-gradient learning for real-time gain tuning (Mukherjee et al., 8 Nov 2025).
- Black-Box Optimization: Adaptive parameter control in evolutionary algorithms via an LSTM-based, REINFORCE-updated policy, which consistently outperforms static or heuristically scheduled alternatives on the CEC'13 and CEC'17 benchmark suites (Sun et al., 2021).
Empirical evaluations consistently show that PGAC-type controllers achieve stable, convergent, and high-performing adaptation, often surpassing classic non-adaptive or non-gradient methods in both performance and robustness.
6. Connections with Reinforcement Learning and Adaptive Control
PGAC integrates principles from both modern reinforcement learning and established adaptive control.
- From RL: policy gradient methods, actor-critic architectures, importance sampling for off-policy adaptation, batch-based sample updates, and handling function approximation (Lehnert et al., 2015, Papini et al., 2019).
- From adaptive control: persistence of excitation to guarantee parameter identifiability, strong and sequential stability notions, monotonic adaptation (Lyapunov-based guarantees), and explicit tracking of time-varying system modes (Laurent et al., 6 Jan 2026, Zhao et al., 6 May 2025).
- Safe exploration and adaptation parallels MRAC, robust/adaptive MPC, and Lyapunov-constrained adaptation, re-framed with return-based online performance certificates (Papini et al., 2019).
This confluence enables PGAC to deliver data-driven control schemes with guarantees on performance improvement, stability, and adaptability, even in the presence of model uncertainty, disturbances, and nonstationarity.
7. Limitations and Practical Considerations
PGAC's principled structure is subject to several technical prerequisites:
- Persistent Excitation: Excitation noise or exploration is essential to ensure convergence of model or covariance estimates, especially in indirect/direct LQR variants.
- Step-Size and Batch Optimization: Step sizes must be chosen to balance adaptation rate and stability; methods exist for adaptive tuning of these meta-parameters to enforce safety (Papini et al., 2019).
- Computation: Critical routines (e.g., Lyapunov/Riccati solvers, covariance update, empirical moment computation) must be efficiently implemented, especially for high-dimensional or real-time systems (Zhao et al., 6 May 2025).
- Practical Deployment: Sensor bandwidth, buffer management, and critic-actor learning-rate scheduling are crucial, as is filtering or preprocessing for signal-based adaptation (e.g., in power electronics applications) (Mukherjee et al., 8 Nov 2025).
For nonlinear systems or highly nonstationary environments, additional extensions (forgetting factors, pruning, deep networks, or multi-step adaptation) may be required to sustain performance.
References:
- (Lehnert et al., 2015) Lehnert & Precup, "Policy Gradient Methods for Off-policy Control"
- (Papini et al., 2019) Pirotta et al., "Smoothing Policies and Safe Policy Gradients"
- (Hao et al., 2023) Liu et al., "Adaptive Policy Learning to Additional Tasks"
- (Yan et al., 2021) Yan et al., "Hybrid Car-Following Strategy based on Deep Deterministic Policy Gradient and Cooperative Adaptive Cruise Control"
- (Laurent et al., 6 Jan 2026) Laurent et al., "Adaptive Control of Unknown Linear Switched Systems via Policy Gradient Methods"
- (Zhao et al., 6 May 2025) Guo et al., "Policy Gradient Adaptive Control for the LQR: Indirect and Direct Approaches"
- (Mukherjee et al., 8 Nov 2025) Shenoy et al., "Policy Gradient-Based EMT-in-the-Loop Learning to Mitigate Sub-Synchronous Control Interactions"
- (Sun et al., 2021) Li et al., "Learning adaptive differential evolution algorithm from optimization experiences by policy gradient"