- The paper introduces a receding-horizon policy gradient method that optimizes polytopic controller synthesis for qLPV systems.
- It decomposes the integrated finite-horizon cost into strongly convex, stage-wise subproblems, guaranteeing unique global minimizers and efficient convergence.
- Empirical results on a 2-DoF aeroelastic wing problem demonstrate near-optimal performance and robust stabilization compared to LMI-based methods.
Receding-Horizon Policy Gradient for Polytopic Controller Synthesis
Introduction and Motivation
The synthesis of controllers for nonlinear dynamical systems operating across wide parameter ranges remains a central challenge, especially when systematic, non-conservative, and computationally scalable formulations are sought. The quasi-linear parameter-varying (qLPV) framework delivers an exact embedding of nonlinear dynamics into parameterized linear systems, where polytopic representations via Tensor Product (TP) model transformation, with tunable complexity via Higher-Order Singular Value Decomposition (HOSVD), have become standard in real-world applications. The prevalent Parallel Distributed Compensation (PDC) architecture blends local vertex controllers convexly using the same polytopic weighting functions as the plant, enabling global nonlinear control through local linear design techniques.
Historically, PDC synthesis has relied on linear matrix inequality (LMI)-based stability analysis, which mandates a common Lyapunov function. This introduces conservatism that grows with model fidelity (i.e., the number of polytopic vertices) and does not directly address performance optimization. Recent advances in policy optimization for quadratic control (notably LQR and its robust and risk-sensitive variants) have established a theoretical foundation for policy-gradient-driven synthesis with global convergence and performance guarantees; however, these methodologies have not been integrated into polytopic control. The present work fills this gap with an algorithmic and theoretical framework.
The starting point is a discrete-time nonlinear system x_{t+1} = f(x_t, u_t) admitting an exact qLPV representation,
x_{t+1} = A(p_t) x_t + B(p_t) u_t,
with the scheduling parameter p_t evolving within a compact set \Omega. Through the TP model transformation, (A(p), B(p)) are re-expressed as convex combinations of vertex matrices,
[A(p)  B(p)] = \sum_{i=1}^{V} \alpha_i(p) [A_i  B_i],
where the weighting functions \alpha_i(p) are sum-normalized and non-negative (SNNN) as well as linearly independent (guaranteed by HOSVD), and afford adjustable fidelity via their rank. The PDC controller matches this polytopic blending,
K(p) = \sum_{i=1}^{V} \alpha_i(p) K_i,
so the design variables are the set of vertex gains {Ki​}, under the constraint that the controller structure is inherited and fixed by the plant transformation.
However, the resulting closed-loop system introduces cross-terms due to the shared weighting functions: the closed-loop matrix contains \sum_{i,j} \alpha_i(p) \alpha_j(p) (A_i + B_i K_j), and, crucially, joint stability cannot be certified purely by vertex-wise stability.
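As a concrete illustration of the shared-weighting structure, the following sketch (with hypothetical two-vertex matrices, not taken from the paper) evaluates the PDC gain K(p) and the blended closed-loop matrix with its cross-terms:

```python
import numpy as np

# Illustrative 2-vertex polytopic system; all matrices are hypothetical.
A = [np.array([[1.0, 0.1], [0.0, 1.0]]),
     np.array([[1.0, 0.2], [0.0, 1.0]])]
B = [np.array([[0.0], [0.1]]),
     np.array([[0.0], [0.2]])]
K = [np.array([[-1.0, -2.0]]),
     np.array([[-0.5, -1.5]])]
V = 2

def alpha(p):
    """SNNN weighting functions on p in [0, 1]: non-negative, summing to one."""
    return np.array([1.0 - p, p])

def pdc_gain(p):
    """PDC gain K(p) = sum_i alpha_i(p) K_i."""
    return sum(a * Ki for a, Ki in zip(alpha(p), K))

def closed_loop(p):
    """A_cl(p) = sum_{i,j} alpha_i(p) alpha_j(p) (A_i + B_i K_j): note the cross-terms."""
    a = alpha(p)
    return sum(a[i] * a[j] * (A[i] + B[i] @ K[j])
               for i in range(V) for j in range(V))
```

Because A(p) and B(p) are themselves convex blends, the double sum collapses to A(p) + B(p) K(p); the cross-terms nonetheless matter for analysis, since vertex-wise stability of each A_i + B_i K_i does not certify stability of this blend.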
Integrated Finite-Horizon Cost and Optimization Problem
To evaluate and optimize closed-loop performance, the cost function is defined as the expected finite-horizon quadratic cost over the parameter space, averaging with respect to a measure \mu (usually uniform on \Omega). With a horizon N, cost matrices Q \succeq 0 and R \succ 0, and a terminal cost matrix Q_N, the cost for a fixed parameter p and gain sequence \{K^t\}_{t=0}^{N-1} is
J_N(p; \{K^t\}) = \sum_{t=0}^{N-1} ( x_t^T Q x_t + u_t^T R u_t ) + x_N^T Q_N x_N,
with the global objective
J(\{K^t\}) = \int_\Omega J_N(p; \{K^t\}) d\mu(p).
A key aspect is the "frozen-parameter" assumption: at each optimization stage, the scheduling variable p is held constant, yielding a collection of decoupled LTI systems and bypassing the technical complications of dynamic scheduling.
The resulting optimization is a policy search over all vertex gain sequences without explicit stability constraints, since the finite-horizon cost is well-defined for any gains.
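Under the frozen-parameter assumption, the integrated objective can be approximated by averaging LTI rollout costs over a quadrature grid on the scheduling set. A minimal sketch, with illustrative dynamics and gains (not from the paper):

```python
import numpy as np

Q, R = np.eye(2), np.eye(1)   # illustrative cost matrices
N = 20                        # horizon

def frozen_cost(Apl, Bpl, Kp, x0):
    """Finite-horizon quadratic cost for one frozen parameter (LTI closed loop)."""
    J, x = 0.0, x0
    for _ in range(N):
        u = Kp @ x
        J += float(x @ Q @ x + u @ R @ u)
        x = Apl @ x + Bpl @ u
    return J

def integrated_cost(Afun, Bfun, Kfun, x0, grid):
    """Average the frozen-parameter costs over a quadrature grid (uniform measure)."""
    return np.mean([frozen_cost(Afun(p), Bfun(p), Kfun(p), x0) for p in grid])

# Hypothetical scheduling-dependent system and a fixed stabilizing gain.
Afun = lambda p: np.array([[1.0, 0.1], [0.0, 1.0 - 0.05 * p]])
Bfun = lambda p: np.array([[0.0], [0.1]])
Kfun = lambda p: np.array([[-2.0, -3.0]])

grid = np.linspace(0.0, 1.0, 11)
J = integrated_cost(Afun, Bfun, Kfun, np.array([1.0, 0.0]), grid)
```

The quadrature grid here plays the role of the measure \mu; a sparse scheme would replace `grid` and the uniform mean with weighted nodes.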
Backward-Stage Decomposition and Structural Results
The main theoretical innovation is the decomposition of the high-dimensional policy optimization into a chain of N backward stages, each a strongly convex quadratic problem in that stage's set of vertex gains \{K_i^t\}. This reduction is enabled by:
- The cost-to-go recursion separates the decision at the current stage from future stages.
- The integrand is quadratic in the vertex gains due to the linear-in-the-parameters structure.
- The Gram matrix of the weighting functions (with respect to the measure \mu) is positive definite due to the HOSVD-based construction, ensuring strict convexity.
Each stage's Hessian is block-structured, with strong convexity guaranteeing existence and uniqueness of a global minimizer and enabling the application of direct solvers or linearly convergent gradient descent from arbitrary initialization. Explicit formulae for the gradient and residuals are presented, showing that the optimal gain acts as a weighted projection of the pointwise Riccati gain within the PDC subspace.
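The projection interpretation can be sketched as follows: compute the pointwise Riccati gain K*(p) on a grid, then fit the PDC form K(p) = \sum_i \alpha_i(p) K_i by least squares, whose normal equations involve the Gram matrix of the weighting functions. An unweighted fit is shown for brevity (the paper's stage problem uses a curvature-weighted projection), and all matrices are illustrative:

```python
import numpy as np

P = np.diag([2.0, 1.0])   # assumed cost-to-go from later stages
R = np.eye(1)

# Hypothetical scheduling-dependent system and SNNN weights on [0, 1].
Afun = lambda p: np.array([[1.0, 0.1], [0.0, 1.0 + 0.1 * p]])
Bfun = lambda p: np.array([[0.0], [0.1]])
alpha = lambda p: np.array([1.0 - p, p])

grid = np.linspace(0.0, 1.0, 21)

def riccati_gain(p):
    """Pointwise one-step Riccati gain K*(p) = -(R + B'PB)^{-1} B'PA."""
    Ap, Bp = Afun(p), Bfun(p)
    return -np.linalg.solve(R + Bp.T @ P @ Bp, Bp.T @ P @ Ap)

# Least-squares fit of K(p) = sum_i alpha_i(p) K_i to K*(p) on the grid;
# the normal equations use the Gram matrix of the weights (PD by linear
# independence of the SNNN functions).
Phi = np.stack([alpha(p) for p in grid])               # 21 x 2 design matrix
Y = np.stack([riccati_gain(p).ravel() for p in grid])  # flattened gains per node
Kflat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
K = [Kflat[i].reshape(1, 2) for i in range(2)]         # projected vertex gains
```

In this toy example K*(p) happens to be affine in p, so the two-vertex PDC subspace contains it exactly and the projection residual vanishes; with richer dynamics the residual measures the representational fidelity of the weighting functions.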
P-RHPG Algorithm: Structure, Complexity, and Convergence
P-RHPG (Polytopic Receding-Horizon Policy Gradient) proceeds via a finite-horizon backward sweep: at each stage, the unique stage-optimal gains are computed (either via solving a linear system or gradient descent), the cost-to-go is updated, and the process recurses to stage zero. Only the first-stage gains are deployed as the (stationary) PDC controller—mirroring the receding-horizon control paradigm.
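A minimal sketch of the backward sweep, assuming per-quadrature-point cost-to-go propagation under the frozen-parameter assumption (system matrices, grid, and horizon are illustrative, not from the paper):

```python
import numpy as np

n, m, V, N = 2, 1, 2, 30
Q, R = np.eye(n), np.eye(m)

# Hypothetical open-loop-unstable qLPV system and SNNN weights on [0, 1].
Afun = lambda p: np.array([[1.0, 0.1], [0.0, 1.0 + 0.1 * p]])
Bfun = lambda p: np.array([[0.0], [0.1]])
alpha = lambda p: np.array([1.0 - p, p])

grid = np.linspace(0.0, 1.0, 21)
w = np.full(grid.size, 1.0 / grid.size)      # uniform quadrature weights
P = [np.zeros((n, n)) for _ in grid]         # zero terminal cost per node

for _ in range(N):                           # backward sweep over stages
    # Assemble the strongly convex stage problem in the stacked vertex gains:
    # sum_k w_k a_i a_j (R + B'PB) K_j = -sum_k w_k a_i B'PA.
    H = np.zeros((V * m, V * m))
    g = np.zeros((V * m, n))
    for k, p in enumerate(grid):
        a, Ap, Bp = alpha(p), Afun(p), Bfun(p)
        Hk = R + Bp.T @ P[k] @ Bp            # pointwise curvature
        bk = Bp.T @ P[k] @ Ap
        for i in range(V):
            for j in range(V):
                H[i*m:(i+1)*m, j*m:(j+1)*m] += w[k] * a[i] * a[j] * Hk
            g[i*m:(i+1)*m] += w[k] * a[i] * bk
    Kstack = np.linalg.solve(H, -g)          # unique stage-optimal gains (H is PD)
    Kv = [Kstack[i*m:(i+1)*m] for i in range(V)]
    for k, p in enumerate(grid):             # propagate the cost-to-go per node
        Kp = sum(ai * Ki for ai, Ki in zip(alpha(p), Kv))
        Acl = Afun(p) + Bfun(p) @ Kp
        P[k] = Q + Kp.T @ R @ Kp + Acl.T @ P[k] @ Acl

K_deploy = Kv   # only the first-stage gains are deployed as the PDC controller
```

The stage solve here uses a direct linear solver; the linearly convergent gradient descent variant would replace `np.linalg.solve` with iterations on the same strongly convex quadratic.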
The computational complexity is dominated by assembling the Hessian (scaling quadratically in the number of quadrature points and polytopic vertices) and solving a linear system per stage (cubic in the stacked gain dimension), with overall cost scaling linearly in the horizon N.
Monotonic cost convergence and gain boundedness are formally proven for zero terminal cost, and a general "squeeze" argument shows that for any terminal cost satisfying a mild Lyapunov condition, convergence to a unique integrated optimum occurs. The analysis illuminates how the structure and conditioning of the weighting Gram matrix interpolate between convergence speed and representational fidelity.
Empirical Validation
Experiments focus on the robust flutter control problem for a 2-DoF aeroelastic wing system with three scheduling parameters. Using a range of grid resolutions (number of vertices), P-RHPG is benchmarked against classical LMI-based PDC synthesis:
- Convergence: Across a variety of terminal costs, P-RHPG achieves monotonic and bounded convergence to a unique infinite-horizon cost, with the squeeze gap becoming negligible at large horizons, even when the open-loop systems are unstable.
- Performance: P-RHPG attains costs close to the pointwise Riccati lower bound at fine grid resolution, significantly outperforming LMI-PDC, which is often infeasible for higher vertex counts and, when feasible, displays marked suboptimality and poor transients.
- Stabilization: Joint stability of the closed-loop is achieved by P-RHPG over the entire parameter space without imposing explicit constraints, whereas LMI-PDC fails beyond coarse discretizations.
- Algorithmic confirmation: Stagewise gradient descent converges linearly with rates dictated by theoretical conditioning, supporting the analytic results.
Theoretical and Practical Implications
The theoretical innovation is the recognition that polytopic PDC controller synthesis admits a stagewise decomposition equipped with strong convexity and unique minimizer properties, making policy gradient approaches not just feasible but also computationally favorable relative to LMI-based methods. This decouples synthesis tractability from LMI conservatism, enabling high-fidelity polytopic models without incurring infeasibility.
Practically, this framework permits efficient, scalable, and near-optimal controller synthesis in applications with high-dimensional scheduling, provided a tractable (possibly sparse) quadrature scheme is used. The architecture is extensible to model-free settings, output feedback, distributed multi-agent systems, and potentially nonlinear performance objectives.
An open theoretical issue remains: an analytic proof that the time-varying gains become stationary in the infinite-horizon limit for general polytopic structures, a PDC analogue of Riccati convergence.
Conclusion
This paper presents the Polytopic Receding-Horizon Policy Gradient algorithm, which unifies direct policy optimization with polytopic controller synthesis in qLPV systems. By leveraging the inherent strong convexity of the decomposed stage costs, the method ensures convergence to a unique global optimum and enables high-fidelity controller synthesis without LMI-induced conservatism. Theoretical guarantees and empirical results substantiate the effectiveness and scalability of P-RHPG, establishing a foundation for further developments in high-dimensional and model-free control design for nonlinear systems (2603.29283).