Growing Policy Optimization (GPO)
- Growing Policy Optimization is a reinforcement learning framework that incrementally expands the effective action space to stabilize exploration in high-dimensional, torque-based control tasks.
- It employs a time-varying, differentiable transformation using a Gompertz curve to ensure reliable gradient estimation while preserving standard PPO loss objectives.
- Empirical evaluations show that GPO significantly improves sim-to-real robustness and control precision in quadruped and hexapod locomotion compared to baseline PPO.
Growing Policy Optimization (GPO) is a reinforcement learning (RL) training framework designed specifically for high-dimensional, continuous control problems in legged robotics. It addresses fundamental limitations in policy optimization for torque-based control, where exploration of the action space and reliable gradient estimation are particularly challenging. GPO leverages a time-varying, differentiable transformation of policy actions, initially restricting the effective action space and gradually expanding it as training progresses. This mechanism yields environments with more informative gradients and stabilized exploration, facilitating both faster convergence and improved final performance. The framework is shown to preserve standard policy optimization objectives and introduces only bounded, diminishing distortion to gradient estimates, supporting stable and theoretically justified training improvements in both simulated and real robot deployments (Liao et al., 28 Jan 2026).
1. Formal Structure of the Growing Policy Optimization Framework
GPO operates by transforming the latent outputs of the policy via a parameterized action mapping. If the raw policy generates actions $a_t \sim \pi_\theta(\cdot \mid s_t)$, GPO introduces a monotonic schedule for the action-range parameter $\beta_t$, typically implemented as a Gompertz curve, $\beta_t = \beta_\infty \exp(-b\, e^{-c t})$, with parameters $b$ and $c$ that control plateau duration and growth rate.
The core transformation is given by
$$\tilde a_t = \beta_t \tanh\!\left(\frac{a_t}{\beta_t}\right).$$
For small $\beta_t$, actions are smoothly confined to $(-\beta_t, \beta_t)$; as training continues and $\beta_t \to \beta_\infty$, the transformation recovers the conventional action clipping typically applied in RL.
This growing range constrains actions early (improving stability and data collection efficiency) and allows increasingly aggressive exploration as policies mature.
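The schedule and transformation can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the constants $b = 4$, $c = 0.01$, and $\beta_\infty = 1$ are hypothetical values chosen only to show the plateau-then-growth behavior:

```python
import math

def gompertz_beta(t, beta_inf=1.0, b=4.0, c=0.01):
    """Gompertz schedule for the action-range parameter β_t.

    b sets the initial plateau (β_0 = β_inf · e^{-b}) and c the growth
    rate; both values here are illustrative, not from the paper.
    """
    return beta_inf * math.exp(-b * math.exp(-c * t))

def gpo_transform(a, beta):
    """Smooth, differentiable squashing: ã = β · tanh(a / β)."""
    return beta * math.tanh(a / beta)

# Early in training β_t is small, so actions are softly confined to
# (−β_t, β_t); late in training β_t → β_inf and the map approaches the
# identity on in-range actions, i.e. conventional clipping.
beta_early = gompertz_beta(0)       # e^{-4} ≈ 0.018
beta_late = gompertz_beta(1_000)    # ≈ 1.0 (schedule has saturated)
```

Note that the map is the identity to first order for $|a| \ll \beta$, so mature policies are not distorted in their usual operating range.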
2. Policy Objective Preservation and Update Mechanisms
GPO modifies only the actions observed by the environment. When integrated with the Proximal Policy Optimization (PPO) algorithm, the clipped surrogate loss is unchanged up to a variable substitution: the Jacobian of the transformation cancels in the likelihood ratio $r_t(\theta)=\pi_\theta(a_t\mid s_t)/\pi_{\theta_{\text{old}}}(a_t\mid s_t)$, ensuring the PPO loss
$$L^{\text{CLIP}}(\theta)=\mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\hat A_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat A_t\big)\right]$$
retains its standard form and properties.
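The cancellation can be checked numerically. Under the change of variables $\tilde a = \beta\tanh(a/\beta)$, the density of the transformed action under each policy picks up the same $\theta$-independent Jacobian factor, which divides out of the ratio. A minimal sketch with Gaussian policies; all numbers are illustrative:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of 𝒩(mu, sigma²) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def jacobian(a, beta):
    """d/da [β·tanh(a/β)] = 1 / cosh²(a/β)."""
    return 1.0 / math.cosh(a / beta) ** 2

a, beta = 0.7, 0.5                   # raw action and current range (illustrative)
mu_new, mu_old, sigma = 0.3, 0.1, 1.0

# Density of the transformed action under each policy (change of variables):
# p̃(ã) = p(a) / |g'(a)|, and the Jacobian is identical for both policies.
p_new = normal_pdf(a, mu_new, sigma) / jacobian(a, beta)
p_old = normal_pdf(a, mu_old, sigma) / jacobian(a, beta)

ratio_transformed = p_new / p_old
ratio_raw = normal_pdf(a, mu_new, sigma) / normal_pdf(a, mu_old, sigma)
# ratio_transformed equals ratio_raw: the Jacobian cancels.
```

This is why the PPO update can be computed on the raw actions $a_t$ even though the environment executes $\tilde a_t$.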
3. Algorithmic Implementation and Typical Hyperparameters
The GPO procedure integrates directly into PPO workflows. The update loop incorporates time-dependent calculation of and applies the transformation before policy execution. A representative pseudocode is:
```
initialize θ ← θ₀
for epoch = 1 … N_epochs:
    for t = 1 … T_steps_per_epoch:
        compute β_t from schedule
        sample a_t ∼ 𝒩(μ_θ(s_t), σ²)
        compute transformed action ã_t ← β_t · tanh(a_t / β_t)
        execute ã_t in env → s_{t+1}, r_t
        store (s_t, a_t, ã_t, r_t)
    compute advantages Â_t
    for update = 1 … K:
        sample minibatch of size B
        compute r(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)
        compute L^CLIP(θ) and ∇_θ L^CLIP
        θ ← θ + η ∇_θ L^CLIP
```
Key parameter choices include the learning rate $\eta$, the PPO clip parameter $\epsilon$, and the Gompertz scheduling parameters $\beta_\infty$, $b$, and $c$.
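The loop above can be exercised end-to-end on a toy one-dimensional problem. This is a minimal sketch rather than the paper's implementation: PPO clipping and the value network are omitted, a plain score-function (REINFORCE) update with a batch-mean baseline stands in for the PPO step, and all hyperparameters (target, σ, η, schedule constants) are illustrative:

```python
import math
import random

random.seed(0)

def beta_sched(t, beta_inf=1.0, b=4.0, c=0.05):
    # Gompertz schedule: small plateau early, growth to beta_inf later.
    return beta_inf * math.exp(-b * math.exp(-c * t))

def squash(a, beta):
    return beta * math.tanh(a / beta)

mu, sigma, eta = 0.0, 0.3, 0.05   # Gaussian policy mean, fixed std, step size
target, batch = 0.5, 64           # reward is -(ã - target)²

def avg_reward(mu, beta, n=256):
    return sum(-(squash(random.gauss(mu, sigma), beta) - target) ** 2
               for _ in range(n)) / n

r_start = avg_reward(mu, beta_sched(0))
for t in range(300):
    beta = beta_sched(t)
    acts = [random.gauss(mu, sigma) for _ in range(batch)]
    rews = [-(squash(a, beta) - target) ** 2 for a in acts]
    baseline = sum(rews) / batch
    # Score-function gradient: ∇_μ log 𝒩(a; μ, σ²) = (a − μ) / σ².
    grad = sum((r - baseline) * (a - mu) / sigma ** 2
               for a, r in zip(acts, rews)) / batch
    mu += eta * grad
r_end = avg_reward(mu, beta_sched(300))
```

With the seed above, the mean moves toward the target and the average reward improves substantially between the first and last evaluation.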
4. Theoretical Properties and Convergence Analysis
The GPO framework offers formal guarantees concerning gradient distortion, variance, and asymptotic performance.
Proposition 1 (Gradient-Distortion Bound): under mild regularity conditions, the difference between the score-function gradient computed through the transformation and the untransformed gradient is bounded, and the bound vanishes as $\beta_t \to \beta_\infty$.
Corollary 1 (Gradient Variance and SNR): for bounded advantages with $\mathbb{E}[A_t^2]=\sigma_A^2$, constraining the early action range reduces the variance of the policy-gradient estimator, yielding an improved signal-to-noise ratio during early training.
Theorem 1 (Early-Stage Convergence): under local $\mu$-strong convexity and $L$-smoothness,
$$\mathbb{E}\|\theta_t - \theta^*\|^2 \le (1-\eta\mu)^t\|\theta_0 - \theta^*\|^2 + \frac{\eta}{\mu}c\,\beta_t^2,$$
which implies that a small $\beta_t$ keeps the iterates close to the optimal parameters during early optimization.
Theorem 2 (No-Worse Asymptotics): In the quadratic regime,
$$\limsup_{t\to\infty}\mathbb{E}\|\theta_t-\theta^*\|^2 \le \frac{\eta^2 c}{1-\rho}\,\beta_\infty^2$$
with contraction factor $\rho = 1-\eta\mu \in (0,1)$, hence
$$\liminf_{t\to\infty}\mathbb{E}[J_{\rm GPO}]\ge\liminf_{t\to\infty}\mathbb{E}[J_{\rm baseline}].$$
Hence, in the local regime, GPO policies are guaranteed to match or surpass the asymptotic performance of baseline PPO.
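The geometric-series step behind Theorem 2 can be made explicit. This sketch assumes a per-step error recursion of the form $e_{t+1} \le \rho\, e_t + \eta^2 c\, \beta_t^2$, which is an assumption consistent with the stated bounds rather than the paper's exact derivation. Unrolling and using $\beta_t \le \beta_\infty$:

$$e_t \;\le\; \rho^t e_0 + \eta^2 c \sum_{k=0}^{t-1} \rho^{k}\,\beta_{t-1-k}^2 \;\le\; \rho^t e_0 + \frac{\eta^2 c}{1-\rho}\,\beta_\infty^2,$$

and since $\rho \in (0,1)$, the first term vanishes as $t \to \infty$, leaving exactly the asymptotic bound of Theorem 2.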
5. Practical Guidance for Real-World Implementation
When deploying GPO on torque-controlled robots, it is essential to preserve the differentiable mapping throughout training and hardware transfer, rather than reverting to hard clipping, to maintain gradient propagation and stability. Initial ranges should be set at 10–20% of the system's hardware torque limits to facilitate safe exploration of contact-rich scenarios; these are increased progressively to enable higher-magnitude behaviors. For controllers operating in position space, the transformation applies identically to velocity or position increment commands.
Transitioning to real hardware is most robust when maintaining the same action-mapping regime used during simulation, avoiding abrupt discontinuities in the policy's output transformation.
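The 10–20% guidance pins down one of the schedule parameters directly: evaluating the Gompertz curve at $t=0$ gives $\beta_0 = \beta_\infty e^{-b}$, so $b$ can be solved from the desired initial fraction of the torque limit. A sketch with a hypothetical torque limit (the 30 N·m value and the growth rate $c$ are illustrative, not from the paper):

```python
import math

TAU_MAX = 30.0          # hypothetical hardware torque limit [N·m]
INIT_FRACTION = 0.15    # start at 15% of the limit, per the 10–20% guidance

# Gompertz at t = 0 gives β_0 = β_inf · e^{-b}, so choose b to hit the
# desired initial fraction of the torque limit.
beta_inf = TAU_MAX
b = math.log(beta_inf / (INIT_FRACTION * TAU_MAX))   # = ln(1/0.15) ≈ 1.90

def beta(t, c=0.01):
    """Action-range schedule in physical torque units [N·m]."""
    return beta_inf * math.exp(-b * math.exp(-c * t))
```

The same construction applies unchanged when $\beta$ is expressed in velocity or position-increment units for position-space controllers.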
6. Empirical Results and Performance Benchmarks
Experimental evaluation of GPO encompasses quadruped and hexapod locomotion tasks, testing whole-body control (6 DoF base, 12 or 18 joint torques) and sim-to-real transfer.
Simulation tracking error (mean ± standard deviation):
| Metric | GPO | PPO |
|---|---|---|
| Quadruped | 0.015 ± 0.035 m/s | –0.30 ± 0.10 m/s |
| Quadruped | 0.00 ± 0.05 m/s | –0.10 ± 0.05 m/s |
| Body Height | 0.005 ± 0.0025 m | –0.03 ± 0.02 m |
Hexapod velocity errors are similarly reduced by ≈0.1 m/s under GPO.
Policies trained with GPO exhibit regular, periodic joint torques and velocities, in contrast to baseline PPO which tends toward irregular outputs and suboptimal leg utilization.
Zero-Shot Hardware Robustness (Success Rate, 10 trials):
| Task | GPO | PPO |
|---|---|---|
| Side push | 100% | 0% |
| Challenging terrain | 100% | 20% |
| Vertical stomp | 100% | 40% |
| Torque-based DeCAP | 0–60% | — |
| Position-based DeCAP | 40–80% | — |
This suggests that GPO substantially improves sim-to-real robustness, enabling consistent success in perturbation-rich hardware evaluations where PPO-trained agents typically fail.
7. Interpretations and Implications
The GPO framework provides a general and environment-agnostic solution to the trade-off between exploration and gradient estimation in RL-based legged locomotion, particularly for torque-level control. Its theoretical guarantees ensure stable policy updates, faster convergence in early training, and equal or better long-term performance relative to baseline PPO. A plausible implication is that GPO's approach can generalize to other high-dimensional continuous control domains where exploration and stability are challenged by complex action spaces and hardware constraints.