
Growing Policy Optimization (GPO)

Updated 29 January 2026
  • Growing Policy Optimization is a reinforcement learning framework that incrementally expands the effective action space to stabilize exploration in high-dimensional, torque-based control tasks.
  • It employs a time-varying, differentiable transformation using a Gompertz curve to ensure reliable gradient estimation while preserving standard PPO loss objectives.
  • Empirical evaluations show that GPO significantly improves sim-to-real robustness and control precision in quadruped and hexapod locomotion compared to baseline PPO.

Growing Policy Optimization (GPO) is a reinforcement learning (RL) training framework designed specifically for high-dimensional, continuous control problems in legged robotics. It addresses fundamental limitations in policy optimization for torque-based control, where exploration of the action space and reliable gradient estimation are particularly challenging. GPO leverages a time-varying, differentiable transformation of policy actions, initially restricting the effective action space and gradually expanding it as training progresses. This mechanism yields environments with more informative gradients and stabilized exploration, facilitating both faster convergence and improved final performance. The framework is shown to preserve standard policy optimization objectives and introduces only bounded, diminishing distortion to gradient estimates, supporting stable and theoretically justified training improvements in both simulated and real robot deployments (Liao et al., 28 Jan 2026).

1. Formal Structure of the Growing Policy Optimization Framework

GPO operates by transforming the latent outputs of the policy via a parameterized action mapping. If the raw policy samples actions a \sim \mathcal{N}(\mu_\theta(s), \sigma^2), GPO introduces a monotonic schedule for the action-range parameter \beta_t = \beta_{\max} f(t), where f(t) is typically implemented as a Gompertz curve, f(t) = \exp(-e^{-k(t - t_0)}), with parameters k and t_0 that control the plateau duration and growth rate.

The core transformation is given by

T_t(a) = \tilde{a} = \beta_t \tanh\left(\frac{a}{\beta_t}\right), \quad \tilde{a} \in [-\beta_t, \beta_t].

For |a| \ll \beta_t, T_t(a) \approx a; as training continues and \beta_t \to \beta_{\max}, the transformation recovers the conventional action clipping typically applied in RL.

This growing range constrains actions early (improving stability and data collection efficiency) and allows increasingly aggressive exploration as policies mature.
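As a concrete sketch (not the authors' code; the range values below are illustrative, not taken from the paper), the transformation and its near-identity behavior for small actions can be demonstrated in a few lines of Python:

```python
import math

def gpo_transform(a, beta_t):
    """GPO action mapping T_t(a) = beta_t * tanh(a / beta_t).

    Near-identity for |a| << beta_t; saturates smoothly at +/- beta_t."""
    return beta_t * math.tanh(a / beta_t)

beta_early, beta_late = 1.0, 10.0   # illustrative range values, not from the paper

print(gpo_transform(0.1, beta_early))   # ~0.0997: small actions pass almost unchanged
print(gpo_transform(50.0, beta_early))  # ~1.0: large actions saturate at beta_t
print(gpo_transform(0.1, beta_late))    # ~0.1: later in training the map is near-identity
```

Because the saturation is smooth rather than a hard clip, gradients with respect to the raw action never vanish abruptly at the range boundary.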

2. Policy Objective Preservation and Update Mechanisms

GPO modifies only the actions that the environment observes. When integrated with the Proximal Policy Optimization (PPO) algorithm, it leaves the clipped surrogate loss unchanged up to a variable substitution: r_t(\theta) = \frac{\pi_\theta(\tilde{a}_t \mid s_t)}{\pi_{\theta_{\rm old}}(\tilde{a}_t \mid s_t)} = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\rm old}}(a_t \mid s_t)}. The Jacobian factor \left|\frac{da}{d\tilde{a}}\right| cancels in the likelihood ratio, ensuring that the PPO loss

L^{\rm CLIP}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\Big]

retains its standard form and properties.
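The cancellation can be made explicit via the change-of-variables formula for densities. Writing \pi^{T}_\theta for the density of the transformed action (notation introduced here for illustration), and noting that T_t is the same deterministic, invertible map for both the current and old policies:

```latex
\pi^{T}_\theta(\tilde{a}\mid s)
  = \pi_\theta(a\mid s)\left|\frac{da}{d\tilde{a}}\right|
\quad\Longrightarrow\quad
r_t(\theta)
  = \frac{\pi_\theta(a_t\mid s_t)\left|\frac{da}{d\tilde{a}}\right|}
         {\pi_{\theta_{\rm old}}(a_t\mid s_t)\left|\frac{da}{d\tilde{a}}\right|}
  = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\rm old}}(a_t\mid s_t)}.
```

Since the Jacobian factor does not depend on \theta, it cancels identically in the ratio.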

3. Algorithmic Implementation and Typical Hyperparameters

The GPO procedure integrates directly into PPO workflows. The update loop incorporates a time-dependent calculation of \beta_t and applies the T_t(\cdot) transformation before policy execution. Representative pseudocode:

initialize policy parameters θ
for epoch = 1 … N_epochs:
    for t = 1 … Timesteps_per_epoch:
        compute β_t from schedule
        sample a_t ~ 𝒩(μ_θ(s_t), σ²)
        compute transformed action ã_t ← β_t · tanh(a_t / β_t)
        execute ã_t in env → s_{t+1}, r_t
        store (s_t, a_t, ã_t, r_t)
    end
    compute advantages Â_t
    for update = 1 … K:
        sample minibatch of size B
        compute r(θ) = π_θ(ã) / π_{θ_old}(ã)
        compute L^CLIP(θ) and ∇_θ L
        θ ← θ + η ∇_θ L
    end
end

Key parameter choices include learning rate \eta \in [3\times10^{-4}, 1\times10^{-3}], PPO clip \epsilon = 0.2, and Gompertz scheduling with k \approx 3\times10^{-5}, t_0 \approx 2.4\times10^{4}, and \beta_{\min}/\beta_{\max} \approx 0.1.
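As a quick sanity check of the schedule (a sketch; the time unit of t is assumed to be training steps), evaluating the Gompertz curve with the quoted hyperparameters shows that the initial range fraction f(0) \approx 0.13 is consistent with the reported \beta_{\min}/\beta_{\max} \approx 0.1, and that the schedule saturates toward 1 late in training:

```python
import math

def gompertz(t, k=3e-5, t0=2.4e4):
    """f(t) = exp(-exp(-k (t - t0))): slow initial plateau, then growth toward 1."""
    return math.exp(-math.exp(-k * (t - t0)))

# Initial plateau: effective range starts at roughly 13% of beta_max,
# in line with the reported beta_min / beta_max ratio of ~0.1.
print(round(gompertz(0), 3))

# Late in training the schedule saturates, so beta_t -> beta_max.
print(round(gompertz(1_000_000), 6))
```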

4. Theoretical Properties and Convergence Analysis

The GPO framework offers formal guarantees concerning gradient distortion, variance, and asymptotic performance.

Proposition 1 (Gradient-Distortion Bound): If |a/\beta_t| \le 0.5, the difference in score-function gradients is controlled: \|\nabla_\theta \log \pi_\theta(\tilde{a}) - \nabla_\theta \log \pi_\theta(a)\| \le C\,\|\nabla_\theta \mu_\theta(s)\|, with C = \frac{\sinh(1) - 1}{2\sigma^2}\,|\beta_{\max} - \beta_t| \to 0 as \beta_t \to \beta_{\max}.

Corollary 1 (Gradient Variance and SNR): For g_t = \nabla_\theta \log \pi_\theta(\tilde{a}_t) A_t, \mathbb{E}[A_t^2] = \sigma_A^2, and \|\nabla_\theta \mu\| \le K_0: \operatorname{Var}[g_t] \le c\,\beta_t^2, \quad c = \sigma_A^2 \left(\frac{2 K_0}{\sigma^2}\right)^2, yielding a signal-to-noise ratio \operatorname{SNR} \propto \frac{1}{\beta_t}.
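The SNR scaling follows in one step from the variance bound, under the assumption (implicit in the corollary) that the mean gradient magnitude stays of constant order during the early phase:

```latex
\operatorname{SNR}
  = \frac{\bigl\|\mathbb{E}[g_t]\bigr\|}{\sqrt{\operatorname{Var}[g_t]}}
  \;\ge\; \frac{\bigl\|\mathbb{E}[g_t]\bigr\|}{\sqrt{c}\,\beta_t}
  \;\propto\; \frac{1}{\beta_t}.
```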

Theorem 1 (Early-Stage Convergence): Under local \mu-strong convexity and L-smoothness,

\mathbb{E}\|\theta_t - \theta^*\|^2 \le (1 - \eta\mu)^t \|\theta_0 - \theta^*\|^2 + \frac{\eta}{\mu}\,c\,\beta_t^2

which implies that small \beta_t fosters tight proximity to the optimal parameters \theta^* during early optimization.
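Theorem 1's contraction-plus-noise structure can be checked numerically on a toy problem (a sketch, not the paper's experiment): noisy SGD on a 1-D quadratic, where the gradient-noise variance is set to c\,\beta^2 as in the corollary above. Smaller \beta leaves the iterates in a tighter neighborhood of \theta^* = 0, and the residual respects the (\eta/\mu)\,c\,\beta^2 noise term:

```python
import math
import random

def simulate(beta, eta=0.1, mu=1.0, c=1.0, steps=2000, runs=500, theta0=1.0):
    """Noisy SGD on f(theta) = mu * theta^2 / 2 with gradient-noise
    variance c * beta^2, mirroring the setting of Theorem 1.
    Returns the empirical E||theta_t - theta*||^2 with theta* = 0."""
    rng = random.Random(0)
    total = 0.0
    for _ in range(runs):
        theta = theta0
        for _ in range(steps):
            noisy_grad = mu * theta + rng.gauss(0.0, math.sqrt(c) * beta)
            theta -= eta * noisy_grad
        total += theta * theta
    return total / runs

small = simulate(beta=0.1)   # early-training regime: tight action range
large = simulate(beta=1.0)   # late-training regime: full action range
bound_small = (0.1 / 1.0) * 1.0 * 0.1**2   # (eta/mu) * c * beta^2

# Smaller beta keeps iterates closer to theta*, and the empirical
# squared distance stays below the theorem's noise floor.
print(small < large, small <= bound_small)
```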

Theorem 2 (No-Worse Asymptotics): In the quadratic regime,

\limsup_{t\to\infty} \mathbb{E}\|\theta_t - \theta^*\|^2 \le \frac{\eta^2 c}{1 - \rho}\,\beta_\infty^2

with contraction factor \rho < 1 and \beta_\infty \le \beta_{\max}, hence

\liminf_{t\to\infty} \mathbb{E}[J_{\rm GPO}] \ge \liminf_{t\to\infty} \mathbb{E}[J_{\rm baseline}].

Thus, in the local regime, GPO policies are guaranteed to match or surpass the asymptotic performance of baseline PPO.

5. Practical Guidance for Real-World Implementation

When deploying GPO on torque-controlled robots, it is essential to preserve the differentiable \tanh mapping throughout training and hardware transfer, rather than reverting to hard clipping, so as to maintain gradient propagation and stability. The initial range \beta_t should be set at 10–20% of the system's hardware torque limits to facilitate safe exploration of contact-rich scenarios; it is then increased progressively to enable higher-magnitude behaviors. For controllers operating in position space, the transformation applies identically to velocity or position-increment commands.

Transitioning to real hardware is most robust when maintaining the same action-mapping regime used during simulation, avoiding abrupt discontinuities in the policy's output transformation.
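A minimal deployment-side sketch of this guidance (the constant names and the specific torque limit are hypothetical, chosen only to illustrate the 10–20% starting fraction) keeps the same smooth mapping on hardware:

```python
import math

TORQUE_LIMIT = 30.0            # hypothetical hardware torque limit in N*m
BETA_MAX = TORQUE_LIMIT        # final action range matched to the hardware limit
BETA_START = 0.15 * BETA_MAX   # begin at 10-20% of the limit, per the guidance

def action_to_torque(raw_action, beta_t):
    """Keep the same differentiable tanh mapping used in training; never hard-clip."""
    return beta_t * math.tanh(raw_action / beta_t)

# Even an out-of-distribution raw action stays inside the current soft range,
# which in turn never exceeds the hardware limit.
tau = action_to_torque(1e6, BETA_START)
print(abs(tau) <= TORQUE_LIMIT)   # True
```

Because the deployed mapping is identical to the training-time one, the policy's output transformation has no discontinuity at the sim-to-real boundary.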

6. Empirical Results and Performance Benchmarks

Experimental evaluation of GPO encompasses quadruped and hexapod locomotion tasks, testing whole-body control (6 DoF base, 12 or 18 joint torques) and sim-to-real transfer.

Simulation tracking error (mean ± std):

Metric           GPO                 PPO
Quadruped v_x    0.015 ± 0.035 m/s   –0.30 ± 0.10 m/s
Quadruped v_y    0.00 ± 0.05 m/s     –0.10 ± 0.05 m/s
Body height      0.005 ± 0.0025 m    –0.03 ± 0.02 m

Hexapod velocity errors are similarly reduced by ≈0.1 m/s under GPO.

Policies trained with GPO exhibit regular, periodic joint torques and velocities, in contrast to baseline PPO which tends toward irregular outputs and suboptimal leg utilization.

Zero-shot hardware robustness (success rate, 10 trials):

Task                  GPO    PPO
Side push             100%   0%
Challenging terrain   100%   20%
Vertical stomp        100%   40%

Torque-based DeCAP: 0–60%; position-based DeCAP: 40–80%.

This suggests that GPO substantially improves sim-to-real robustness, enabling consistent success in perturbation-rich hardware evaluations where PPO-trained agents typically fail.

7. Interpretations and Implications

The GPO framework provides a general, environment-agnostic solution to the trade-off between exploration and gradient estimation in RL-based legged locomotion, particularly for torque-level control. Its theoretical guarantees support stable policy updates, faster convergence in early training, and equal or better long-term performance relative to baseline PPO. A plausible implication is that GPO's approach can generalize to other high-dimensional continuous control domains where exploration and stability are challenged by complex action spaces and hardware constraints.
