
Growing Policy Optimization (GPO)

Updated 29 January 2026
  • Growing Policy Optimization is a reinforcement learning framework that incrementally expands the effective action space to stabilize exploration in high-dimensional, torque-based control tasks.
  • It employs a time-varying, differentiable transformation using a Gompertz curve to ensure reliable gradient estimation while preserving standard PPO loss objectives.
  • Empirical evaluations show that GPO significantly improves sim-to-real robustness and control precision in quadruped and hexapod locomotion compared to baseline PPO.

Growing Policy Optimization (GPO) is a reinforcement learning (RL) training framework designed specifically for high-dimensional, continuous control problems in legged robotics. It addresses fundamental limitations in policy optimization for torque-based control, where exploration of the action space and reliable gradient estimation are particularly challenging. GPO leverages a time-varying, differentiable transformation of policy actions, initially restricting the effective action space and gradually expanding it as training progresses. This mechanism yields environments with more informative gradients and stabilized exploration, facilitating both faster convergence and improved final performance. The framework is shown to preserve standard policy optimization objectives and introduces only bounded, diminishing distortion to gradient estimates, supporting stable and theoretically justified training improvements in both simulated and real robot deployments (Liao et al., 28 Jan 2026).

1. Formal Structure of the Growing Policy Optimization Framework

GPO operates by transforming the latent outputs of the policy via a parameterized action mapping. If the raw policy samples actions a \sim \mathcal{N}(\mu_\theta(s), \sigma^2), GPO introduces a monotonic schedule for the action-range parameter \beta_t = \beta_{\max} f(t), where f(t) is typically implemented as a Gompertz curve, f(t) = \exp(-e^{-k(t - t_0)}), with parameters k and t_0 that control the plateau duration and growth rate.

The core transformation is given by

T_t(a) = \tilde{a} = \beta_t \tanh\left(\frac{a}{\beta_t}\right), \quad \tilde{a} \in [-\beta_t, \beta_t].

For |a| \ll \beta_t, T_t(a) \approx a; as training continues and \beta_t \to \beta_{\max}, the transformation recovers the conventional action clipping typically applied in RL.

This growing range constrains actions early (improving stability and data collection efficiency) and allows increasingly aggressive exploration as policies mature.
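As a concrete sketch (not the authors' code; the range values below are illustrative, not taken from the paper), the transformation and its near-identity behavior for small actions can be demonstrated in a few lines of Python:

```python
import math

def gpo_transform(a, beta_t):
    """GPO action mapping T_t(a) = beta_t * tanh(a / beta_t).

    Near-identity for |a| << beta_t; saturates smoothly at +/- beta_t."""
    return beta_t * math.tanh(a / beta_t)

beta_early, beta_late = 1.0, 10.0   # illustrative range values, not from the paper

print(gpo_transform(0.1, beta_early))   # ~0.0997: small actions pass almost unchanged
print(gpo_transform(50.0, beta_early))  # ~1.0: large actions saturate at beta_t
print(gpo_transform(0.1, beta_late))    # ~0.1: later in training the map is near-identity
```

Because the saturation is smooth rather than a hard clip, gradients with respect to the raw action never vanish abruptly at the range boundary.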

2. Policy Objective Preservation and Update Mechanisms

GPO modifies only the actions that the environment observes. When integrated with the Proximal Policy Optimization (PPO) algorithm, it leaves the clipped surrogate loss unchanged up to a variable substitution: r_t(\theta) = \frac{\pi_\theta(\tilde{a}_t \mid s_t)}{\pi_{\theta_{\rm old}}(\tilde{a}_t \mid s_t)} = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\rm old}}(a_t \mid s_t)}. The Jacobian factor \left|\frac{da}{d\tilde{a}}\right| cancels in the likelihood ratio, ensuring that the PPO loss

L^{\rm CLIP}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\Big]

retains its standard form and properties.
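The cancellation can be made explicit via the change-of-variables formula for densities. Writing \pi^{T}_\theta for the density of the transformed action (notation introduced here for illustration), and noting that T_t is the same deterministic, invertible map for both the current and old policies:

```latex
\pi^{T}_\theta(\tilde{a}\mid s)
  = \pi_\theta(a\mid s)\left|\frac{da}{d\tilde{a}}\right|
\quad\Longrightarrow\quad
r_t(\theta)
  = \frac{\pi_\theta(a_t\mid s_t)\left|\frac{da}{d\tilde{a}}\right|}
         {\pi_{\theta_{\rm old}}(a_t\mid s_t)\left|\frac{da}{d\tilde{a}}\right|}
  = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\rm old}}(a_t\mid s_t)}.
```

Since the Jacobian factor does not depend on \theta, it cancels identically in the ratio.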

3. Algorithmic Implementation and Typical Hyperparameters

The GPO procedure integrates directly into PPO workflows. The update loop incorporates a time-dependent calculation of \beta_t and applies the T_t(\cdot) transformation before policy execution. Representative pseudocode:

initialize policy parameters θ
for epoch = 1 … N_epochs:
    for t = 1 … Timesteps_per_epoch:
        compute β_t from schedule
        sample a_t ~ 𝒩(μ_θ(s_t), σ²)
        compute transformed action ã_t ← β_t · tanh(a_t / β_t)
        execute ã_t in env → s_{t+1}, r_t
        store (s_t, a_t, ã_t, r_t)
    end
    compute advantages Â_t
    for update = 1 … K:
        sample minibatch of size B
        compute r(θ) = π_θ(ã) / π_{θ_old}(ã)
        compute L^CLIP(θ) and ∇_θ L
        θ ← θ + η ∇_θ L
    end
end

Key parameter choices include learning rate \eta \in [3\times10^{-4}, 1\times10^{-3}], PPO clip \epsilon = 0.2, and Gompertz scheduling with k \approx 3\times10^{-5}, t_0 \approx 2.4\times10^{4}, and \beta_{\min}/\beta_{\max} \approx 0.1.
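As a quick sanity check of the schedule (a sketch; the time unit of t is assumed to be training steps), evaluating the Gompertz curve with the quoted hyperparameters shows that the initial range fraction f(0) \approx 0.13 is consistent with the reported \beta_{\min}/\beta_{\max} \approx 0.1, and that the schedule saturates toward 1 late in training:

```python
import math

def gompertz(t, k=3e-5, t0=2.4e4):
    """f(t) = exp(-exp(-k (t - t0))): slow initial plateau, then growth toward 1."""
    return math.exp(-math.exp(-k * (t - t0)))

# Initial plateau: effective range starts at roughly 13% of beta_max,
# in line with the reported beta_min / beta_max ratio of ~0.1.
print(round(gompertz(0), 3))

# Late in training the schedule saturates, so beta_t -> beta_max.
print(round(gompertz(1_000_000), 6))
```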

4. Theoretical Properties and Convergence Analysis

The GPO framework offers formal guarantees concerning gradient distortion, variance, and asymptotic performance.

Proposition 1 (Gradient-Distortion Bound): If |a/\beta_t| \le 0.5, the difference in score-function gradients is controlled: \|\nabla_\theta \log \pi_\theta(\tilde{a}) - \nabla_\theta \log \pi_\theta(a)\| \le C\,\|\nabla_\theta \mu_\theta(s)\|, with C = \frac{\sinh(1) - 1}{2\sigma^2}\,|\beta_{\max} - \beta_t| \to 0 as \beta_t \to \beta_{\max}.

Corollary 1 (Gradient Variance and SNR): For g_t = \nabla_\theta \log \pi_\theta(\tilde{a}_t) A_t, \mathbb{E}[A_t^2] = \sigma_A^2, and \|\nabla_\theta \mu\| \le K_0: \operatorname{Var}[g_t] \le c\,\beta_t^2, \quad c = \sigma_A^2 \left(\frac{2 K_0}{\sigma^2}\right)^2, yielding a signal-to-noise ratio \operatorname{SNR} \propto \frac{1}{\beta_t}.
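The SNR scaling follows in one step from the variance bound, under the assumption (implicit in the corollary) that the mean gradient magnitude stays of constant order during the early phase:

```latex
\operatorname{SNR}
  = \frac{\bigl\|\mathbb{E}[g_t]\bigr\|}{\sqrt{\operatorname{Var}[g_t]}}
  \;\ge\; \frac{\bigl\|\mathbb{E}[g_t]\bigr\|}{\sqrt{c}\,\beta_t}
  \;\propto\; \frac{1}{\beta_t}.
```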

Theorem 1 (Early-Stage Convergence): Under local \mu-strong convexity and L-smoothness,

\mathbb{E}\|\theta_t - \theta^*\|^2 \le (1 - \eta\mu)^t \|\theta_0 - \theta^*\|^2 + \frac{\eta}{\mu}\,c\,\beta_t^2

which implies that small \beta_t fosters tight proximity to the optimal parameters \theta^* during early optimization.
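Theorem 1's contraction-plus-noise structure can be checked numerically on a toy problem (a sketch, not the paper's experiment): noisy SGD on a 1-D quadratic, where the gradient-noise variance is set to c\,\beta^2 as in the corollary above. Smaller \beta leaves the iterates in a tighter neighborhood of \theta^* = 0, and the residual respects the (\eta/\mu)\,c\,\beta^2 noise term:

```python
import math
import random

def simulate(beta, eta=0.1, mu=1.0, c=1.0, steps=2000, runs=500, theta0=1.0):
    """Noisy SGD on f(theta) = mu * theta^2 / 2 with gradient-noise
    variance c * beta^2, mirroring the setting of Theorem 1.
    Returns the empirical E||theta_t - theta*||^2 with theta* = 0."""
    rng = random.Random(0)
    total = 0.0
    for _ in range(runs):
        theta = theta0
        for _ in range(steps):
            noisy_grad = mu * theta + rng.gauss(0.0, math.sqrt(c) * beta)
            theta -= eta * noisy_grad
        total += theta * theta
    return total / runs

small = simulate(beta=0.1)   # early-training regime: tight action range
large = simulate(beta=1.0)   # late-training regime: full action range
bound_small = (0.1 / 1.0) * 1.0 * 0.1**2   # (eta/mu) * c * beta^2

# Smaller beta keeps iterates closer to theta*, and the empirical
# squared distance stays below the theorem's noise floor.
print(small < large, small <= bound_small)
```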

Theorem 2 (No-Worse Asymptotics): In the quadratic regime,

\limsup_{t\to\infty} \mathbb{E}\|\theta_t - \theta^*\|^2 \le \frac{\eta^2 c}{1 - \rho}\,\beta_\infty^2

with contraction factor \rho < 1 and \beta_\infty \le \beta_{\max}, hence

\liminf_{t\to\infty} \mathbb{E}[J_{\rm GPO}] \ge \liminf_{t\to\infty} \mathbb{E}[J_{\rm baseline}].

Thus, in the local regime, GPO policies are guaranteed to match or surpass the asymptotic performance of baseline PPO.

5. Practical Guidance for Real-World Implementation

When deploying GPO on torque-controlled robots, it is essential to preserve the differentiable \tanh mapping throughout training and hardware transfer, rather than reverting to hard clipping, so as to maintain gradient propagation and stability. The initial range \beta_t should be set at 10–20% of the system's hardware torque limits to facilitate safe exploration of contact-rich scenarios; it is then increased progressively to enable higher-magnitude behaviors. For controllers operating in position space, the transformation applies identically to velocity or position-increment commands.

Transitioning to real hardware is most robust when maintaining the same action-mapping regime used during simulation, avoiding abrupt discontinuities in the policy's output transformation.
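A minimal deployment-side sketch of this guidance (the constant names and the specific torque limit are hypothetical, chosen only to illustrate the 10–20% starting fraction) keeps the same smooth mapping on hardware:

```python
import math

TORQUE_LIMIT = 30.0            # hypothetical hardware torque limit in N*m
BETA_MAX = TORQUE_LIMIT        # final action range matched to the hardware limit
BETA_START = 0.15 * BETA_MAX   # begin at 10-20% of the limit, per the guidance

def action_to_torque(raw_action, beta_t):
    """Keep the same differentiable tanh mapping used in training; never hard-clip."""
    return beta_t * math.tanh(raw_action / beta_t)

# Even an out-of-distribution raw action stays inside the current soft range,
# which in turn never exceeds the hardware limit.
tau = action_to_torque(1e6, BETA_START)
print(abs(tau) <= TORQUE_LIMIT)   # True
```

Because the deployed mapping is identical to the training-time one, the policy's output transformation has no discontinuity at the sim-to-real boundary.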

6. Empirical Results and Performance Benchmarks

Experimental evaluation of GPO encompasses quadruped and hexapod locomotion tasks, testing whole-body control (6 DoF base, 12 or 18 joint torques) and sim-to-real transfer.

Simulation tracking error (mean ± std):

Metric           GPO                 PPO
Quadruped v_x    0.015 ± 0.035 m/s   –0.30 ± 0.10 m/s
Quadruped v_y    0.00 ± 0.05 m/s     –0.10 ± 0.05 m/s
Body height      0.005 ± 0.0025 m    –0.03 ± 0.02 m

Hexapod velocity errors are similarly reduced by ≈0.1 m/s under GPO.

Policies trained with GPO exhibit regular, periodic joint torques and velocities, in contrast to baseline PPO which tends toward irregular outputs and suboptimal leg utilization.

Zero-shot hardware robustness (success rate, 10 trials):

Task                  GPO    PPO
Side push             100%   0%
Challenging terrain   100%   20%
Vertical stomp        100%   40%

Torque-based DeCAP: 0–60%; position-based DeCAP: 40–80%.

This suggests that GPO substantially improves sim-to-real robustness, enabling consistent success in perturbation-rich hardware evaluations where PPO-trained agents typically fail.

7. Interpretations and Implications

The GPO framework provides a general, environment-agnostic solution to the trade-off between exploration and gradient estimation in RL-based legged locomotion, particularly for torque-level control. Its theoretical guarantees support stable policy updates, faster convergence in early training, and equal or better long-term performance relative to baseline PPO. A plausible implication is that GPO's approach can generalize to other high-dimensional continuous control domains where exploration and stability are challenged by complex action spaces and hardware constraints.
