
Terrain-Specific Reward Shaping for Locomotion

Updated 9 February 2026
  • Terrain-Specific Reward Shaping is a method that designs reinforcement learning rewards using local terrain data to adapt legged locomotion policies.
  • It employs foot-terrain and lifting-foot-height reward terms to penalize unsafe footholds and reduce excessive energy use, ensuring precise terrain negotiation.
  • The approach integrates deep neural network policies with parameterized trajectory generators, yielding robust performance across challenging terrains in both simulation and real-world tests.

Terrain-specific reward shaping refers to the explicit design of reinforcement learning reward functions that leverage local terrain information to drive policy adaptation for legged locomotion. By integrating exteroceptive state cues and terrain-relevant penalties or incentives, terrain-specific reward shaping enables legged robots to modulate their motion strategies dynamically, producing both safer and more energy-efficient behaviors when traversing challenging ground. In the context of quadrupedal locomotion, this approach has been demonstrated to produce policies capable of negotiating steps, gaps, and discrete footholds—capabilities generally associated with more complex model-based or hierarchical systems—using end-to-end learning guided by terrain-aware rewards (Shi et al., 2023).

1. Mathematical Formulation of Terrain-Specific Reward Components

Terrain-specific reward shaping for quadrupedal robots primarily employs two reward terms, each targeting a distinct aspect of safe and efficient locomotion:

1.1 Foot-Terrain Reward

This term penalizes unsafe foot placements by quantifying local terrain variation at contact points. For each foot i,

r_{\text{terrain},\,i} = \begin{cases} 0, & \text{if foot } i \text{ is in swing phase, or } \max_j z_{i,j} - \min_j z_{i,j} \le H_{\mathrm{thre}}, \\ -1, & \text{otherwise,} \end{cases}

where z_{i,j} = h_{\text{foot}_i} - h_{i,j} measures the signed height difference between the foot's planned placement and local terrain samples j around it, and H_{\mathrm{thre}} is a user-defined threshold for allowable local height variation. No penalty is given if (a) the foot is airborne or (b) the local terrain is sufficiently flat (\max_j z_{i,j} - \min_j z_{i,j} \le H_{\mathrm{thre}}). A penalty of -1 is applied when the foot contacts terrain with excessive variation, indicating an unsafe or unstable foothold.
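The case logic above is straightforward to implement. The sketch below is a minimal illustration, not the authors' code; the threshold value and sampling of local terrain heights are assumptions:

```python
import numpy as np

H_THRE = 0.03  # assumed threshold (m) for allowable local height variation

def foot_terrain_reward(foot_height, terrain_heights, in_swing):
    """Per-foot foot-terrain reward: 0 for airborne feet or safe contacts,
    -1 for contacts where local height variation exceeds H_THRE.

    foot_height     : planned foot placement height h_foot_i
    terrain_heights : local terrain samples h_{i,j} around the foot
    in_swing        : True if foot i is in swing phase
    """
    if in_swing:
        return 0.0
    z = foot_height - np.asarray(terrain_heights)  # signed differences z_{i,j}
    if z.max() - z.min() <= H_THRE:                # terrain locally flat enough
        return 0.0
    return -1.0
```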

1.2 Lifting-Foot-Height Reward

This term discourages excessive foot clearance, incentivizing minimal-energy trajectories. For foot i:

r_{\text{height},\,i} = -\max\bigl(H + \delta h - \Delta H - F_{\mathrm{thre}},\, 0\bigr)

where H is the trajectory generator's (TG) nominal foot-lift height; \delta h is the policy's residual height adjustment; \Delta H = \max_j h_{i,j} is the highest point of terrain beneath foot i; and F_{\mathrm{thre}} is a small clearance margin. The penalty grows as the foot is lifted further above the clearance needed to cross the current terrain, promoting energy efficiency.
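A minimal sketch of this term, assuming an illustrative clearance margin and a list of local terrain samples (neither value comes from the paper):

```python
def lifting_foot_height_reward(H, delta_h, terrain_heights, F_thre=0.02):
    """Penalize foot lift beyond the clearance needed for the local terrain.

    H               : trajectory generator's nominal foot-lift height
    delta_h         : policy's residual height adjustment
    terrain_heights : local terrain samples h_{i,j}; their max is delta_H
    F_thre          : small clearance margin (assumed value)
    """
    delta_H = max(terrain_heights)  # highest terrain point beneath the foot
    return -max(H + delta_h - delta_H - F_thre, 0.0)
```

Note that the penalty is zero whenever the commanded lift stays within F_thre of the terrain's highest point, so flat ground with a low lift incurs no cost.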

2. Integration with Additional Reward Terms

In terrain-aware reinforcement learning for locomotion, the total reward function typically combines terrain-specific terms with generic locomotion objectives:

  • r_v (“velocity-within-command”): Rewards progress along the commanded velocity, with a Gaussian fall-off outside a specified velocity range.
  • r_{vo} (“velocity-out-of-command”): Penalizes undesired velocity, defined as

r_{vo} = \exp\bigl(-1.5\,\|\mathbf v_t\|^2 - (\mathbf v_t^\top \mathbf c_t)^2\bigr)

where \mathbf v_t is the current velocity and \mathbf c_t the command direction.

  • r_\tau (“energy”): Negative mechanical power, given by

r_\tau = -\,\boldsymbol\tau_t^\top \mathbf q_{v,t}

where \boldsymbol\tau_t is the vector of joint torques and \mathbf q_{v,t} the vector of joint velocities.

  • r_{\mathrm{smooth}} (“smoothness”): A radial-basis penalty on joint-position jumps:

r_{\mathrm{smooth}} = \exp\bigl(-0.5\,\|\mathbf q_t - \mathbf q_{t-1}\|^2\bigr)

The complete reward thus incentivizes task completion, energy efficiency, kinematic smoothness, and—through the terrain-specific terms—locomotion safety with respect to local terrain geometry.
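Combining the terms is a weighted sum. The sketch below shows the structure only; the weight values and per-step term values are illustrative assumptions, not those of Shi et al. (2023):

```python
def total_reward(terms, weights):
    """Weighted sum of locomotion reward terms.

    terms   : dict mapping term name -> value computed this step
    weights : dict mapping term name -> scalar weight (assumed values)
    """
    return sum(weights[name] * value for name, value in terms.items())

# Example step: generic locomotion terms plus the two terrain-specific
# terms, each already computed per the formulas above (values illustrative).
terms = {
    "velocity": 0.9,       # r_v
    "velocity_out": 0.1,   # r_vo
    "energy": -12.5,       # r_tau = -tau^T q_v
    "smooth": 0.95,        # r_smooth
    "terrain": -1.0,       # sum over feet of r_terrain,i
    "foot_height": -0.04,  # sum over feet of r_height,i
}
weights = {"velocity": 1.0, "velocity_out": 0.5, "energy": 0.002,
           "smooth": 0.3, "terrain": 1.0, "foot_height": 0.5}
```

Keeping the terms in a dictionary makes ablations (dropping or re-weighting a single term) a one-line change, which mirrors how the ablation studies in Section 6 are structured.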

3. Interaction with the Policy and Trajectory Generator

The control architecture integrates a deep neural network (DNN) policy with a parameterized trajectory generator (TG):

  • The DNN outputs a 14-dimensional action vector. a_1 and a_2 correspond to the residual swing frequency (\delta f) and residual foot height (\delta h), modulating the TG phase rate and lift height, respectively. The remaining a_3, \ldots, a_{14} provide residual joint-angle corrections.
  • In operation, the policy adjusts both the temporal properties of the gait and the vertical clearance of swing trajectories, informed by proprioceptive and exteroceptive state feedback.
  • Terrain-specific reward terms drive the DNN to modulate \delta h (raising the foot as needed to clear obstacles and minimizing lift where the ground is flat) and \delta f (adjusting swing timing for discrete or continuous terrain support).
  • The result is a controller that adapts both foot placement and kinematic parameters in response to terrain, achieving robust locomotion through a learning-based approach instead of relying on explicit model-based hierarchies.
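The action-vector split described above can be sketched as follows; the indexing follows the a_1, ..., a_14 convention in the text, while the function name and any downstream scaling are assumptions:

```python
import numpy as np

def split_action(action):
    """Split the policy's 14-D action into TG modulations and joint residuals."""
    action = np.asarray(action)
    assert action.shape == (14,)
    delta_f = action[0]           # a_1: residual swing frequency (TG phase rate)
    delta_h = action[1]           # a_2: residual foot height (TG lift height)
    joint_residuals = action[2:]  # a_3..a_14: residual corrections, 12 joints
    return delta_f, delta_h, joint_residuals
```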

4. Empirical Impact of Terrain-Specific Reward Shaping

Simulation and real-world evaluations of the described approach reveal distinct impacts from individual terrain-aware reward terms:

  • Stepping Stones & Poles (discrete, high-risk terrain): The foot-terrain reward proves essential. Policies omitting perception or using fixed-height TGs fail on large gaps or narrow poles (max travel ≈1.5–1.9 m for 25 cm gaps or 10–15 cm poles), while the full terrain-aware policy reliably traverses a 5 m course in all directions (mean ≈4.95 m, σ<0.1 m). The negative penalty for unsafe contacts compels the robot to “search” for safer footholds.
  • Stairs (continuous, moderate-risk terrain): The lifting-foot-height reward enables adaptive clearance, with the learned H + \delta h matching the step height (e.g., ≈3 cm for 3 cm stairs), compared to the static 10–12 cm lifts of non-terrain-aware controllers. This adaptation yields a 10–20% reduction in cumulative joint power relative to fixed-height baselines, supporting energy-efficient behavior.
  • Smooth, Continuous Blocks: All policies, including those without perception or terrain-awareness, can cross continuous blocks, indicating minimal risk in such settings. However, the terrain-aware policy maintains its energy advantage by reducing unnecessary foot lift.

5. Table: Terrain-Specific Reward Terms

| Reward Term | Formulation | Functional Purpose |
| --- | --- | --- |
| Foot-terrain reward | r_{\text{terrain},\,i} as defined above | Penalize unsafe foot contacts |
| Lifting-foot-height reward | r_{\text{height},\,i} as defined above | Penalize excessive foot lift (energy efficiency) |

6. Insights from Ablation and Hardware Validation

Empirical ablation studies validate the necessity of terrain-specific rewards:

  • The absence of exteroceptive perception or removal of the foot-lift modulation (i.e., using a fixed-height TG) results in policy failure when faced with discrete footholds or unnecessary energy expenditure on simple terrain.
  • In real-robot experiments, the approach demonstrated traversal of stepping stones with 25.5 cm gaps, mirroring simulation performance and supporting the real-world applicability of terrain-specific reward criteria.
  • The joint use of foot-terrain and lifting-foot-height rewards yields policies that simultaneously minimize fall risk and energy consumption. The DNN’s adaptive tuning of TG parameters under these combined rewards achieves terrain-awareness typically reserved for multi-level or model-based schemes (Shi et al., 2023).

7. Significance and Broader Context

Terrain-specific reward shaping facilitates fundamentally robust and adaptive quadrupedal locomotion in the presence of various environmental challenges. By leveraging exteroceptive feedback and penalizing unsuitable ground contacts as well as excessive energy use via tailored reward functions, terrain-aware policies can generalize from simulation to hardware and across a spectrum of terrains. A plausible implication is that such reward design principles could be generalized to other legged robot morphologies and terrain classes, subject to appropriate design of local reward criteria and exteroceptive sensing mechanisms. The approach bridges the gap between domain knowledge–driven controller design and end-to-end learning-based strategies, yielding interpretable, modular reward structures with clear behavioral consequences (Shi et al., 2023).
