Joint Torque Perturbation Injection

Updated 11 December 2025
  • The paper presents JT-SPI, a method that injects state-dependent torque perturbations to expand the range of dynamics discrepancies during simulation.
  • It utilizes an MLP to generate zero-mean perturbations applied to nominal joint torques, thereby improving robustness against unmodeled disturbances.
  • Empirical results show JT-SPI achieves 100% zero-shot success on real hardware, outperforming traditional domain randomization and ERFI methods.

Joint torque space perturbation injection (JT-SPI) is a methodology for improving the robustness of learned control policies for legged robots, particularly in the context of sim-to-real transfer for humanoid locomotion. Unlike standard domain randomization, which varies simulation parameters within a finite set, JT-SPI introduces direct, state-dependent perturbations to the joint torque inputs during simulation. This technique exposes control policies to a much broader and more abstract class of reality gaps, including those not easily parameterizable in standard simulators, thereby enabling superior generalization and resilience to unmodeled disturbances (Cha et al., 9 Apr 2025).

1. Mathematical Foundations

JT-SPI operates by perturbing the nominal joint torques generated by a learned policy $\pi_\theta$ at each control step in simulation. The key components are:

  • The policy produces normalized actions $a_t \in [-1,1]^n$ for the $n$ robot joints, mapped to torques via element-wise scaling by the torque limits $\tau_\text{lim}$:

\tau_\text{nom}(s_t) = \tau_{\text{input},t} = \tau_\text{lim} \odot a_t

  • A multi-layer perceptron (MLP) $\tau_\phi : O_\text{priv} \rightarrow \mathbb{R}^n$ injects a zero-mean, state-dependent perturbation:

\delta \tau_t = \tau_\phi(o_{\text{priv},t}) = \sigma_\text{lim} \cdot \tanh(\text{MLP}_\phi(\text{normalize}(o_{\text{priv},t})))

where $\sigma_\text{lim}$ is the maximum perturbation magnitude.

  • This perturbation can be modeled as sampling from a conditional distribution:

\delta \tau_t \sim \mathcal{N}(0, \Sigma(s_t))

where the covariance $\Sigma(s_t)$ is implicitly encoded by the MLP's output structure and scaling. Layer biases are set to zero, ensuring $\text{MLP}_\phi(0) = 0$, and inputs are normalized by their running standard deviation.

  • The forward dynamics are modified as:

s_{t+1} = f_\text{sim}(s_t, \tau_\text{nom}(s_t) + \delta \tau_t)

with the joint-space equation:

\bar{M}(q)\,\ddot{q} + \bar{C}(q,\dot{q}) + \bar{G}(q) + \bar{\tau}_\text{contact}(s) = \tau_\text{input} + \delta \tau

Effectively, JT-SPI substitutes the simulator's parametric error model $\tau_\text{DR}(s; p_\text{DR})$ with a sample from a broad family of nonlinear, state-dependent torque errors.
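The perturbation network above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, `obs_dim`, `n_joints`, and `seed` arguments are illustrative, and the running standard deviation is passed in externally.

```python
import numpy as np

def _xavier(fan_in, fan_out, rng):
    # Xavier/Glorot uniform initialization, used when resampling phi
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

class PerturbationMLP:
    """Zero-mean, state-dependent torque perturbation tau_phi (sketch)."""

    def __init__(self, obs_dim, n_joints, sigma_lim=50.0, hidden=256, seed=0):
        rng = np.random.default_rng(seed)
        # Biases are fixed at zero, so MLP_phi(0) = 0 by construction.
        self.W1 = _xavier(obs_dim, hidden, rng)
        self.W2 = _xavier(hidden, hidden, rng)
        self.W3 = _xavier(hidden, n_joints, rng)
        self.sigma_lim = sigma_lim

    def __call__(self, o_priv, running_std):
        x = o_priv / (running_std + 1e-8)   # divide-only normalization keeps 0 -> 0
        h = np.maximum(x @ self.W1, 0.0)    # ReLU
        h = np.maximum(h @ self.W2, 0.0)    # ReLU
        # tanh bounds the output to [-sigma_lim, sigma_lim]
        return self.sigma_lim * np.tanh(h @ self.W3)
```

Note that a zero privileged observation yields exactly zero perturbation, and the output magnitude can never exceed $\sigma_\text{lim}$, matching the two structural properties emphasized above.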

2. Training Protocol and Implementation

JT-SPI is incorporated into standard on-policy reinforcement learning pipelines such as PPO. The procedure comprises:

  • Use parallelized simulation environments, with perturbations injected into half of the environments and the other half left unmodified to prevent overfitting to the perturbed domain.
  • At each training rollout, sample new MLP weights $\phi$ via Xavier initialization; $\phi$ is not learned, but randomly resampled per episode.
  • Each simulation step applies:
    • the nominal policy torque, as above;
    • if in a perturbed environment, the perturbation $\delta \tau$ evaluated from the privileged observation $o_\text{priv}$, which includes normalized simulator state quantities (base pose, velocities, joint angles and velocities, input torque, contact wrenches);
    • simulation of the robot with $\tau_\text{nom} + \delta \tau$.
  • Policy and value updates follow standard PPO with an additional adversarial motion prior (AMP) loss and a small gradient penalty, aggregated as:

L_\text{total} = L_\text{PPO} + L_\text{AMP} + 0.002 \cdot L_\text{gradpen}

Typical perturbation and network parameters are $\sigma_\text{lim, joint} = 50$ Nm, $\sigma_\text{lim, base} = 80$ N, and a $2 \times 256$-unit ReLU MLP with tanh output, zero biases, and $\phi$ resampled per episode.
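The rollout protocol above can be sketched as follows. This is a hedged illustration: the environment count, observation dimension, torque limits, and the random actions standing in for the policy are all assumptions, and the physics step is stubbed out.

```python
import numpy as np

N_ENVS = 8
N_JOINTS = 4
OBS_DIM = 12
TAU_LIM = np.full(N_JOINTS, 100.0)   # illustrative per-joint torque limits (Nm)
SIGMA_LIM = 50.0                     # sigma_lim,joint from the text

rng = np.random.default_rng(0)

def sample_phi(hidden=256):
    # Fresh Xavier-initialized weights, resampled each rollout; biases stay zero.
    def xavier(fan_in, fan_out):
        lim = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-lim, lim, (fan_in, fan_out))
    return [xavier(OBS_DIM, hidden), xavier(hidden, hidden), xavier(hidden, N_JOINTS)]

def delta_tau(phi, o_priv):
    h = np.maximum(o_priv @ phi[0], 0.0)   # ReLU
    h = np.maximum(h @ phi[1], 0.0)        # ReLU
    return SIGMA_LIM * np.tanh(h @ phi[2]) # bounded perturbation

# Half of the parallel environments get perturbations; the rest stay nominal.
perturbed = np.arange(N_ENVS) < N_ENVS // 2

phi = sample_phi()   # resampled at the start of every rollout
for step in range(3):
    a_t = rng.uniform(-1.0, 1.0, (N_ENVS, N_JOINTS))  # stand-in policy actions
    tau = TAU_LIM * a_t                               # nominal torques
    o_priv = rng.standard_normal((N_ENVS, OBS_DIM))   # stand-in privileged obs
    tau[perturbed] += delta_tau(phi, o_priv[perturbed])
    # sim.step(tau) would advance the physics here
```

The unperturbed half of the batch always receives exactly the nominal torque, while perturbed environments deviate by at most $\sigma_\text{lim}$ per joint.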

3. Comparative Evaluation

Performance was benchmarked against:

  • Domain Randomization (DR): Standard randomization of simulation parameters (masses, inertias, friction, damping, actuator properties).
  • ERFI: State-independent random force-injection (baseline from Campanaro et al. 2024).
  • JT-SPI: As described, with state-dependent torque perturbations.

Key comparative findings:

| Scenario | DR | ERFI | JT-SPI |
| --- | --- | --- | --- |
| Nominal, target 0.4 m/s | $\pm 0.02$ m/s error | as DR | as DR |
| Actuator gap (stiffness 250) | 0/3 success (all fall) | 3/3 stable, error < 0.05 | 3/3 stable, error < 0.03 |
| Contact gap (soft ground) | 0/3 success (all fall) | 0/3 | 3/3, ≈0.38 m/s |
| Real robot, uneven floor | 2/3, RMS ≈0.06 m/s | 0/3 | 3/3, RMS ≈0.04 m/s |

Perturbing only half of the environments was found to be critical: perturbing all environments led to policy pathologies, while proper input normalization and zero biases in $\tau_\phi$ were necessary for stable learning.

JT-SPI achieves 100% zero-shot success on real hardware, compared to approximately 67% for DR and 0% for ERFI, indicating a significant expansion in the set of reality gaps handled (Cha et al., 9 Apr 2025).

4. Practical Implementation Guidelines

JT-SPI is compatible with physics engines such as MuJoCo or IsaacGym. Direct torque injection occurs via the control interface (ctrl vector). For floating-base robots, base forces may also be perturbed, though base moment perturbations are often omitted.

Efficient parallelization is advised: privileged observation normalization and MLP evaluation should be vectorized across environments and, if possible, directly implemented on the GPU (e.g., through custom PyTorch operators in IsaacGym).
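The running-standard-deviation normalization mentioned above can be sketched with a Welford-style accumulator. The class name and update scheme here are assumptions for illustration; the key property, preserved by dividing without subtracting the mean, is that a zero input maps to zero.

```python
import numpy as np

class RunningStd:
    """Running standard deviation for privileged-observation normalization (sketch)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)     # sum of squared deviations from the mean
        self.eps = eps

    def update(self, batch):
        # Welford's online algorithm, applied sample by sample
        for x in batch:
            self.count += 1
            d = x - self.mean
            self.mean += d / self.count
            self.m2 += d * (x - self.mean)

    def normalize(self, x):
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.eps
        # Divide-only normalization: zero input stays exactly zero.
        return x / std
```

In a vectorized training setup, the per-sample loop would be replaced by a batched update across all environments, but the statistics it maintains are the same.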

Recommended hyperparameters include:

  • $\sigma_\text{lim, joint} = 50$ Nm
  • $\sigma_\text{lim, base} = 80$ N
  • MLP with two 256-unit hidden layers, ReLU activations, tanh output, and zero biases
  • PPO learning rate $3 \times 10^{-4}$, minibatch size 98,304, 5,000 updates
  • AMP loss weight $\alpha \approx 0.5$, gradient penalty weight 0.002
  • Privileged state channels normalized by running standard deviation (zero maps to zero)
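For convenience, the recommended hyperparameters above can be bundled into one configuration mapping. The key names are illustrative assumptions, not identifiers from the paper's codebase; only the values come from the text.

```python
# Illustrative hyperparameter bundle; key names are assumptions,
# values are taken from the recommendations above.
JT_SPI_CONFIG = {
    "sigma_lim_joint": 50.0,      # Nm, joint-torque perturbation bound
    "sigma_lim_base": 80.0,       # N, base-force perturbation bound
    "mlp_hidden": (256, 256),     # ReLU hidden layers; tanh output, zero biases
    "ppo_learning_rate": 3e-4,
    "minibatch_size": 98_304,
    "num_updates": 5_000,
    "amp_loss_weight": 0.5,       # alpha, approximate
    "grad_penalty_weight": 0.002,
}
```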

5. Key Empirical Results

JT-SPI-trained policies demonstrate the following:

  • Across all tested dynamics perturbations (including actuator stiffness and soft ground), JT-SPI policies exhibit higher success rates and lower command-tracking errors than those trained with DR or ERFI.
  • Equivalent nominal performance: under standard simulation conditions, all methods achieve similar performance, with tracking error within $\pm 0.02$ m/s of a 0.4 m/s target velocity.
  • Superior resilience to unmodeled disturbances: JT-SPI policies yield more stable center-of-mass trajectories and contact force profiles.
  • In sim-to-real transfer to TOCABI humanoid hardware on an irregular laboratory floor, JT-SPI yields 3/3 success (RMS error $\approx 0.04$ m/s), whereas DR and ERFI attain 2/3 and 0/3, respectively.

6. Limitations and Potential Extensions

Current limitations of JT-SPI include:

  • The perturbation network weights $\phi$ are sampled randomly per episode; meta-learning adversarial or worst-case $\phi$ via adversarial RL (cf. RARL) is a potential extension.
  • The model assumes zero-mean perturbations; real hardware may present biased errors, which could be incorporated by learning bias terms.
  • Only joint torques and base forces (but not base moments) are perturbed; extending to contact-frame perturbations or state-channel noise may further enrich the modeled reality gaps.
  • Safety under very large $\sigma_\text{lim}$ is not guaranteed over long horizons; scheduled annealing of the perturbation magnitude during training could mitigate this risk.

This suggests that further extension of JT-SPI could yield even broader robustness, particularly by integrating learned bias, adversarial perturbation, or contact-channel noise modeling.

7. Broader Implications

By allowing arbitrary, state-dependent distortions in torque space, JT-SPI systematically enlarges the diversity of dynamics discrepancies to which policies are exposed during training. This method enables control policies to generalize beyond the finite families of simulated parameter uncertainty typically considered in domain randomization, thus achieving higher rates of zero-shot real-world success in complex humanoid locomotion tasks (Cha et al., 9 Apr 2025).
