Joint Torque Perturbation Injection
- The paper presents JT-SPI, a method that injects state-dependent torque perturbations to expand the range of dynamics discrepancies during simulation.
- It utilizes an MLP to generate zero-mean perturbations applied to nominal joint torques, thereby improving robustness against unmodeled disturbances.
- Empirical results show JT-SPI achieves 100% zero-shot success on real hardware, outperforming traditional domain randomization and ERFI methods.
Joint torque space perturbation injection (JT-SPI) is a methodology for improving the robustness of learned control policies for legged robots, particularly in the context of sim-to-real transfer for humanoid locomotion. Unlike standard domain randomization, which varies simulation parameters within a finite set, JT-SPI introduces direct, state-dependent perturbations to the joint torque inputs during simulation. This technique exposes control policies to a much broader and more abstract class of reality gaps, including those not easily parameterizable in standard simulators, thereby enabling superior generalization and resilience to unmodeled disturbances (Cha et al., 9 Apr 2025).
1. Mathematical Foundations
JT-SPI operates by perturbing the nominal joint torques generated by a learned policy for each control step in simulation. The key components are:
- The policy produces normalized actions $a_t \in [-1, 1]^n$ for the robot's $n$ joints, mapping to torques via element-wise scaling with the torque limits $\tau_{\max}$: $\tau_t = a_t \odot \tau_{\max}$.
- A multi-layer perceptron (MLP) $f_\phi$ injects a zero-mean, state-dependent perturbation $\delta\tau_t = \tilde{\tau}_{\max}\, f_\phi(s_t)$, where $\tilde{\tau}_{\max}$ is the maximum perturbation magnitude and $s_t$ is the privileged simulator state.
- This perturbation can be modeled as sampling from a conditional distribution $\delta\tau_t \sim p_\phi(\cdot \mid s_t)$, where the covariance is implicitly encoded by the MLP's output structure and scaling. Layer biases are set to zero, ensuring $f_\phi(0) = 0$, and inputs are normalized by their running standard deviation.
- The forward dynamics are modified to apply $\tau_t + \delta\tau_t$ in place of $\tau_t$, with the joint-space equation $M(q)\ddot{q} + C(q, \dot{q})\dot{q} + g(q) = \tau_t + \delta\tau_t + J_c^\top F_c$, where $M$ is the joint-space inertia matrix, $C\dot{q}$ the Coriolis/centrifugal terms, $g$ gravity, and $J_c^\top F_c$ the contact wrenches.
Effectively, JT-SPI substitutes the simulator’s parametric error model with a sample from a broad family of nonlinear, state-dependent torque errors.
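The perturbation structure above can be sketched as follows. This is a minimal NumPy stand-in; the class and function names (`PerturbationMLP`, `perturbed_torque`) and the layer shapes are illustrative assumptions, not identifiers from the paper:

```python
import numpy as np

def xavier(shape, rng):
    # Xavier/Glorot uniform initialization for one weight matrix
    limit = np.sqrt(6.0 / (shape[0] + shape[1]))
    return rng.uniform(-limit, limit, size=shape)

class PerturbationMLP:
    """Zero-bias MLP: ReLU hidden layer, tanh output in [-1, 1]^n.

    With no biases, a zero (normalized) state maps to a zero perturbation.
    """
    def __init__(self, state_dim, n_joints, hidden, rng):
        self.W1 = xavier((state_dim, hidden), rng)
        self.W2 = xavier((hidden, n_joints), rng)

    def __call__(self, s):
        h = np.maximum(s @ self.W1, 0.0)   # ReLU, no bias term
        return np.tanh(h @ self.W2)        # bounded output, f(0) = 0

def perturbed_torque(a, s, mlp, tau_max, tau_tilde_max):
    tau = a * tau_max                      # nominal policy torque
    delta = tau_tilde_max * mlp(s)         # state-dependent, zero-mean perturbation
    return tau + delta
```

Because the tanh output is bounded in $[-1, 1]$, the injected perturbation is always bounded by the chosen maximum magnitude.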
2. Training Protocol and Implementation
JT-SPI is incorporated into standard on-policy reinforcement learning pipelines such as PPO. The procedure comprises:
- Use parallelized simulation environments, with perturbations injected into half the environments and the other half left unmodified to prevent overfitting to the perturbed domain.
- At each training rollout, sample new MLP weights $\phi$ via Xavier initialization; $\phi$ is not learned, but randomly resampled per episode.
- Each simulation step applies:
  - The nominal policy torque $\tau_t = a_t \odot \tau_{\max}$ as above.
  - If in a perturbed environment: evaluate $\delta\tau_t = \tilde{\tau}_{\max}\, f_\phi(s_t)$ using the privileged observation $s_t$, which includes normalized simulator state quantities (base pose, velocities, joint angles and velocities, input torque, contact wrenches).
  - The robot is then simulated with the total torque $\tau_t + \delta\tau_t$.
- Policy and value updates follow standard PPO with an additional adversarial motion-prior (AMP) loss and a small gradient penalty, aggregated as $L = L_{\text{PPO}} + w_{\text{AMP}} L_{\text{AMP}} + w_{\text{GP}} L_{\text{GP}}$.
Perturbation and network parameters are typically: a maximum joint-torque perturbation $\tilde{\tau}_{\max}$ (in Nm), a maximum base-force perturbation (in N), a ReLU MLP with tanh output and zero biases, and $\phi$ resampled per episode.
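The per-step rollout logic above can be sketched as follows. The half-perturbed environment split and per-environment perturbation networks follow the paper; the function names, shapes, and the simple first-half/second-half split are illustrative assumptions:

```python
import numpy as np

def make_env_masks(num_envs):
    # Perturb half of the parallel environments; leave the rest nominal
    # to prevent the policy from overfitting to the perturbed domain.
    mask = np.zeros(num_envs, dtype=bool)
    mask[: num_envs // 2] = True
    return mask

def rollout_step(actions, states, mlps, mask, tau_max, tau_tilde_max):
    """One simulation step: nominal torques plus per-env perturbations.

    actions: (E, n) normalized policy actions; states: (E, d) privileged
    observations; mlps: one (resampled-per-episode) perturbation net per env.
    """
    tau = actions * tau_max                 # nominal policy torques
    delta = np.zeros_like(tau)
    for i in np.flatnonzero(mask):          # perturbed environments only
        delta[i] = tau_tilde_max * mlps[i](states[i])
    return tau + delta
```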
3. Comparative Evaluation
Performance was benchmarked against:
- Domain Randomization (DR): Standard randomization of simulation parameters (masses, inertias, friction, damping, actuator properties).
- ERFI: State-independent random force-injection (baseline from Campanaro et al. 2024).
- JT-SPI: As described, with state-dependent torque perturbations.
Key comparative findings:
| Scenario | DR | ERFI | JT-SPI |
|---|---|---|---|
| Nominal, target 0.4 m/s | low tracking error | comparable to DR | comparable to DR |
| Actuator gap (stiffness 250) | 0/3 success (all fall) | 3/3 stable, <0.05 m/s error | 3/3 stable, <0.03 m/s error |
| Contact gap (soft ground) | 0/3 success (all fall) | 0/3 | 3/3, ≈0.38 m/s achieved |
| Real robot, uneven floor | 2/3, RMS ≈0.06 m/s | 0/3 | 3/3, RMS ≈0.04 m/s |
Perturbing only half the environments was found to be critical: perturbing all environments led to policy pathologies, while proper input normalization and zero biases in $f_\phi$ were necessary for stable learning.
JT-SPI achieves $3/3$ (100%) zero-shot success on real hardware, compared to $2/3$ for DR and $0/3$ for ERFI, indicating a significant expansion in the set of reality gaps handled (Cha et al., 9 Apr 2025).
4. Practical Implementation Guidelines
JT-SPI is compatible with physics engines such as MuJoCo or IsaacGym. Direct torque injection occurs via the control interface (ctrl vector). For floating-base robots, base forces may also be perturbed, though base moment perturbations are often omitted.
Efficient parallelization is advised: privileged observation normalization and MLP evaluation should be vectorized across environments and, if possible, directly implemented on the GPU (e.g., through custom PyTorch operators in IsaacGym).
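A batched evaluation of per-environment perturbation MLPs can be vectorized with `einsum`, as in this CPU NumPy stand-in for the GPU implementation described above (the stacked-weight layout is an illustrative assumption):

```python
import numpy as np

def batched_perturbation(S, W1, W2, tau_tilde_max):
    """Evaluate E independent zero-bias perturbation MLPs in one pass.

    S:  (E, d) normalized privileged states, one row per environment.
    W1: (E, d, h) stacked first-layer weights (one MLP per environment).
    W2: (E, h, n) stacked output-layer weights.
    Returns (E, n) torque perturbations bounded by tau_tilde_max.
    """
    H = np.maximum(np.einsum("ed,edh->eh", S, W1), 0.0)  # batched ReLU layer
    return tau_tilde_max * np.tanh(np.einsum("eh,ehn->en", H, W2))
```

The same einsum expressions translate directly to `torch.einsum` on GPU tensors, which is the natural route in an IsaacGym pipeline.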
Recommended hyperparameters include:
- Maximum joint-torque perturbation $\tilde{\tau}_{\max}$ (Nm)
- Maximum base-force perturbation (N)
- MLP with ReLU activations, tanh output, and zero biases
- PPO learning rate, minibatch size $98,304$, $5,000$ updates
- AMP loss weight, gradient penalty $0.002$
- Privileged state channels normalized by running standard deviation (zero maps to zero)
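The last item, normalization by running standard deviation only (no mean subtraction, so zero maps to zero), can be sketched as follows; the class name and incremental-update form are illustrative assumptions:

```python
import numpy as np

class RunningStdNormalizer:
    """Normalize each channel by its running std only, so zero maps to zero.

    Skipping mean subtraction preserves f_phi(0) = 0 for the zero-bias
    perturbation MLP fed by these normalized channels.
    """
    def __init__(self, dim, eps=1e-8):
        self.mean_sq = np.zeros(dim)   # running estimate of E[x^2]
        self.count = 0
        self.eps = eps

    def update(self, x):
        self.count += 1
        self.mean_sq += (x ** 2 - self.mean_sq) / self.count

    def __call__(self, x):
        return x / (np.sqrt(self.mean_sq) + self.eps)
```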
5. Key Empirical Results
JT-SPI-trained policies demonstrate the following:
- Across all tested dynamics perturbations (including actuator stiffness and soft ground), JT-SPI policies exhibit higher success rates and lower command-tracking errors than those trained with DR or ERFI.
- Equivalent nominal performance: under standard simulation conditions, all methods achieve similar performance, with comparable tracking error at a $0.4$ m/s target velocity.
- Superior resilience to unmodeled disturbances: JT-SPI policies yield more stable center-of-mass trajectories and contact force profiles.
- In sim-to-real transfer to TOCABI humanoid hardware on an irregular laboratory floor, JT-SPI yields $3/3$ success (RMS error ≈0.04 m/s), whereas DR and ERFI attain $2/3$ and $0/3$, respectively.
6. Limitations and Potential Extensions
Current limitations of JT-SPI include:
- The perturbation network weights $\phi$ are sampled randomly per episode; meta-learning adversarial or worst-case $f_\phi$ via adversarial RL (cf. RARL) is a potential extension.
- The model assumes zero-mean perturbations; real hardware may present biased errors, which could be incorporated by learning bias terms.
- Only joint torques and base forces (but not base moments) are perturbed; extending to contact-frame perturbations or state-channel noise may further enrich the modeled reality gaps.
- Safety under very large $\tilde{\tau}_{\max}$ is not guaranteed over long horizons; scheduled annealing of the perturbation magnitude during training could mitigate this risk.
This suggests that further extension of JT-SPI could yield even broader robustness, particularly by integrating learned bias, adversarial perturbation, or contact-channel noise modeling.
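One plausible form of the scheduled-annealing idea mentioned above is a curriculum that ramps the perturbation magnitude linearly from zero to its final value early in training. This is a hypothetical sketch, not a schedule specified in the paper:

```python
def annealed_tau_tilde(step, total_steps, tau_tilde_final, warmup_frac=0.5):
    """Linearly ramp the perturbation magnitude from 0 to tau_tilde_final
    over the first warmup_frac of training, then hold it constant."""
    ramp = min(1.0, step / (warmup_frac * total_steps))
    return ramp * tau_tilde_final
```

A decaying or cosine schedule toward the end of training would be an equally reasonable variant if late-training stability is the main concern.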
7. Broader Implications
By allowing arbitrary, state-dependent distortions in torque space, JT-SPI systematically enlarges the diversity of dynamics discrepancies to which policies are exposed during training. This method enables control policies to generalize beyond the finite families of simulated parameter uncertainty typically considered in domain randomization, thus achieving higher rates of zero-shot real-world success in complex humanoid locomotion tasks (Cha et al., 9 Apr 2025).