Delta Action Modeling (ASAP)
- Delta action modeling in ASAP is a direct action-space correction method that learns a residual policy to align simulated actions with real-world dynamics.
- The approach leverages a two-stage workflow where a nominal policy is first trained in simulation and then fine-tuned with real-world data to reduce tracking errors.
- Quantitative results demonstrate that ASAP’s delta action framework consistently outperforms conventional techniques in sim-to-sim and sim-to-real transfers.
Delta action modeling, as operationalized within the ASAP framework (Aligning Simulation and Real-World Physics), is a direct action-space compensation method for mitigating the dynamics gap between simulation and real-world control in agile humanoid robotics. Rather than tuning simulation parameters or randomizing environmental properties, a learned residual action policy outputs corrections that, when added to standard simulated actions, induce simulated transitions that closely reproduce real-world robot trajectories. This approach enables whole-body policies with high agility and coordination, surpassing conventional system identification, domain randomization, and delta dynamics learning in both sim-to-sim and sim-to-real transfer regimes (He et al., 3 Feb 2025).
1. Conceptual Foundations of Delta Action Modeling
Delta action modeling is predicated on the observation that even high-fidelity simulators exhibit systematic discrepancies from real-world robot dynamics, especially for agile, highly-actuated humanoid behaviors. Instead of correcting for these mismatches at the level of physical parameters (as in system identification), or relying on aggressive domain randomization (which can yield overly conservative policies), the delta modeling approach operates directly in the action space. Formally, a corrective (residual) action policy $\pi^{\Delta}$ is learned such that
$a_t^{\mathrm{corr}} = a_t + \pi^{\Delta}(s_t, a_t),$
where $a_t$ is the nominal simulated action and $\pi^{\Delta}(s_t, a_t)$ is the residual produced for the state-action pair $(s_t, a_t)$. When applying $a_t^{\mathrm{corr}}$ in simulation, the subsequent transition is trained to match the real-world next state as closely as possible. This structure allows the original policy to remain robust to unmodeled dynamics after fine-tuning within this residual-corrected simulator.
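The corrected simulation step can be sketched in a few lines. This is a minimal illustration, assuming generic callables `f_sim`, `pi`, and `pi_delta` as hypothetical stand-ins for the simulator transition, the nominal policy, and the residual policy:

```python
import numpy as np

def corrected_step(f_sim, pi, pi_delta, s_t):
    """Step the simulator with the nominal action plus the learned residual.

    f_sim:    simulator transition, s_{t+1} = f_sim(s_t, a_t)  (hypothetical)
    pi:       nominal policy, a_t = pi(s_t)                    (hypothetical)
    pi_delta: residual policy, delta_a_t = pi_delta(s_t, a_t)  (hypothetical)
    """
    a_t = pi(s_t)                      # nominal simulated action
    delta_a = pi_delta(s_t, a_t)       # learned action-space correction
    return f_sim(s_t, a_t + delta_a)   # transition under the corrected action
```

The residual enters only through the action argument, which is what keeps the correction independent of the simulator's internal parameters.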
2. Mathematical Architecture and Training Protocol
Let $\pi$ denote the nominal policy pre-trained entirely in simulation, producing actions $a_t = \pi(s_t)$. The delta action policy $\pi^{\Delta}$ is parameterized as a multilayer perceptron. Its input is the concatenation of $s_t$ and $a_t$ (or a reduced subset for the ankle-only model), giving a total input dimension of 82 (resp. 63). The network comprises two shared hidden layers (width 256, ReLU activations; layer normalization before each), a linear output layer matching the action dimension, and a scaled tanh for range limiting. L2 weight regularization is applied.
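The forward pass of such a network can be sketched in plain numpy. The residual scale (`scale=0.1`) and the random initialization below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def delta_policy_forward(x, params, scale=0.1):
    """Forward pass of the delta action MLP (sketch).

    x:      concatenated (s_t, a_t), shape (82,) for the full model.
    params: weight matrices / biases; randomly initialized here.
    scale:  hypothetical bound on residual magnitude via scaled tanh.
    """
    h = x
    for W, b in [(params["W1"], params["b1"]), (params["W2"], params["b2"])]:
        h = layer_norm(h)               # layer normalization before each hidden layer
        h = np.maximum(W @ h + b, 0.0)  # ReLU
    out = params["Wo"] @ h + params["bo"]
    return scale * np.tanh(out)         # bounded residual action

params = {
    "W1": rng.normal(0, 0.05, (256, 82)),  "b1": np.zeros(256),
    "W2": rng.normal(0, 0.05, (256, 256)), "b2": np.zeros(256),
    "Wo": rng.normal(0, 0.05, (23, 256)),  "bo": np.zeros(23),
}
```

The scaled tanh guarantees the residual stays within a fixed band, so the corrected action can never stray arbitrarily far from the nominal one.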
The reward for delta-model training at each step incorporates the squared error in state prediction as well as a penalty on the residual action norm:
$r_t = -\lVert s_{t+1}^{\mathrm{sim}} - s_{t+1}^{\mathrm{real}} \rVert^2 - \lambda \lVert \pi^{\Delta}(s_t, a_t) \rVert^2,$
where $\lambda$ weights the action penalty.
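A reward of this shape is straightforward to compute; the sketch below assumes a hypothetical penalty weight `w_action` and flat state vectors:

```python
import numpy as np

def delta_training_reward(s_next_sim, s_next_real, delta_a, w_action=0.01):
    """Per-step reward for delta-model training (sketch).

    Penalizes the squared error between the simulated next state (under the
    corrected action) and the logged real next state, plus the squared norm
    of the residual action. `w_action` is a hypothetical weight.
    """
    state_err = float(np.sum((s_next_sim - s_next_real) ** 2))
    action_pen = float(np.sum(delta_a ** 2))
    return -(state_err + w_action * action_pen)
```

The action-norm term discourages the residual from drifting into large compensations that would mask, rather than correct, the dynamics gap.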
3. Two-Stage ASAP Workflow
Stage 1: Nominal Policy Training in Simulation
- Human motion data is retargeted (via TRAM [wang2025tram]) and cleaned for robot feasibility using MaskedMimic in IsaacGym, yielding physical motion tracks for imitation learning.
- The policy is trained with PPO using actor–critic MLPs (2×256), domain randomization on friction, PD gains, and delay, and a phase curriculum. Actions are target joint angles for a low-level PD controller. The reward prioritizes reference tracking and regularizes joint and base motion.
Stage 2: Delta Action Learning and Fine-Tuning
- The sim-trained policy is deployed on a Unitree G1 robot. Full proprioception (base position, velocities, joint states) is logged at 100 Hz via MoCap.
- The delta policy is trained using PPO in IsaacGym, replaying real-world action/state pairs, initializing the simulator at each logged real state $s_t$, and maximizing reward for sim-to-real state alignment.
- Once converged, the trained delta policy $\pi^{\Delta}$ is integrated into the simulator. The nominal policy is then fine-tuned in this residual-corrected simulation, ensuring robustness to real-world discrepancies.
- The policy is deployed in the real environment with the learned residual for closed-loop evaluation.
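The replay logic of Stage 2 can be illustrated with a toy one-dimensional system. Here the "real" dynamics differ from the simulator by a constant actuation deficit, so the ideal residual simply cancels the bias; both dynamics functions are hypothetical stand-ins, not the ASAP models:

```python
# Toy illustration of Stage 2 replay: the simulator is reset to each logged
# real state s_t, stepped with the corrected action, and compared against
# the logged real next state.
sim_step  = lambda s, a: s + a         # simulator transition (toy)
real_step = lambda s, a: s + a - 0.2   # "real" dynamics with an action deficit

# Logged real-world rollout under a constant nominal action a_t = 0.5
states_real = [0.0]
for _ in range(5):
    states_real.append(real_step(states_real[-1], 0.5))

# Replay in simulation, initializing at each logged real state s_t
delta_a = -0.2   # the residual a trained delta policy should converge to
errors = []
for t in range(5):
    s_next_sim = sim_step(states_real[t], 0.5 + delta_a)
    errors.append(abs(s_next_sim - states_real[t + 1]))
```

With the correct residual, every replayed simulator step lands exactly on the logged real next state, which is the alignment objective the PPO reward encodes.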
Pseudocode for these stages is provided in the ASAP manuscript (see Technical Report, Sec. 5) (He et al., 3 Feb 2025).
4. Quantitative Evaluation and Empirical Performance
Evaluation metrics include global mean per-joint position error (g-MPJPE), root-relative MPJPE, acceleration error ($E_{\mathrm{acc}}$), velocity error ($E_{\mathrm{vel}}$), and success rate (percentage of rollouts remaining within 0.5 m of the reference). Closed-loop performance was examined in sim-to-sim settings (IsaacGym-to-IsaacSim, IsaacGym-to-Genesis) and sim-to-real on the Unitree G1.
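The two headline metrics are simple to compute from tracked keypoints. The array layouts below (metres, shape `(T, J, 3)` for joints and `(N, T, 3)` for root trajectories) are assumptions for illustration:

```python
import numpy as np

def mpjpe_mm(pred, ref):
    """Mean per-joint position error in millimetres.

    pred, ref: joint positions in metres, shape (T, J, 3) (assumed layout).
    """
    return 1000.0 * float(np.mean(np.linalg.norm(pred - ref, axis=-1)))

def success_rate(root_pred, root_ref, threshold_m=0.5):
    """Fraction of rollouts whose root stays within `threshold_m` of the
    reference over the entire rollout; inputs shape (N, T, 3)."""
    dist = np.linalg.norm(root_pred - root_ref, axis=-1)        # (N, T)
    return float(np.mean(np.all(dist <= threshold_m, axis=-1)))
```

Note that success rate is an all-or-nothing per-rollout criterion: a single excursion beyond 0.5 m marks the whole rollout as a failure.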
Table: Representative Results for Sim-to-Sim and Sim-to-Real Transfer
| Scenario | Method | Succ (%) | g-MPJPE (mm) | MPJPE (mm) |
|---|---|---|---|---|
| IsaacGym→IsaacSim | Vanilla | 100 | 107 | 45.4 |
| IsaacGym→IsaacSim | ASAP | 100 | 106 | 44.3 |
| IsaacGym→Genesis | Vanilla | 100 | 140 | 70.1 |
| IsaacGym→Genesis | ASAP | 100 | 125 | 73.5 |
| Unitree G1 (Kick) | Vanilla | – | 61.2 | 43.5 |
| Unitree G1 (Kick) | ASAP | – | 50.2 | 40.1 |
ASAP’s delta action approach consistently yields lower tracking errors and higher success rates. In the LeBron "Silencer" out-of-distribution test, ASAP achieves $47.5$ mm MPJPE versus $55.3$ mm for the vanilla policy, and reduces g-MPJPE from $159.0$ mm (vanilla) to $112.0$ mm.
Ablations confirm that naïve action-noise injection, fixed-point, or gradient-based compensation underperform relative to RL-trained delta action models. Residual actions learned via delta modeling yield direct improvements in tracking and agility for both in-distribution and OOD motions.
5. Comparative Analysis with Preceding Sim-to-Real Techniques
Delta action modeling differs fundamentally from:
- System Identification (SysID): Instead of exhaustive search or experiment-driven parameter fitting, delta action modeling obviates explicit actuator or contact parameter selection, and does not require torque sensing.
- Domain Randomization (DR): Large-scale randomization yields robust but often overly conservative policies. Delta action modeling retains agility by focusing correction on the minimal action-space residuals required for sim-to-real consistency.
- Delta Dynamics Learning: Modeling residuals in the state transition domain may suffer from accumulating prediction errors; direct action-space correction avoids this compounding and leverages RL objectives to target rollout accuracy.
6. Strengths, Limitations, and Extension Directions
Strengths
- Direct action-space correction bridges sim-to-real mismatch without high-dimensional parameter tuning.
- The residual correction is low-dimensional and data-efficient, requiring fewer real-world samples than full-dynamics identification.
- Integrating a frozen delta action model during fine-tuning prevents instability due to drifting compensation or overfitting.
Limitations
- Training full 23-DoF delta models currently faces thermal and hardware limitations on real robots, partially addressed by focusing on the ankle joints.
- Dependence on optical motion capture for ground truth state acquisition; more scalable alternatives could include markerless vision or onboard estimation.
- Data demand for residual learning could be reduced through meta-learning or online adaptation methodologies.
Future Extensions
- Application of delta action models to other legged platforms, such as quadrupeds or exoskeletons.
- Integration with probabilistic dynamics models or learned priors for hybrid sim-to-real adaptation.
- Reduction in real-world data requirements via few-shot learning techniques.
7. Role within the Broader Reinforcement Learning and Robotics Landscape
Delta action modeling in ASAP exemplifies a contemporary focus on sim-to-real transfer via direct function approximation in action space, rather than indirect adjustment at the parameter or transition level. This enables rapid deployment of expressive and highly agile humanoid skills. The methodology can be seen as a generalizable corrective strategy, potentially complementary to techniques that promote action smoothness, such as the action-predictive ASAP formulation for RL oscillation reduction (Kwak et al., 26 Jan 2026), but uniquely tailored for high fidelity dynamical consistency between synthetic and physical platforms (He et al., 3 Feb 2025).