Compliant Residual DAgger in Robotic Manipulation
- The paper introduces CR-DAgger, a method that integrates a compliant human interface with force-informed residual policy learning to enable precise on-policy corrections in contact-rich robotic tasks.
- CR-DAgger employs an admittance-style compliant controller to log human corrections, generating delta action datasets that efficiently train a residual policy alongside a base policy.
- Empirical results show substantial improvements, with success rates rising from 40% to 100% on book flipping and from 20% to 70% on belt assembly.
Compliant Residual DAgger (CR-DAgger) is an extension to the classical Dataset Aggregation (DAgger) framework for learning control policies in real-world, contact-rich robotic manipulation, with a focus on enabling efficient and precise on-policy human corrections. By introducing a compliant human-in-the-loop interface and a force-informed residual policy learner, CR-DAgger addresses the central challenges of action correction collection and policy updating in physical environments that exhibit complex contact dynamics. The methodology demonstrates significant gains in manipulation success rates on tasks such as book flipping and belt assembly—outperforming traditional retraining and finetuning strategies—using minimal human intervention data (Xu et al., 20 Jun 2025).
1. Compliant Intervention Interface
The core innovation in CR-DAgger is the Compliant Intervention Interface, which facilitates the collection of corrective actions from humans without disrupting autonomous policy execution. This is achieved through an admittance-style compliance controller on the robot's end effector. The dynamics of the compliant controller in continuous time are governed by

$$M\ddot{x} + D\dot{x} + K\,(x - x_d) = f_{\mathrm{ext}},$$

where $M$, $D$, and $K$ are diagonal matrices representing the virtual mass, damping, and stiffness, respectively, and $f_{\mathrm{ext}}$ is the sum of environmental and human-applied forces.

In discrete implementation, the commanded control at time $t$ is

$$x_t^{\mathrm{cmd}} = x_t^{d} + K^{-1} f_{\mathrm{ext},t},$$

or equivalently, the compliant offset in the quasi-static limit is $\Delta x_t = x_t^{\mathrm{cmd}} - x_t^{d} = K^{-1} f_{\mathrm{ext},t}$.

In practical deployments, $K$ is set at approximately $1000$ N/m for each axis and $D$ is tuned for critical damping, yielding a compliant, "soft" interface. Under this configuration, human operators feel the intended motions via haptic back-drivability and can inject precise, low-amplitude delta corrections during ongoing policy execution.
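As a concrete illustration, the quasi-static limit of the admittance law (neglecting inertial and damping terms) reduces to a per-axis force-to-displacement scaling. The function below is an illustrative sketch, not the paper's implementation:

```python
def compliant_command(x_nominal, f_ext, k=1000.0):
    """Quasi-static admittance offset: shift each axis of the nominal
    command by the measured external force divided by the stiffness k.
    With k = 1000 N/m, 1 N of applied force yields 1 mm of displacement."""
    return [x + f / k for x, f in zip(x_nominal, f_ext)]
```

Pushing with 1 N along one axis moves the commanded end-effector position by 1 mm, which matches the tuning guideline given later in the deployment recommendations.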
Human corrections are logged using a detachable handle and button: when the button is pressed, at each timestep $t$,
- $s_t$ (full robot state, including proprioception and image data),
- $f_t$ (measured 6D force/torque from the end-effector sensor),
- and $a_t^{\mathrm{base}}$ and $a_t^{\mathrm{comp}}$ (nominal and compliant actions)

are recorded. The delta action $\Delta a_t = a_t^{\mathrm{comp}} - a_t^{\mathrm{base}}$ is computed and stored, establishing a dataset of state-force-delta tuples for residual learning.
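This logging step can be sketched in a few lines; the function and field names below are hypothetical, and the delta is simply the compliant action minus the nominal one:

```python
def log_correction(dataset, state, force, a_base, a_compliant):
    """Append one state-force-delta tuple to the intervention dataset.
    Intended to be called only while the operator holds the button."""
    delta = [ac - ab for ac, ab in zip(a_compliant, a_base)]
    dataset.append({"state": state, "force": force, "delta": delta})
    return delta
```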
2. Compliant Residual Policy Formulation
To efficiently leverage the delta correction dataset, CR-DAgger introduces the Compliant Residual Policy $\pi_{\mathrm{res}}$. The method freezes the base policy $\pi_{\mathrm{base}}$'s vision and temporal backbones, enabling the new policy to focus on producing short-horizon residual trajectories.
At inference, $\pi_{\mathrm{res}}$ receives as input the current state $s_t$ and force measurement $f_t$ (and optionally the base action $a_t^{\mathrm{base}}$), and outputs a residual action sequence $\Delta a_t$. The final command is the sum

$$a_t = a_t^{\mathrm{base}} + \Delta a_t,$$

executed at 50 Hz. Supervised training of $\pi_{\mathrm{res}}$ proceeds with mean-squared error loss,

$$\mathcal{L} = \left\| \pi_{\mathrm{res}}(s_t, f_t) - \Delta a_t \right\|_2^2,$$
with L2 regularization and explicit zero-residual labels on "no-correction" intervals.
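The objective can be written out in plain Python as a sketch (the actual method trains a neural network with minibatch gradient descent; names here are illustrative):

```python
def residual_loss(pred_delta, target_delta, weights, l2=1e-4):
    """Mean-squared error between predicted and recorded residual actions,
    plus L2 weight decay. Frames without human correction carry an
    explicit zero target, pulling the residual toward inaction there."""
    mse = sum((p - t) ** 2 for p, t in zip(pred_delta, target_delta)) / len(target_delta)
    reg = l2 * sum(w * w for w in weights)
    return mse + reg
```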
3. Algorithmic Workflow
CR-DAgger operates in a DAgger-style loop, but a single aggregation batch suffices when the base policy achieves at least 10–20% initial success. The intervention dataset is collected as the robot executes the current policy; human corrections are incorporated on the fly via the compliant interface. Following data collection (approximately 50 episodes), the residual policy $\pi_{\mathrm{res}}$ is trained on the accumulated corrections, and the final policy is the composition $\pi = \pi_{\mathrm{base}} + \pi_{\mathrm{res}}$.
Key steps include:
- Deploy the base policy $\pi_{\mathrm{base}}$; apply the compliant control law.
- Record $(s_t, f_t, \Delta a_t)$ whenever human intervention occurs.
- After a single aggregation batch, train $\pi_{\mathrm{res}}$ on all collected deltas.
- Deploy the composite policy $\pi_{\mathrm{base}} + \pi_{\mathrm{res}}$ for evaluation or further use.
The force information is integral both to the compliant control at execution and as an input channel to the learned residual policy.
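The workflow above can be sketched as a single aggregation round; every callable here (environment step, human-correction oracle, trainer) is a placeholder standing in for the real system, not an API from the paper:

```python
def cr_dagger_round(base_policy, get_correction, env_step, reset,
                    train_residual, n_episodes=50, horizon=100):
    """One CR-DAgger aggregation batch: roll out the base policy under
    compliant control, log human delta corrections, fit the residual,
    and return the composite policy (base + residual) with the dataset."""
    dataset = []
    for _ in range(n_episodes):
        state = reset()
        for _ in range(horizon):
            a_base = base_policy(state)
            delta = get_correction(state, a_base)  # None = no intervention
            if delta is not None:
                dataset.append((state, a_base, delta))
            a_exec = a_base + (delta if delta is not None else 0.0)
            state = env_step(state, a_exec)
    residual = train_residual(dataset)
    return (lambda s: base_policy(s) + residual(s)), dataset
```

For readability the sketch uses scalar actions; the real system operates on 6D end-effector actions at 50 Hz.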
4. Quantitative Performance and Empirical Results
CR-DAgger was evaluated on two real-world contact-rich tasks using a UR5 arm equipped with a wrist camera and ATI 6-axis force/torque sensor:
- Book flipping: Inserting prongs under a book, flipping it upright, and pushing flush to a stand.
- Belt assembly: Threading a narrow belt onto pulleys, tensioning, and releasing.
The base policy, a diffusion-based visuomotor model trained on 150 demonstrations, reached 40% and 20% success on book flipping and belt assembly, respectively. CR-DAgger achieved the following improvements:
| Task | Base Policy | Retrain Offline | Finetune | Position-only Residual | CR-DAgger |
|---|---|---|---|---|---|
| Book Flipping | 40% | ~38% | ~10% | 70% | 100% |
| Belt Assembly | 20% | ~25% | ~5% | 50% | 70% |
Stage-wise analysis highlights that, in the book flipping task's "push" phase, CR-DAgger predicted approximately 15 N of additional push force, yielding 100% stage success compared to 35% for position-only residuals. In belt threading, force feedback increased alignment success by 30 percentage points (Xu et al., 20 Jun 2025).
5. Practical Implementation Recommendations
Deployment of CR-DAgger involves tuning several key parameters and adhering to empirically derived guidelines:
- Set $K \approx 1000$ N/m and tune $D$ for critical damping, so that 1 N of human-applied force induces roughly 1 mm of end-effector displacement.
- Instruct human annotators to provide corrections at the first indication of policy drift, but restrict interventions to low-amplitude, continuous deltas rather than full teleoperation.
- Aggregate data in a single batch of approximately 50 correction episodes if the base policy demonstrates at least moderate success. Iterative or small minibatch collection is discouraged to mitigate instability.
- Oversample training frames immediately post-correction onset (by a factor of four) to enhance the system's reactive behavior.
- For new tasks, ensure an initial “seed” policy with at least 10–20% baseline success, implement a compliant handle and F/T sensing on the robot, collect on-policy corrections, and focus training around observed failure points.
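The oversampling guideline above reduces to duplicating onset frames in the training set; a minimal sketch with hypothetical names:

```python
def oversample_onsets(frames, is_onset, factor=4):
    """Repeat frames at correction onset `factor` times so the residual
    policy learns to react the moment a deviation first appears."""
    out = []
    for frame, onset in zip(frames, is_onset):
        out.extend([frame] * (factor if onset else 1))
    return out
```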
6. Context and Significance
CR-DAgger reframes the classic DAgger approach for real-world contact-rich robotic manipulation. It does so by pairing kinesthetic, compliant human intervention—achieved through force-feedback and haptic transparency—with a lightweight, modular residual policy architecture. The system attains over 50 percentage-point improvement in complex manipulation tasks, requiring fewer than 50 human interventions without full policy restarts or exhaustive retraining. This methodology provides practical advances and guidelines for effective on-policy correction and incremental policy improvement in high-stakes, contact-dominated robotic applications (Xu et al., 20 Jun 2025).
A plausible implication is that, by decoupling corrections from policy execution and leveraging force-informed residual updates, CR-DAgger sets a precedent for scalable, sample-efficient human-in-the-loop learning in robotics where physical interaction is both frequent and delicate.