DreamControl-v2: Simpler and Scalable Autonomous Humanoid Skills via Trainable Guided Diffusion Priors

Published 31 Mar 2026 in cs.RO | (2604.00202v1)

Abstract: Developing robust autonomous loco-manipulation skills for humanoids remains an open problem in robotics. While RL has been applied successfully to legged locomotion, applying it to complex, interaction-rich manipulation tasks is harder given long-horizon planning challenges for manipulation. A recent approach along these lines is DreamControl, which addresses these issues by leveraging off-the-shelf human motion diffusion models as a generative prior to guide RL policies during training. In this paper, we investigate the impact of DreamControl's motion prior and propose an improved framework that trains a guided diffusion model directly in the humanoid robot's motion space, aggregating diverse human and robot datasets into a unified embodiment space. We demonstrate that our approach captures a wider range of skills due to the larger training data mixture and establishes a more automated pipeline by removing the need for manual filtering interventions. Furthermore, we show that scaling the generation of reference trajectories is important for achieving robust downstream RL policies. We validate our approach through extensive experiments in simulation and on a real Unitree-G1.


Authors (9)

Summary

The paper introduces a trainable guided diffusion model that operates directly in robot space, removing the inference-time retargeting and manual filtering that earlier pipelines required.
It leverages heterogeneous datasets and a transformer-based diffusion approach to generate physically plausible joint trajectories with precise spatio-temporal conditioning.
The generated reference trajectories support scalable RL policy training and sim-to-real transfer to a Unitree G1, with gains over OmniControl and the original DreamControl on tasks such as articulated-object interaction.

DreamControl-v2: Trainable Guided Diffusion Priors for Scalable Humanoid Skill Acquisition

Introduction and Motivation

Autonomous general-purpose loco-manipulation for humanoid robots remains a primary challenge, particularly when aiming for data efficiency and scalable generation of complex, interactive skills. Recent approaches like DreamControl combine off-the-shelf, human-centric diffusion priors with RL: the prior synthesizes plausible reference trajectories, and RL policies are trained to track them. However, such pipelines have three critical limitations: spatial constraints are imposed in the human motion domain, retargeting to the robot embodiment introduces misalignment, and manual trial-and-error calibration or filtering is needed to adapt to each target task. Together these limit scalability and generality.

DreamControl-v2 shifts both the architecture and the data: it trains a guided diffusion model directly in the robot's motion space using a heterogeneous mixture of human and robot motion data. This allows reference trajectories to be synthesized at scale, enables precise spatio-temporal conditioning in the robot configuration space, and removes the retargeting and manual intervention previously applied to generated motions. The result is a streamlined, automated pipeline for learning robust, transferable policies across a diverse range of physically plausible loco-manipulation skills.

Methodology

Pre-Retargeted, Heterogeneous Data Curation

DreamControl-v2 constructs a large-scale robot-space dataset by retargeting diverse human motion datasets (AMASS, HumanML3D, GRAB, Nymeria) and robot motion datasets (OmniRetarget) to the Unitree G1 humanoid skeleton. A shared set of kinematic keypoints is used to align human and robot configurations. The retargeting process involves optimization over joint angle correspondences, scale factors, and physical constraints (e.g., preventing foot slip, preserving contact moments). For non-SMPL data, a staged optimization projects motions into a unified representation via confidence-weighted minimization of keypoint discrepancies.
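
To illustrate the confidence-weighted keypoint objective mentioned above, the following minimal Python sketch scores how well a robot pose matches a scaled source pose. The function name, the single uniform scale factor, and the normalization by total confidence are assumptions made for illustration; the paper's actual objective, including its foot-slip and contact constraints, is not reproduced here.

```python
from typing import Sequence

Point = Sequence[float]  # (x, y, z) in meters


def weighted_keypoint_loss(
    source_kps: Sequence[Point],
    robot_kps: Sequence[Point],
    confidences: Sequence[float],
    scale: float = 1.0,
) -> float:
    """Confidence-weighted mean squared distance between two keypoint sets.

    Hypothetical helper: each source keypoint is scaled toward robot
    proportions, compared with its robot counterpart, and weighted by a
    per-keypoint confidence in [0, 1].
    """
    if not (len(source_kps) == len(robot_kps) == len(confidences)):
        raise ValueError("keypoint and confidence lists must have equal length")
    total_weight = sum(confidences)
    if total_weight <= 0.0:
        raise ValueError("at least one keypoint needs positive confidence")

    loss = 0.0
    for src, rob, conf in zip(source_kps, robot_kps, confidences):
        sq_dist = sum((scale * s - r) ** 2 for s, r in zip(src, rob))
        loss += conf * sq_dist
    return loss / total_weight


# Two shared keypoints; the second one is off by 1 m along y.
source = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
robot = [(0.5, 0.0, 0.0), (0.0, 1.0, 0.0)]
equal_conf_loss = weighted_keypoint_loss(source, robot, [1.0, 1.0])
low_conf_loss = weighted_keypoint_loss(source, robot, [1.0, 0.0])
```

In a full retargeter, a loss of this kind would be minimized over robot joint angles (through forward kinematics) subject to the physical constraints. Setting the confidence of an unreliable keypoint to zero, as in `low_conf_loss`, removes its contribution to the fit.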

Crucially, the pipeline includes rigorous automatic filtering for physical feasibility (e.g., discarding unexecutable or context-dependent trajectories) and task-specific selection via natural language annotation search. No manual parameter tuning or post-hoc retargeting is performed—the training corpus is robot-ready.

Guided Diffusion in Robot Space

Using the retargeted corpus, DreamControl-v2 trains a transformer-based diffusion model parameterized in a canonicalized, root-relative pose space shared with human motion models. The model is conditioned jointly on open-vocabulary textual prompts and flexible spatio-temporal control signals that specify 3D positions for any subset of joints at arbitrary times. A realism guidance module, inserted into the network in a manner similar to ControlNet, enhances physical plausibility by modulating the transformer's internal state based on the constraints, which reduces artifacts such as foot skating.
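
A spatio-temporal control signal of this kind can be thought of as a sparse set of (frame, joint) targets plus a mask marking which entries are constrained. The sketch below shows one possible encoding in plain Python; the joint names, frame count, and dense target-plus-mask layout are illustrative assumptions, not the paper's data format.

```python
from typing import Dict, List, Sequence, Tuple

Position = Tuple[float, float, float]


def build_control_signal(
    num_frames: int,
    joint_names: Sequence[str],
    constraints: Dict[Tuple[int, str], Sequence[float]],
) -> Tuple[List[List[Position]], List[List[bool]]]:
    """Expand sparse (frame, joint) -> xyz targets into dense targets and a mask.

    Unconstrained entries keep a zero placeholder and a False mask value,
    so a model can ignore them.
    """
    index = {name: j for j, name in enumerate(joint_names)}
    targets = [[(0.0, 0.0, 0.0) for _ in joint_names] for _ in range(num_frames)]
    mask = [[False for _ in joint_names] for _ in range(num_frames)]

    for (frame, joint), pos in constraints.items():
        if not 0 <= frame < num_frames:
            raise ValueError(f"frame {frame} is outside [0, {num_frames})")
        if joint not in index:
            raise ValueError(f"unknown joint: {joint}")
        if len(pos) != 3:
            raise ValueError("each target must be an (x, y, z) triple")
        targets[frame][index[joint]] = (float(pos[0]), float(pos[1]), float(pos[2]))
        mask[frame][index[joint]] = True
    return targets, mask


# Hypothetical joint names and targets, chosen only for the example.
joints = ["pelvis", "left_wrist", "right_wrist"]
targets, mask = build_control_signal(
    num_frames=60,
    joint_names=joints,
    constraints={
        (30, "right_wrist"): (0.4, -0.2, 0.9),  # reach pose mid-sequence
        (59, "pelvis"): (1.0, 0.0, 0.75),       # final root position
    },
)
num_constrained = sum(sum(row) for row in mask)
```

Only two of the 180 (frame, joint) entries are constrained here, which reflects the "any subset of joints at arbitrary times" conditioning described above.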

The diffusion model outputs physically plausible joint-space trajectories in the G1 embodiment, satisfying both high-level semantics and precise spatial objectives without retargeting or manual correction.

Physics-based RL and Policy Training

RL policies are trained to maximize both task reward and trajectory imitation (tracking) rewards, formulated with respect to the generated reference trajectories. The policies operate in the robot's state and action space and receive only proprioceptive and local object state observations, supporting robust sim-to-real transfer. The action space includes all relevant robot joints and binary hand control, matching the downstream deployment setup.
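
The combination of a task reward with a tracking reward can be sketched per time step as follows. The exponential tracking kernel, the weights, and all names are assumptions chosen for illustration, since this summary does not give the paper's actual reward terms.

```python
import math
from typing import Sequence


def tracking_reward(
    joint_pos: Sequence[float],
    ref_joint_pos: Sequence[float],
    sharpness: float = 2.0,
) -> float:
    """Return a value in (0, 1] that is 1 when the robot matches the reference.

    Uses exp(-sharpness * mean squared joint error), a common shaping
    choice for imitation-style rewards (assumed here, not taken from the paper).
    """
    if not joint_pos or len(joint_pos) != len(ref_joint_pos):
        raise ValueError("joint vectors must be non-empty and of equal length")
    mse = sum((q - q_ref) ** 2 for q, q_ref in zip(joint_pos, ref_joint_pos))
    mse /= len(joint_pos)
    return math.exp(-sharpness * mse)


def total_reward(
    task_reward: float,
    joint_pos: Sequence[float],
    ref_joint_pos: Sequence[float],
    w_task: float = 1.0,
    w_track: float = 0.5,
) -> float:
    """Weighted sum of the task reward and the tracking reward."""
    return w_task * task_reward + w_track * tracking_reward(joint_pos, ref_joint_pos)


reference = [0.0, 0.5, -0.3]  # reference joint angles (rad) at one time step
on_track = total_reward(1.0, reference, reference)
off_track = total_reward(1.0, [0.4, 0.1, 0.2], reference)
```

With the same task reward, the step that deviates from the reference (`off_track`) scores lower than the step that matches it (`on_track`), which is the pressure that keeps the policy close to the generated trajectory while it solves the task.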

Automated filtering and post-optimization ensure all reference trajectories are physically feasible and satisfy task requirements prior to policy training, further improving RL sample efficiency and stability.
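
As a rough sketch of what an automated feasibility filter could check, the code below rejects reference trajectories that violate joint position limits or exceed a per-joint speed limit. These two criteria, the limit values, and the function names are assumptions for illustration; the paper's filtering and post-optimization rules are not specified in this summary.

```python
from typing import List, Sequence

Trajectory = List[List[float]]  # frames x joints, joint angles in rad


def is_feasible(
    trajectory: Trajectory,
    lower: Sequence[float],
    upper: Sequence[float],
    max_speed: Sequence[float],
    dt: float,
) -> bool:
    """Check joint position limits and finite-difference joint speed limits."""
    for frame in trajectory:
        for q, lo, hi in zip(frame, lower, upper):
            if not lo <= q <= hi:
                return False
    for prev, curr in zip(trajectory, trajectory[1:]):
        for q0, q1, v_max in zip(prev, curr, max_speed):
            if abs(q1 - q0) / dt > v_max:
                return False
    return True


def filter_references(
    trajectories: Sequence[Trajectory],
    lower: Sequence[float],
    upper: Sequence[float],
    max_speed: Sequence[float],
    dt: float,
) -> List[Trajectory]:
    """Keep only the trajectories that pass the feasibility check."""
    return [t for t in trajectories if is_feasible(t, lower, upper, max_speed, dt)]


# Illustrative limits for a two-joint toy robot.
lower, upper, max_speed, dt = [-1.0, -1.0], [1.0, 1.0], [2.0, 2.0], 0.1
smooth = [[0.0, 0.0], [0.1, 0.05], [0.2, 0.1]]
jumpy = [[0.0, 0.0], [0.5, 0.0]]      # 5 rad/s on joint 0, too fast
out_of_range = [[0.0, 1.5]]           # joint 1 beyond its upper limit
kept = filter_references([smooth, jumpy, out_of_range], lower, upper, max_speed, dt)
```

A real pipeline would likely add task-specific checks (for example, that a hand target is actually reached), but the pattern of rejecting samples before RL training stays the same.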

Sim2Real and Deployment

The pipeline is validated on the Unitree-G1 platform with all real-world task deployments strictly following the automated pipeline. Eight core skills—including articulated-object manipulation and bimanual tasks—are demonstrated in both simulation and physical robot execution, without reliance on teleoperation or additional human intervention.

Experimental Analysis

Data Scaling and Generalization

Empirical evaluation across multiple data mixtures (from AMASS-only to a full heterogeneous mix) demonstrates that increasing the diversity and task coverage of the training corpus systematically improves generation fidelity and downstream policy performance. FID drops monotonically, R-Precision for text-motion alignment rises, and the physical plausibility (control L2, skating ratio) of trajectories improves markedly as more interaction-rich and robot-refined trajectories are included. The Full-Mix model exhibits broad generalization: on AMASS test splits, it retains original distributional performance; on novel task distributions (Nymeria, GRAB, OmniRetarget), it far outperforms models trained solely on AMASS.
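
For readers unfamiliar with the two plausibility metrics, the sketch below computes them under one common convention from the motion-generation literature: skating ratio as the fraction of frame transitions in which a foot stays near the ground yet slides horizontally, and control L2 as the mean distance between generated and target joint positions at constrained entries. The thresholds and function names are illustrative assumptions, and the paper's exact evaluation protocol may differ.

```python
import math
from typing import List, Sequence, Tuple

Position = Tuple[float, float, float]  # (x, y, z) in meters, z up


def skating_ratio(
    foot_positions: Sequence[Position],
    height_thresh: float = 0.05,
    slide_thresh: float = 0.025,
) -> float:
    """Fraction of frame transitions where a grounded foot slides horizontally."""
    if len(foot_positions) < 2:
        return 0.0
    skating = 0
    for prev, curr in zip(foot_positions, foot_positions[1:]):
        in_contact = prev[2] < height_thresh and curr[2] < height_thresh
        slide = math.hypot(curr[0] - prev[0], curr[1] - prev[1])
        if in_contact and slide > slide_thresh:
            skating += 1
    return skating / (len(foot_positions) - 1)


def control_l2(
    generated: List[List[Position]],
    targets: List[List[Position]],
    mask: List[List[bool]],
) -> float:
    """Mean Euclidean error over the (frame, joint) entries marked in mask."""
    errors = [
        math.dist(generated[t][j], targets[t][j])
        for t in range(len(mask))
        for j in range(len(mask[t]))
        if mask[t][j]
    ]
    return sum(errors) / len(errors) if errors else 0.0


# One foot over four frames: slides on the ground once, then lifts off.
foot = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.2)]
ratio = skating_ratio(foot)

# One joint over two frames; only the second frame is constrained.
gen = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
tgt = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.3)]]
err = control_l2(gen, tgt, [[False], [True]])
```

Lower is better for both quantities: a lower skating ratio means fewer sliding contacts, and a lower control L2 means the generated motion hits its spatial targets more closely.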

Diffusion Model and RL Policy Scaling

Systematic scaling of the number of generated reference trajectories shows direct gains in RL policy performance and generalization to unseen object placements and task configurations. The success ratio increases with the number of sampled trajectories, with diminishing returns as the coverage saturates the task distribution.

DreamControl-v2 policies surpass both zero-shot human-domain baselines (OmniControl) and previous manual retargeting pipelines (original DreamControl), particularly on tasks involving articulated object interaction and complex, multi-contact behaviors that were outside the human dataset domain.

Pipeline Simplification and Fine-tuning

Direct training of the diffusion prior in robot space eliminates costly inference-time spatial condition calibration, which is shown to be highly sample-inefficient and error-prone. Pre-retargeted data and model initialization using human motion priors (MDM) yield superior sample efficiency and final performance compared to training from scratch, confirming the utility of transfer learning from human datasets to humanoid control.

Implications and Future Directions

DreamControl-v2 provides evidence that large-scale guided diffusion models trained on pre-retargeted data in the robot's embodiment space are a practical route to scalable autonomous skill acquisition on high-DOF humanoids. In the reported experiments, the approach outperforms human-space generative pipelines, removes manual engineering bottlenecks, and supports robust RL training and sim-to-real transfer across physically diverse and interactive tasks.

Theoretical Implications

The study shows that unified embodiment parameterization for humans and robots enables effective transfer and extension of generative models.
Spatio-temporal guided diffusion priors serve as powerful, scalable RL supervision sources for long-horizon, contact-rich tasks, bridging the gap between imitation and open-ended task synthesis.

Practical Implications

The pipeline provides a pathway for automated expansion of skill libraries with minimal engineering.
Robust, zero-shot deployment of new tasks (given textual and spatial constraints) is enabled by the generalization capacity of the model.
The framework is immediately compatible with larger or more diverse data sources (including teleoperation, video-based pose extraction, and simulation-generated data).

Future Research

Future lines of inquiry may include:

Extending the guided diffusion prior to multi-robot, multi-object scenes.
Incorporating closed-loop, vision-based feedback into both generation and policy conditioning, drawing on recent advances in large multimodal models (LMMs) and vision-language-action (VLA) models.
Exploring active data collection pipelines, where failed or underexplored trajectories trigger automatic data augmentation and re-training.
Integrating task-oriented curriculum learning directly within the diffusion sampling or RL optimization process.

Conclusion

DreamControl-v2 reformulates autonomous humanoid skill acquisition around data and model design, demonstrating that training guided diffusion priors in the robot's motion space improves trajectory quality, task generalization, and system automation. The work provides an extensible architecture for scalable loco-manipulation, narrowing the gap between generative modeling and practical, deployable, whole-body autonomous control (2604.00202).
