- The paper presents a unified framework (FAST) that enables robust humanoid whole-body control through large-scale pretraining and rapid adaptation.
- It leverages a Mixture-of-Experts architecture with CoM-aware design and integrates Parseval regularization with KL constraints to maintain stability during adaptation.
- Experimental results in simulation and on a Unitree G1 robot demonstrate superior long-horizon tracking, stability, and retention of source-domain performance.
FAST: A Unified Framework for General Humanoid Whole-Body Control via Pretraining and Fast Adaptation
Problem Statement and Motivation
The challenge of general whole-body control (WBC) for humanoid robots lies in executing diverse, coordinated whole-body motions with robustness to substantial distribution shifts encountered in realistic deployments. Classical approaches, often anchored in task-specific reward engineering or limited kinematic datasets, suffer performance degradation when tracking high-dynamic or out-of-distribution (OOD) motions, particularly those generated from low-fidelity modalities such as monocular video, teleoperation, or text-to-motion pipelines. Deployment constraints on inference latency and hardware further limit the practicality of foundation model scaling as a solution. FAST directly targets these limitations by explicitly designing for robust zero-shot generalization and rapid, stable adaptation to new or noisy motion distributions.
Methodological Innovations
The FAST pipeline is composed of three interdependent stages:
- Curated Motion Dataset Construction: Human-to-humanoid retargeting is performed on diverse motion capture datasets (AMASS, OMOMO, in-house MoCap), incorporating substantial data augmentation through global velocity perturbation and lower-limb configuration variability. Auxiliary physical signals—contact masks, Center-of-Mass (CoM), and Center-of-Pressure (CoP)—are integrated to reinforce physical plausibility and stability cues.
- Pretraining a Mixture-of-Experts Whole-Body Controller with CoM-Aware Control: The policy is a Mixture-of-Experts (MoE) MLP with a gating network, which lets experts specialize in different dynamic regimes while maintaining global coordination. The CoM-aware design augments observations with CoM/CoP signals and combines adaptive tracking rewards with explicit stability terms, so that the controller gives up strict tracking in favor of physical stability when a reference is aggressive or physically inconsistent. In effect, the imitation objective is relaxed for references that cannot be executed, which increases robustness in practical deployments.
- Parseval-Guided Residual Policy Adaptation: For rapid adaptation, a lightweight residual delta policy (MLP) is introduced atop the frozen base policy, with adaptation occurring exclusively in the residual head. Adaptation is regularized by:
  - Parseval Regularization: Enforces near-orthogonality in feature directions, bounding sensitivity to input perturbations and enabling smoother gradients, thus sustaining stable, sample-efficient optimization during fast adaptation.
  - KL-Constrained Policy Update: Maintains distributional proximity to the base policy, mitigating catastrophic forgetting and preserving prior capabilities across OOD adaptation.
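The two regularizers in the adaptation stage can be sketched in a few lines. This is a minimal illustration, not the paper's implementation. It assumes the common Parseval penalty `||W W^T - I||_F^2` on each residual weight matrix, diagonal Gaussian policies, a KL taken from the adapted policy to the frozen base policy, and an additive residual on the action mean. The function names and the coefficients `parseval_coef` and `kl_coef` are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def parseval_penalty(w):
    """Squared Frobenius norm of W W^T - I; zero when the rows of W are orthonormal."""
    gram = matmul(w, transpose(w))
    n = len(gram)
    return sum(
        (gram[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(n)
        for j in range(n)
    )

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) for diagonal Gaussians, summed over action dimensions."""
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q)
    )

def adaptation_loss(task_loss, residual_weights, mu_base, mu_delta, std,
                    parseval_coef=1e-3, kl_coef=1e-2):
    """Task loss plus both regularizers; the coefficients are illustrative only."""
    # Assumed form of the residual policy: adapted mean = frozen base mean + delta.
    mu_adapted = [b + d for b, d in zip(mu_base, mu_delta)]
    reg = sum(parseval_penalty(w) for w in residual_weights)
    kl = gaussian_kl(mu_adapted, std, mu_base, std)
    return task_loss + parseval_coef * reg + kl_coef * kl
```

With a zero residual and orthonormal weights both penalties vanish, so the loss reduces to the task loss. Only the residual head's weights would receive gradients from this objective, since the base policy stays frozen.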
The paper supports these choices with formal proofs: a bound on the Lipschitz constant of the adapted policy, which guarantees smoothness under input perturbations, and a quantitative bound on how far the adapted policy can deviate from the base policy under the KL constraint.
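The flavor of such bounds can be illustrated with a standard argument. This is a sketch under our own assumptions and not a reproduction of the paper's proofs. It assumes the residual head $f$ is an $L$-layer MLP with 1-Lipschitz activations, each weight matrix satisfies $\|W_\ell W_\ell^\top - I\|_F \le \varepsilon$, and both policies are Gaussian with shared covariance $\sigma^2 I$ and means differing by $\Delta\mu$. The symbols $\varepsilon$, $\delta$, $\sigma$, and $\Delta\mu$ are introduced here for illustration.

```latex
% Near-orthogonal weights bound each layer's spectral norm,
% and hence the Lipschitz constant of the whole MLP.
\begin{align*}
\|W_\ell\|_2^2 = \|W_\ell W_\ell^\top\|_2
  &\le \|I\|_2 + \|W_\ell W_\ell^\top - I\|_F \le 1 + \varepsilon, \\
\|f(x) - f(x')\|_2
  &\le \Big(\prod_{\ell=1}^{L} \|W_\ell\|_2\Big)\,\|x - x'\|_2
   \le (1+\varepsilon)^{L/2}\,\|x - x'\|_2 .
\end{align*}
% A KL budget delta bounds the shift of the action mean.
\begin{align*}
D_{\mathrm{KL}}\big(\pi_{\mathrm{adapt}} \,\|\, \pi_{\mathrm{base}}\big)
  = \frac{\|\Delta\mu\|_2^2}{2\sigma^2} \le \delta
  \;\Longrightarrow\;
  \|\Delta\mu\|_2 \le \sigma\sqrt{2\delta}.
\end{align*}
```

The first bound shows why a small Parseval penalty keeps the residual's sensitivity to observation noise close to 1 per layer. The second shows that a KL budget translates directly into a cap on how far the adapted action mean can move from the base policy's.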
Figure 1: FAST architecture overview—curated data construction, MoE+CoM-Aware pretraining, and Parseval-guided fast adaptation pipeline.
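The MoE controller from the pretraining stage can be sketched as follows. This is a toy forward pass under stated assumptions, not the paper's architecture: it uses dense soft gating (every expert is evaluated and outputs are mixed by softmax weights), two-layer ReLU experts, and random initial weights. The class name `MoEPolicy`, the layer sizes, and the expert count are hypothetical.

```python
import math
import random

def linear(w, b, x):
    """Affine map: w is a list of rows, b is the bias vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def init_linear(rng, n_out, n_in, scale=0.1):
    """Random weights and zero bias for one layer."""
    w = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

class MoEPolicy:
    """Gating network plus a set of small expert MLPs sharing one observation."""

    def __init__(self, obs_dim, act_dim, hidden=16, n_experts=4, seed=0):
        rng = random.Random(seed)
        self.act_dim = act_dim
        self.gate = init_linear(rng, n_experts, obs_dim)
        self.experts = [
            (init_linear(rng, hidden, obs_dim), init_linear(rng, act_dim, hidden))
            for _ in range(n_experts)
        ]

    def gate_weights(self, obs):
        """Softmax mixing weights, one per expert."""
        return softmax(linear(*self.gate, obs))

    def act(self, obs):
        """Gate-weighted sum of expert outputs."""
        action = [0.0] * self.act_dim
        for g, (layer1, layer2) in zip(self.gate_weights(obs), self.experts):
            hidden = [max(0.0, v) for v in linear(*layer1, obs)]  # ReLU
            out = linear(*layer2, hidden)
            action = [a + g * o for a, o in zip(action, out)]
        return action
```

In a CoM-aware setup the observation vector passed to `act` would concatenate proprioception, the reference motion, and the CoM/CoP signals, so that the gate can route by stability cues as well as by the motion itself.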
Experimental Evaluation
Quantitative experiments are conducted in MuJoCo and IsaacLab simulators and validated on a Unitree G1 humanoid platform. Comprehensive metrics include task success rate, global/pose/keypoint errors, global root velocity error, mean CoM–CoP distance (as a measure of balance), and slippage.
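As an illustration of the balance metric, the mean CoM–CoP distance could be computed as below. This is a sketch that assumes the metric is the Euclidean distance in the horizontal plane between the CoM ground projection and the CoP, averaged over frames. The paper's exact definition is not given here, and the function name is hypothetical.

```python
import math

def mean_com_cop_distance(com_xy, cop_xy):
    """Mean horizontal distance between the CoM ground projection and the CoP.

    Both arguments are sequences of (x, y) pairs, one pair per frame.
    """
    if len(com_xy) != len(cop_xy) or not com_xy:
        raise ValueError("need two non-empty trajectories of equal length")
    dists = [
        math.hypot(cx - px, cy - py)
        for (cx, cy), (px, py) in zip(com_xy, cop_xy)
    ]
    return sum(dists) / len(dists)
```

For example, `mean_com_cop_distance([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])` returns `2.5`. Smaller values mean the CoM stays closer to the point where the ground reaction force acts, which is read here as better balance.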
- Generalization and Robustness: On both in-distribution (AMASS) and OOD (MotionX) datasets, FAST achieves the highest success rates, with pronounced improvements in long-horizon tracking and global stability over GMT and TWIST2. Notably, although FAST does not always achieve the lowest mean per-joint position error (MPJPE), it consistently yields lower global position errors, which suggests that it favors globally consistent, physically plausible motion over per-joint accuracy.
- Fast Adaptation and Retention: In adaptation benchmarks (LaFan1/MotionX as targets, AMASS as source), FAST converges rapidly, achieves the best overall target-domain metrics, and retains source-domain performance substantially better than naive fine-tuning or unregularized residual adaptation. Ablations indicate that Parseval and KL regularization are both needed to balance adaptation speed against preservation of the source policy.

Figure 2: Fast adaptation performance on LaFan1 and MotionX, with concurrent preservation of source capability on AMASS.
Theoretical and Practical Implications
The combination of mixture-of-experts architectures, CoM/CoP-augmented policy design, and structured residual adaptation with orthogonalization/KL regularization establishes a principled template for scalable, robust humanoid control. By minimizing policy drift and ensuring smooth adaptation, FAST circumvents the primary failure modes often observed in naive fine-tuning and large-scale pretraining pipelines, particularly when faced with high-dimensional, low-quality, or adversarially perturbed motion distributions.
From a theoretical perspective, FAST demonstrates that regularized residual architectures can tightly bound sensitivity to data distribution shifts and parameter perturbations—critical for safe real-world deployment. Practically, the framework is immediately extensible to online and continual learning domains and robust teleoperation, especially in scenarios with unreliable input modalities.
Directions for Future Research
Key future directions involve:
Conclusion
FAST presents a robust, efficient methodology for general humanoid whole-body control, unifying large-scale, physically grounded pretraining with stability-constrained fast adaptation. Through ablations and real-world experiments, it demonstrates greater robustness and adaptability than state-of-the-art universal trackers, an important step toward practical deployment of capable, resilient humanoid systems.