Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

Published 17 Feb 2026 in cs.RO, cs.AI, cs.LG, and eess.SY | (2602.15827v1)

Abstract: While recent advances in humanoid locomotion have achieved stable walking on varied terrains, capturing the agility and adaptivity of highly dynamic human motions remains an open challenge. In particular, agile parkour in complex environments demands not only low-level robustness, but also human-like motion expressiveness, long-horizon skill composition, and perception-driven decision-making. In this paper, we present Perceptive Humanoid Parkour (PHP), a modular framework that enables humanoid robots to autonomously perform long-horizon, vision-based parkour across challenging obstacle courses. Our approach first leverages motion matching, formulated as nearest-neighbor search in a feature space, to compose retargeted atomic human skills into long-horizon kinematic trajectories. This framework enables the flexible composition and smooth transition of complex skill chains while preserving the elegance and fluidity of dynamic human motions. Next, we train motion-tracking reinforcement learning (RL) expert policies for these composed motions, and distill them into a single depth-based, multi-skill student policy, using a combination of DAgger and RL. Crucially, the combination of perception and skill composition enables autonomous, context-aware decision-making: using only onboard depth sensing and a discrete 2D velocity command, the robot selects and executes whether to step over, climb onto, vault or roll off obstacles of varying geometries and heights. We validate our framework with extensive real-world experiments on a Unitree G1 humanoid robot, demonstrating highly dynamic parkour skills such as climbing tall obstacles up to 1.25m (96% robot height), as well as long-horizon multi-obstacle traversal with closed-loop adaptation to real-time obstacle perturbations.

Abstract PDF Upgrade to Chat

Summary

The paper presents a modular pipeline that integrates motion matching and RL-based distillation to chain dynamic human parkour skills.
It details a teacher-student framework where privileged RL experts train on simulated trajectories before transferring policies to real-world hardware.
Results demonstrate robust sim-to-real transfer with agile performance on complex maneuvers, reducing reliance on handcrafted transitions.

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

Overview and Motivation

"Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching" (2602.15827) presents a comprehensive pipeline for enabling humanoid robots to autonomously perform highly dynamic, whole-body parkour using only onboard depth sensing. The framework addresses long-standing challenges in humanoid control: robust execution of contact-rich maneuvers, long-horizon skill composition, and perceptually driven adaptation to real-world variability. The principal innovation lies in leveraging motion matching to compose demonstrations of atomic, human-like parkour skills into long-horizon kinematic references, serving as training signals for an RL-based teacher-student architecture. This modular strategy bypasses the need for laborious reward shaping while boosting generalization and transfer to hardware, culminating in robust and adaptive performance in real-world environments.

Figure 1: Overview of the PHP system, showing modular skill composition, simulation-trained RL experts, hybrid imitation/RL distillation, and zero-shot sim-to-real deployment with onboard perception.

Motion Matching for Long-Horizon Parkour Composition

The foundation of the system is a scalable kinematic skill composition module based on motion matching. Unlike prior frameworks that rely on reward-shaped monolithic policies or handcrafted transitions, motion matching operates by stitching together atomic parkour skills (retargeted human motions) via a nearest-neighbor feature search. Given a 2D velocity command and the robot’s current state, the system retrieves from a motion library the most contextually compatible next-frame, enabling seamless transitions across approach distances, timings, and obstacle configurations.

This process elegantly densifies the sparse sample distribution typical of difficult human motions, synthesizing diverse trajectories that cover heterogeneous approaches and contact timings even in low-data settings. The design naturally supports dynamic terrain randomization and exposure to task-relevant variations during training, which is indispensable for generating robust, adaptive policies.

Figure 2: Example synthesized parkour skill variations, illustrating adaptability to different approaches, speeds, directions, and randomized terrain geometries.

Teacher-Student Training Framework

To translate motion-matching trajectories into control policies, the system employs a two-stage RL-based training paradigm. For each trajectory, privileged-information RL experts are trained to track the composed kinematics in simulation, leveraging full-state observability for efficient learning and error correction. These single-skill experts are then distilled into a unified, multi-skill perceptive student policy.

The distillation process augments the classical DAgger objective with task-level PPO rewards, addressing a key limitation of pure imitation: per-step loss fails to incentivize the high-torque, temporally precise actions required for maneuvers such as vaults and wall climbs. The hybrid objective ensures both behavioral fidelity and episodic task success, with a curriculum to anneal from behavior cloning to RL-driven exploitation. Depth-based visual inputs are subjected to heavy domain randomization (camera noise, latency perturbations, extrinsic sampling), further ensuring sim-to-real robustness.

Real-World Demonstrations and Quantitative Performance

Zero-shot deployment of the student policy on a Unitree G1 humanoid demonstrates compelling real-world performance. The robot autonomously executes complex chains of parkour skills, including high wall climbs (1.25 m, 96% of robot height), diverse vaults, and reactive traversals of multi-obstacle courses with online adaptation to real-time terrain changes and obstacle perturbations.

Notably, the system enables smooth transitions between skills and demonstrates closed-loop adaptation capabilities, e.g., by correcting approach when obstacles are suddenly displaced. These results are strong numerical evidence of robust transfer and agility, with trajectory times and outcomes comparable to human experts on identical tasks.

Figure 3: Hardware demonstration of high-climb agility, with the robot climbing a 1.25 m wall in 3.63 seconds.

Figure 4: Additional hardware results—cat vault, landing from height, and extended parkour with real-time adaptation in complex terrain.

Ablative and Comparative Studies

Experiments systematically ablate motion-matching density, network capacity, and the role of RL in distillation. Results indicate:

Motion matching density is critical: Reducing approach diversity in reference data sharply degrades success rates, especially in timing-sensitive skills.
DAgger-only distillation is insufficient: Without RL, policies lack the episodic correction required for high-dynamic maneuvers; stalled or incomplete executions dominate.
Scalability: Increased model size and rollout environments improve performance, especially on more complex skills.
Robustness of RL reward design: Even coarse alive/progress rewards suffice during distillation, as RL supplements DAgger by favoring skill completion.

Compared to strong baselines (vanilla reward-shaped RL, uncomposed data, AMP), the PHP framework establishes a substantial performance gap on tasks involving high obstacles and fast transitions. The AMP baseline, in particular, fails to execute agile contact-rich behaviors reliably.

Implications and Future Directions

The PHP framework delivers a modular and scalable recipe for transferring human-like dynamic skill composition to perception-driven humanoid control. The results underscore three central insights:

Densifying low-sample expert motion data via motion matching is necessary and sufficient for robust, whole-body skill chaining.
RL-based distillation is essential for overcoming imitation learning’s myopia in highly dynamic domains.
Curriculum-driven blending of imitation and RL maintains robustness without sacrificing kinematic fidelity.

Theoretically, the work bridges the state-distribution mismatch in long-horizon, multi-skill settings by exposing visuomotor policies to synthesised, context-diverse reference distributions. Practically, it validates scalable sim-to-real transfer for intricate humanoid tasks, reducing reliance on manual reward engineering and handcrafted transitions.

Looking forward, several avenues emerge:

Incorporation of semantic scene understanding (potentially via language-conditioned policies) could further enhance the richness of behaviors and contextual reasoning.
Hardware limitations (e.g., hand actuation, sensor coverage) are bottlenecks for extending to more extreme maneuvers such as bar swings or overheight climbs.
Incorporating broader diversity in the motion database or leveraging generative kinematic models for transition synthesis could further generalize the approach.

Conclusion

"Perceptive Humanoid Parkour" synthesizes a strong, scalable pipeline for autonomous, perception-driven whole-body parkour in humanoid robots. By integrating motion matching, RL-based motion tracking, and hybrid DAgger-RL distillation, the framework achieves robust policy transfer and adaptive long-horizon behavior, validated through state-of-the-art results in both simulation and hardware. The work sets a new standard for dynamic skill chaining and suggests promising directions for future research in scalable, data-efficient humanoid control.

Markdown

Paper to Video (Beta)

All Videos Create Your Own

Whiteboard

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Glossary

off on

Practical Applications

off on

Conceptual Simplification

off on

Explain it Like I'm 14

Overview

This paper shows how a human-shaped robot (a humanoid) can do parkour—fast, agile moves like vaulting over obstacles and climbing tall walls—by seeing the world with its own camera and choosing the right skills on the fly. The team built a system called Perceptive Humanoid Parkour (PHP) that lets one robot “brain” smoothly chain many human-style moves together and adapt to obstacles it sees in front of it.

What did the researchers want to find out?

They wanted to make humanoid robots:

Move with human-like agility, not just walk or step slowly.
Decide which skill to use (step, climb, vault, roll) based on what the robot’s camera sees.
Stitch many short, dynamic moves into a long, smooth run without awkward pauses.
Learn in simulation and then work in the real world without extra fine-tuning.

How did they do it?

The approach combines ideas from animation, teaching, and trial-and-error learning. Here’s the recipe, explained in everyday terms.

Teaching the robot human moves with “motion matching”

Think of “motion matching” like editing a highlight reel: you have a library of short human motion clips (tiny video-like snippets of specific skills), and you pick the next clip that best fits what you’re doing now and where you want to go.
The robot uses this to plan long, smooth sequences: run → do a parkour skill → run again. Running acts like a universal “connector” between skills.
Even if the original human recordings are short and scarce, motion matching mixes and stitches them into many different transitions (for example, entering a vault from different speeds or foot positions). This gives the robot lots of practice examples with varied approaches and timings.

Training “expert teachers,” then a single “student”

First, they trained several “expert” controllers in simulation—each expert focuses on tracking one composed motion extremely well (like a perfect coach for climbing or vaulting). These experts get extra state information that the real robot wouldn’t have; this is called “privileged” info.
Next, they taught one “student” controller that works with the robot’s actual sensors (a depth camera) to imitate the experts across all skills.
They used two learning styles together:
- Imitation learning (DAgger): the student copies the experts’ actions, like following a teacher’s demonstrations.
- Reinforcement learning (PPO): the student also gets rewards for success, pushing it to do what actually works, not just what looks close to the expert.
Why both? Copying alone can make the student too cautious. Rewards help it commit to powerful, well-timed actions (like the brief, strong pull needed to climb a wall) that make the move succeed.

Giving the robot “eyes” and making it transfer to reality

The robot uses a depth camera, which measures how far things are—like a dot-based 3D view of obstacles.
They carefully simulated the camera’s position, noise, and delay to match the real hardware, so the student trained in simulation could work on the real robot “zero-shot” (without extra fine-tuning).

What did they discover?

The system worked on a real Unitree G1 humanoid robot and achieved strong results:

Human-level agility on a tall wall:
- The robot climbed onto a 1.25 m wall (about 96% of its height) in about 3.6 seconds from takeoff—comparable timing to a trained human doing the same sequence.
High-speed vaults and dynamic moves:
- The robot performed vaults at roughly 3 m/s, clearing obstacles quickly while keeping forward speed for continued running.
Long, continuous runs with smooth transitions:
- It completed a 48–60 second parkour course with multiple obstacles, switching skills smoothly while running.
Real-time adaptation:
- If obstacles were moved mid-run (about 0.5 m shifts), the robot adjusted its approach and timing and continued the traversal.
Strong performance compared to baselines in simulation:
- Methods that relied only on reward shaping or only on raw motion clips struggled on higher obstacles.
- The PHP approach had very high success rates across different heights and speeds, especially on the most dynamic skills (vaults and climbs).

Why does this matter?

This research shows a practical path to truly agile humanoid movement:

It solves a common data problem: dynamic human motion clips are short and scarce. Motion matching turns them into many usable, long, varied sequences without needing massive new datasets.
It blends perception and decision-making: the robot doesn’t just run a script; it sees obstacles and chooses the right skill in the moment.
It scales to many skills: combining expert “teachers” with a single “student” controller means one brain can do many complex moves smoothly.
It transfers from simulation to reality: careful camera modeling and training design make the jump to hardware possible without extra fine-tuning.

In the future, this could help humanoid robots handle complex, cluttered spaces—like warehouses, disaster zones, or construction sites—where fast, whole-body agility and smart, perception-driven decisions are essential. Adding better sensors and grippers, or even language guidance, could expand to more advanced climbing and richer scene understanding, bringing robots closer to human-level movement in the real world.

View Paper Prompt View All Prompts

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored in the paper, framed as concrete questions or limitations to guide future research:

Limited semantic understanding: the policy relies on depth-only geometry without semantic reasoning (e.g., distinguishing edges, rails, bars, or affordances). How to integrate semantics (segmentation, affordance detection) or language-conditioned goals to drive skill selection and style?
Short-range, narrow-FOV perception at high speeds: obstacles may not be visible early enough for safe commitment. What perception stacks (multi-camera, panoramic fisheye, stereo/LiDAR fusion, event cameras) and look-ahead planners are needed to increase preview distance?
No global mapping or memory: decisions are based on instantaneous frames; there is no SLAM or temporal memory to reason about occlusions or future landing zones. Can memory-augmented policies or map-based planning improve long-horizon decision-making?
Hardware limitations in hands/grippers: current end-effectors limit edge-, rail-, or bar-based maneuvers (e.g., hangs, dynos, over-height climbs). What gripper designs and controllers enable reliable high-force, high-rate hand contacts?
Generalization beyond the motion library: the skill set is limited to a few parkour primitives and retargeted clips. How to scale to dozens or hundreds of skills, and to unseen obstacles and tasks not represented in the library?
Manual annotation burden: skill start/end frames and pre-skill “entry windows” are manually specified. Can automated phase/contact detection learn entry/exit points and windows from data?
Hand-crafted motion-matching features: nearest-neighbor retrieval uses designed features and fixed windows. Would learned embeddings, differentiable retrieval, or terrain-conditioned retrieval improve transition quality and coverage under sparse data?
Composition via locomotion only: transitions are constrained to “locomotion → skill → locomotion.” How to enable direct transitions between non-locomotion skills (e.g., vault → climb) without requiring locomotion interludes?
Terrain fitting for references: references are generated by placing terrain relative to the motion clip; real-world obstacles won’t match this placement. How to perform on-the-fly trajectory fitting/warping to sensed geometries with contact-consistency guarantees?
Limited terrain randomization: geometry perturbations are small (±5 cm, yaw ±45°) and widths capped; slippery, compliant, deformable, or uneven surfaces are not considered. How to broaden randomization and validate on OOD terrains?
Sensor artifact modeling: depth noise is injected but excludes motion blur and realistic outdoor artifacts (e.g., sunlight interference, multipath, rolling shutter). How to robustify to real sensor degradation and environment-specific artifacts?
Sim-to-real contact fidelity: contact modeling (friction, compliance, hand-surface interactions) is not deeply analyzed. What physical parameters most affect transfer, and can tactile feedback or online adaptation reduce mismatch?
Success vs. style: human-like expressiveness is claimed but not quantified; trade-offs between task success and motion style are not measured. What metrics (e.g., style classifiers, kinematic similarity) or constraints can enforce expressiveness?
Real-world evaluation scope: hardware results are qualitative with limited task diversity and no success-rate statistics or failure breakdowns across environments. Establish standardized benchmarks and report quantitative performance (success, recovery, falls, energy, impact forces).
Safety and risk management: fall detection, impact mitigation, and safe recovery behaviors are not discussed. How to integrate risk-aware control, safety envelopes, and certified recovery policies at high speeds?
Latency and onboard compute: real-time inference rates, latency budgets, and compute platforms are not detailed. What model compression or scheduling is needed to guarantee bounded end-to-end latency at higher speeds?
Sample efficiency and compute cost: training uses 16,384 environments and large networks; time-to-train and data budgets are not reported. Can offline RL, dataset aggregation, or distillation-efficient methods reduce compute?
DAgger–RL curriculum sensitivity: the weighting schedule between imitation and RL is hand-tuned; stability and convergence are not theoretically grounded. Can uncertainty-aware or adaptive weighting strategies reduce tuning burden?
Handling multimodality and symmetry: left/right mirrored executions are addressed by relaxed termination but not modeled explicitly. Would symmetry-aware objectives, latent-mode policies, or mixture-of-experts improve stability and coverage?
Continual learning of new skills: the impact of adding skills over time (catastrophic forgetting, interference) is not studied. How to support modular expansion, incremental distillation, or skill-slotting without retraining from scratch?
End-to-end visuomotor training: the end-to-end baseline underperforms but leaves open whether better exploration (e.g., curriculum, demonstrations, diffusion-guided exploration) can rival expert-distilled policies without per-skill teachers.
Cross-robot transfer and morphology: results are limited to a Unitree G1; portability to other humanoids with different dimensions, actuation, or torque limits is untested. What normalization/retargeting strategies enable zero-shot cross-morphology transfer?
Dynamic/moving obstacles: only small, manual obstacle displacements are tested; reactive parkour with moving obstacles or crowds is unexplored. How to integrate prediction and online re-timing for moving environments?
Long-duration robustness: demonstrations up to ~60 s leave open questions about drift, thermal limits, battery, and cumulative error over minutes. What mechanisms (re-localization, periodic re-synchronization) maintain performance over extended runs?
Depth-only dependence: failure modes with transparent/reflective materials, low-texture surfaces, or depth dropouts are unaddressed. How to fuse RGB, IMU, proprioception, and possibly tactile to recover under depth failures?
Hyperparameter and design choices: action scaling, termination thresholds, and motion-tracking reward compositions are fixed; ablations are limited. What principled methods choose these automatically for new skills?
Benchmarking against alternative kinematic generators: comparisons to diffusion-based trajectory generators (e.g., MDM) or AMP-style multi-skill controllers are limited. Rigorous head-to-head comparisons could clarify when motion matching is preferable.
Reference-timing generalization: performance hinges on dense approach-distance coverage during motion matching; generalizing contact timing from sparse data remains open. Can timing policies be learned to extrapolate beyond observed approach conditions?
Interpretability and monitoring: the unified policy implicitly selects skills; there is no explicit skill label stream for introspection or safety monitoring. How to expose interpretable skill/phase estimates during execution?
Reproducibility and community benchmarks: the motion library, composition tools, and datasets are not stated as publicly available. Open-sourcing standardized libraries and test suites would enable fair comparisons and rapid progress.

View Paper Prompt View All Prompts

Practical Applications

Immediate Applications

The following applications can be deployed now in controlled or semi-structured environments, leveraging the paper’s motion-matching skill composition and depth-based, multi-skill policy with zero-shot sim-to-real transfer.

Humanoid parkour showcase and experiential demos
- Sector: entertainment, marketing, education
- Use case: Deploy Unitree G1 or similar humanoids to perform dynamic, long-horizon demonstrations (vaults, climbs, sustained multi-obstacle runs) at trade shows, theme parks, campus open houses, and public STEM events.
- Potential tools/products/workflows: “Parkour Demo Pack” with precomposed motion-matching trajectories and a unified student policy; simple velocity teleoperation UI; safety geofencing tooling; standard obstacle course configurations with documented dimensions.
- Assumptions/dependencies: Controlled courses and crowd management; obstacle geometries within trained ranges (e.g., up to ~1.25 m wall, specific widths/yaws); adequate actuator power and battery capacity; liability coverage and emergency stop procedures.
Humanoid mobility in cluttered industrial testbeds
- Sector: robotics, manufacturing/logistics
- Use case: Traverse low platforms, pallets, shallow ledges, or ad-hoc obstacles in test environments to evaluate mobility readiness (e.g., stepping onto low platforms or short climbs), without task-side manipulation.
- Potential tools/products/workflows: ROS-integrated “Perceptive Teleop Assist” (operator sends 2D velocity; policy autonomously selects step/climb/vault); motion-matching trajectory generator tailored to facility mock-ups; rapid camera calibration and depth-noise modeling toolkit for sim-to-real.
- Assumptions/dependencies: Operation in fenced test areas; depth camera field-of-view and latency calibrated per paper’s WARP-based pipeline; obstacles within policy’s trained shape/pose perturbations; no human cohabitation during high-energy maneuvers.
Search-and-rescue (SAR) training grounds evaluation
- Sector: public safety, emergency response
- Use case: Evaluate humanoids’ ability to traverse rubble-like obstacle courses (low-height climbs, narrow platforms) for mobility baselining in SAR training fields.
- Potential tools/products/workflows: Parkour-style mobility benchmark and evaluation rubric; course setup guide with randomized obstacle yaw/width; unified student policy with adaptive timing under obstacle displacement.
- Assumptions/dependencies: Non-deployed evaluation only; obstacle sizes matched to the trained distribution; limited environmental variability (lighting/weather); remote operation with emergency stop.
Academic benchmarking for long-horizon skill chaining
- Sector: academia, research
- Use case: Standardize evaluation of multi-skill chaining and sim-to-real transfer for humanoid control, comparing motion-matching vs. learned kinematics vs. reward-shaped RL.
- Potential tools/products/workflows: Open-sourced motion-matching dataset synthesis scripts; “atomic skill” libraries with entry windows; teacher-student distillation code (DAgger+PPO curriculum); WARP-based depth rendering for high-throughput simulation.
- Assumptions/dependencies: Access to compute (≥16k envs ideal); retargetable human motion data and terrain assets; Unitree G1 or similar platform for hardware trials.
Robotics education modules
- Sector: education
- Use case: Course labs teaching motion retargeting, motion matching, expert tracking, and hybrid DAgger+RL distillation for multi-skill policies.
- Potential tools/products/workflows: Modular teaching kit (OmniRetarget → motion matching → expert RL → distillation → sim-to-real); calibrated depth camera emulator; scaffolded assignments on ablations (dataset density, reward designs).
- Assumptions/dependencies: Simulation-first delivery; optional hardware extension for advanced cohorts; curated datasets of safe skills.
Tooling for sim-to-real depth perception and calibration
- Sector: software, robotics
- Use case: Adopt camera extrinsic randomization, depth noise injection, and latency modeling across robot projects to close perception gaps without costly photorealistic rendering.
- Potential tools/products/workflows: “Depth Transfer Toolkit” (WARP rendering, ROI-based self-visibility matching, latency randomization, artifact injection); automated calibration scripts.
- Assumptions/dependencies: Known sensor model; consistent mounting; validation poses; developer familiarity with Python/CUDA workflows.

Long-Term Applications

These applications require additional research, scaling, and engineering, particularly in perception (long-range sensing, semantics), manipulation (strong hands/grippers), safety-certification, and reliability for unstructured environments.

Urban search-and-rescue deployment in real incidents
- Sector: public safety
- Use case: Autonomous traversal over rubble, collapsed structures, and complex 3D obstacles (climb-ups, vaults, drops) to reach victims or carry sensors.
- Potential tools/products/workflows: Robust multi-sensor perception stack (depth + LiDAR + stereo + thermal); semantic scene understanding; fall risk estimation; procedural skill guardrails; “mission policy” blending mobility with situational heuristics.
- Assumptions/dependencies: Strong hands or grippers for edge support and ladder climbing; weatherproofing; reliability under dust/smoke; human–robot teaming protocols; certification and incident command integration.
Construction and industrial maintenance mobility
- Sector: construction, energy, industrial inspection
- Use case: Navigating scaffolding, platforms, and irregular plant architectures; climbing onto mezzanines; negotiating obstacles to reach inspection points.
- Potential tools/products/workflows: “Mobility Skill Graph” composer integrating parkour skills with compliant manipulation (rail grabs, bar hangs); state estimation under occlusions; route planning with skill-aware timing; fall-arrest systems and tethering.
- Assumptions/dependencies: Higher-actuation strength and endurance; grasping/hooking capabilities; rigorous safety compliance; risk assessment and site-specific training.
Logistics and facilities operations in complex buildings
- Sector: logistics, warehousing
- Use case: Multi-floor access without standard infrastructure (stairs, ad-hoc hurdles), emergency deliveries, or traversal during outages.
- Potential tools/products/workflows: Language-conditioned high-level task control (e.g., “deliver to floor 3 via west stairwell”), semantic mapping, human-aware navigation and speed-governing, multi-skill motion planning under traffic and clutter.
- Assumptions/dependencies: Safety certification for operation near workers; broader perception coverage; operational policies for dynamic maneuvers; battery management and thermal constraints.
Multi-skill loco-manipulation for complex tasks
- Sector: robotics, software
- Use case: Combine parkour-like mobility with whole-body manipulation (opening doors, ladder ascent, rail support) for full-scene task execution.
- Potential tools/products/workflows: Extended motion library via OmniRetarget for interaction-heavy maneuvers; motion-matching over heterogeneous skill families; hierarchical teacher-student distillation; “skill composer” SDK with semantic preconditions.
- Assumptions/dependencies: Force-aware hands, compliant control, rich scene semantics; co-training perception and control; model-based safety checks.
Standards, certification, and safety governance for high-energy humanoid motions
- Sector: policy/regulation
- Use case: Establish testing protocols and certification tiers for dynamic behaviors (vaults, climbs, drops) across public, industrial, and emergency settings.
- Potential tools/products/workflows: Standardized obstacle courses and metrics (success rate, timing, contact forces, fall risk); reporting schemas; incident logs; insurance and liability frameworks.
- Assumptions/dependencies: Cross-stakeholder collaboration (manufacturers, regulators, unions, safety engineers); public acceptance; legal clarity.
Stunt and live-performance robots
- Sector: entertainment
- Use case: Reliable robotic performances with dynamic sequences (vault chains, wall climbs, rolls) synchronized to show cues, with contingency handling for obstacle shifts.
- Potential tools/products/workflows: Redundant sensing, preflight validation, rehearsal pipelines with motion matching; runtime monitors for energy, joint temperatures, and skill aborts; show safety standards.
- Assumptions/dependencies: High mechanical reliability; rapid recovery behaviors; liability and venue compliance.
Exoskeleton and assistive device control inspired by motion matching and skill chaining
- Sector: healthcare, rehabilitation
- Use case: Adaptive assistance for dynamic transitions (curb steps, low platform climbs) using motion-matching features and success-aware control objectives.
- Potential tools/products/workflows: Human-in-the-loop motion feature design; patient-specific retargeting; safety envelopes; gradual RL curricula tuned to clinical constraints.
- Assumptions/dependencies: Rigorous clinical validation; fail-safes; ethical approvals; user comfort and trust.
Digital twins and sports analytics using physically-executable skill compositions
- Sector: software, sports tech
- Use case: Generate physically plausible long-horizon motion sequences for athletes or avatars (e.g., parkour lines) to assess timing, contact strategies, and risk.
- Potential tools/products/workflows: Motion-matching sequence planner; simulation-based stress/impact estimators; coaching dashboards; integration with biomechanical models.
- Assumptions/dependencies: High-quality motion capture; validation against human performance; careful interpretation for training recommendations.

Cross-cutting assumptions and dependencies

Motion data: Availability of curated human motion clips and accurate retargeting (e.g., OmniRetarget), including annotated entry/exit windows for each skill.
Perception: Depth camera sufficiency in FOV, range, and latency; additional sensors needed for unstructured environments (LiDAR, IMU fusion, stereo).
Hardware: Actuator power to deliver brief, high-magnitude actions; stronger hands/grippers for advanced climbing; heat and battery management.
Safety: Operation in controlled areas now; future certification for human-cohabited spaces; emergency stop and geofencing; hazard-aware speed limiting.
Scalability: Compute resources for high-throughput training (thousands of envs), domain randomization tuning, and robust sim-to-real procedures.
Semantics and planning: Current policy relies on depth and velocity command only; broader deployment requires semantic scene understanding, higher-level planning, and language or task conditioning.

View Paper Prompt View All Prompts

Glossary

Action scale: A normalization factor applied to action outputs to control their magnitude during policy training and execution. "Due to challenging RL exploration, we set the action scale to 1 for all experts, instead of the heuristics used in~\cite{liao2025beyondmimic}."
Adaptive Sampling: A training strategy that prioritizes sampling from states or segments where policies fail, to accelerate learning of difficult behaviors. "Adaptive Sampling is essential for learning difficult skills in expert training, which prioritizes sampling from regions that fail more frequently."
Alive reward: A simple reinforcement learning reward that encourages the agent to remain operational and make progress, without detailed tracking objectives. "DAgger + RL Alive Reward. Use only an alive/progress reward, without motion-tracking terms."
AMP: Adversarial Motion Priors; a method that learns motion-style rewards from data to guide RL in training policies that match human motion distributions. "AMP~\cite{peng2021amp} addresses this challenge by training a single policy to learn a distribution of skills, allowing transitions to implicitly emerge from RL exploration, but replaces hand-crafted rewards with a learned style reward from motion data."
Behavior cloning: Supervised learning approach where a policy imitates expert actions given states, often used to regularize or initialize RL. "the behavior cloning objective provides a critical regularization for RL."
Blending window: A short temporal smoothing interval applied at motion transitions to avoid discontinuities between clips or frames. "A short blending window is applied around transitions to avoid discontinuities."
Camera extrinsics: The camera’s position and orientation relative to the robot or world frame; often randomized to improve robustness to mounting variability. "we randomize camera extrinsics within 2.5 cm translation and 2.5Â° rotation around the calibrated value to improve robustness to viewpoint shifts and mounting variability."
CNN: Convolutional Neural Network; a deep learning architecture particularly suited for processing image-like inputs such as depth maps. "we use a 3-layer CNN and a 5-layer MLP with hidden sizes [2048, 1024, 512, 256, 128], trained with 16,384 parallel environments."
Closed-loop adaptation: Real-time feedback control that adapts behavior during execution in response to perceived changes or disturbances. "demonstrating highly dynamic parkour skills such as climbing tall obstacles up to 1.25\,m (96\% robot height), as well as long-horizon multi-obstacle traversal with closed-loop adaptation to real-time obstacle perturbations."
DAgger: Dataset Aggregation; an imitation learning algorithm that iteratively trains on the learner’s state distribution with expert labels to reduce compounding errors. "and distill them into a single depth-based, multi-skill student policy, using a combination of DAgger and RL."
Domain Randomization: Systematic randomization of simulation parameters (e.g., visual, physical) during training to improve robustness and transfer to the real world. "Rewards, Terminations, and Domain Randomization follow BeyondMimic~\cite{liao2025beyondmimic}: DeepMimic-style tracking rewards with action rate, joint limits, and collision penalties, tracking-based early termination, and lightweight randomizations."
DoFs: Degrees of Freedom; the number of independent joint or body motions a robot can perform. "We evaluate the proposed framework through a series of simulation and real-world experiments on a Unitree G1 humanoid (1.3\,m tall with 29 DoFs)."
Exteroception: Sensing external environmental cues (e.g., vision, depth) used to adapt motion to terrain and obstacles. "these skills must be tightly coupled with exteroception, such as vision, to enable adaptation to environmental variation and rapid reaction to unexpected perturbations."
Feature space: A vector space of motion features (e.g., trajectories, joint positions) where similarity search (nearest neighbor) guides motion matching. "leverages motion matching, formulated as nearest-neighbor search in a feature space, to compose retargeted atomic human skills into long-horizon kinematic trajectories."
Global tracking: Tracking a reference motion in world coordinates (not just locally), enabling recovery behaviors and alignment with terrain. "Unlike~\cite{yang2025omniretarget}, we enable global tracking with privileged observations (pelvis global position and velocity) so the expert can learn recovery behaviors."
Height scan: A local height-map measurement (grid of terrain elevations) around the robot, used by policies to adapt to terrain variations. "We additionally provide the expert with a 0.7 m Ã 0.7 m height scan, allowing it to adapt to terrain randomizations."
IsaacLab: A robotics learning framework from NVIDIA used for training and evaluating RL policies with physics simulation. "We train a humanoid to traverse terrain using IsaacLabâs \cite{mittal2025isaac} standard velocity-tracking RL pipeline."
Kinematic trajectories: Time sequences of poses (positions and orientations of joints/root) used as motion references without modeling full dynamics. "compose retargeted atomic human skills into long-horizon kinematic trajectories."
KL-based exploration control: Using Kullback–Leibler divergence constraints to regulate policy updates and exploration in PPO. "Finally, we enable adaptive learning rate and KL-based exploration control only when $\lambda_{\text{PPO}$ exceeds 0.1."
Locomotion manifold: The shared space of locomotion states used to connect and transition between heterogeneous dynamic skills. "By routing all skills through a shared locomotion manifold, this formulation enables consistent transitions across heterogeneous skills without requiring specific, hand-captured transitions between every possible skill pair, and supports scalable composition of long-horizon behaviors."
MLP: Multilayer Perceptron; a feedforward neural network commonly used for processing proprioceptive inputs and outputs actions. "we use a 3-layer CNN and a 5-layer MLP with hidden sizes [2048, 1024, 512, 256, 128], trained with 16,384 parallel environments."
Motion matching: A data-driven motion synthesis technique that stitches motion fragments by nearest-neighbor matching of features, producing realistic transitions. "Motion matching synthesizes long-horizon motion by retrieving and stitching motion fragments via nearest-neighbor search in a designed feature space."
Motion priors: Prior knowledge or distributions over human motion styles used to regularize or guide policy learning. "we present Perceptive Humanoid Parkour (PHP), a modular framework that integrates human motion priors, long-horizon skill composition, and perceptive control."
Nearest-neighbor search: A technique to find the most similar item (frame/clip) in a feature space, used to select motion transitions. "Motion matching synthesizes long-horizon motion by retrieving and stitching motion fragments via nearest-neighbor search in a designed feature space."
Nvidia WARP: A high-performance GPU framework used here to render depth images at scale for training. "For vision, we use depth images rendered with Nvidia WARP~\cite{warp2022} for high-throughput training."
Observation delay: Latency between sensing and policy action due to hardware/software pipelines; often randomized in sim for robustness. "Finally, we randomize observation delay between 60 ms and 80 ms to simulate hardware latency fluctuations."
OmniRetarget: A method to retarget human motion to robot kinematics while preserving interaction and style. "We first retarget human motion data into a library of robot-compatible atomic skills using OmniRetarget \cite{yang2025omniretarget}."
PD targets: Proportional–Derivative controller targets for joints; actions specify desired joint positions/velocities for PD control. "Actions are joint PD targets normalized by a fixed action scale."
Pelvis gravity vector: The gravity direction expressed in the pelvis/body frame; used as a proprioceptive signal. "The policy observes proprioception signals including pelvis gravity vector and angular velocity, joint positions and velocities, and the previous action."
Privileged observations: State variables available in simulation (e.g., global pose) but not on real hardware; used to train teacher experts. "We then train motion-tracking expert policies with privileged observations in simulation, and finally distill them into a depth-based student policy..."
Proprioception: Internal sensing of the robot’s own state (e.g., joint angles, velocities, body orientation). "The policy observes proprioception signals including pelvis gravity vector and angular velocity, joint positions and velocities, and the previous action."
PPO: Proximal Policy Optimization; a stable RL algorithm that constrains policy updates to improve learning reliability. "using DAgger in combination with a PPO objective, enabling zero-shot sim-to-real deployment."
Receding-horizon replanning: Planning over a moving short time window repeatedly during execution; computationally costly with perception. "This often requires either costly iterative co-training~\cite{xu2025parc} to recover usable motion or receding-horizon replanning~\cite{huang2025diffuse}, which is costly with perception in real time."
RL: Reinforcement Learning; training policies via reward signals and trial-and-error interaction with environments. "we augment distillation with an RL objective that provides task-level corrective feedback, steering the student towards successful traversal and yielding a scalable recipe across many skills."
Reward shaping: Designing specific reward components to elicit desired behaviors in RL, often used when training from scratch. "While quadrupedal parkour skills can often be trained from scratch via reward shaping, this approach scales poorly to humanoids due to high-dimensional whole-body control."
ROI overlap: Region of Interest overlap; a metric to match visibility between simulation and hardware for camera calibration. "We calibrate the simulated camera by matching robot self-visibility across a set of poses in simulation and hardware using ROI overlap, and randomize camera extrinsics..."
Root tracking reward: A reward focused on tracking the root (base) pose/trajectory rather than the full-body motion. "Use a root-tracking reward instead of full whole-body tracking."
Root velocity: The linear velocity of the character’s root (base) used as a motion feature for matching. "and (iii) root velocity, all expressed in the character's local coordinate frame."
Sim-to-real transfer: Deploying policies trained in simulation onto real hardware effectively. "Successful zero-shot sim-to-real transfer of depth-based policies on a physical humanoid robot, achieving highly dynamic parkour over various obstacles."
Teacher-student training pipeline: A framework where expert (teacher) policies guide training of a single student policy, often via imitation and RL. "By composing various agile human skills via motion matching and a teacher-student training pipeline, we train a single multi-skill visuomotor policy..."
Visuomotor policy: A control policy that maps visual inputs (e.g., depth) and proprioception to motor actions. "we train a single multi-skill visuomotor policy capable of complex contact-rich maneuvers..."
Yaw randomization: Randomly perturbing obstacle orientation around the vertical axis to diversify training conditions. "Each task requires the robot to move forward at a fixed command speed (1.0\,m/s or 2.0\,m/s) and clear a single obstacle of a specified height and $20^\circ$ yaw randomization."
Zero-shot: Deploying a model on new tasks or hardware without fine-tuning or additional training. "Successful zero-shot sim-to-real transfer of depth-based policies on a physical humanoid robot"

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

Summary

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

Overview and Motivation

Motion Matching for Long-Horizon Parkour Composition

Teacher-Student Training Framework

Real-World Demonstrations and Quantitative Performance

Ablative and Comparative Studies

Implications and Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

Overview

What did the researchers want to find out?

How did they do it?

Teaching the robot human moves with “motion matching”

Training “expert teachers,” then a single “student”

Giving the robot “eyes” and making it transfer to reality

What did they discover?

Why does this matter?

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Practical Applications

Immediate Applications

Long-Term Applications

Cross-cutting assumptions and dependencies

Glossary

Open Problems

Continue Learning

Authors (12)

Collections

Tweets

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

Summary

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

Overview and Motivation

Motion Matching for Long-Horizon Parkour Composition

Teacher-Student Training Framework

Real-World Demonstrations and Quantitative Performance

Ablative and Comparative Studies

Implications and Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

Overview

What did the researchers want to find out?

How did they do it?

Teaching the robot human moves with “motion matching”

Training “expert teachers,” then a single “student”

Giving the robot “eyes” and making it transfer to reality

What did they discover?

Why does this matter?

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Practical Applications

Immediate Applications

Long-Term Applications

Cross-cutting assumptions and dependencies

Glossary

Open Problems

Continue Learning

Related Papers

Authors (12)

Collections

Tweets