EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Published 4 Feb 2026 in cs.RO and cs.CV | (2602.04515v1)

Abstract: Deploying humanoid robots in real-world settings is fundamentally challenging, as it demands tight integration of perception, locomotion, and manipulation under partial-information observations and dynamically changing environments, as well as robust transitions between sub-tasks of different types. To address these challenges, we propose a novel task, EgoActing, which requires directly grounding high-level instructions into various, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that can predict locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real-time. We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial reasoning question-answering, and simulated environment demonstrations, enabling EgoActor to make robust, context-aware decisions and perform fluent action inference (under 1s) with both 8B and 4B parameter models. Extensive evaluations in both simulated and real-world environments demonstrate that EgoActor effectively bridges abstract task planning and concrete motor execution, while generalizing across diverse tasks and unseen environments.


Authors (6)

Summary

The paper introduces EgoActor, a novel framework that directly grounds natural language into low-level egocentric actions for humanoid robots.
It employs a dual action representation—Structured Language Actions (SLAs) and Natural Language Actions (NLAs)—to effectively manage both navigation and manipulation tasks.
Extensive benchmarks show that EgoActor achieves superior success rates in obstacle avoidance, mobile manipulation, and human–robot interaction through advanced spatial reasoning.

EgoActor: Grounding Task Planning into Egocentric, Spatially-Aware Actions for Humanoid Robots

Introduction and Motivation

EgoActor addresses the persistent challenge of deploying humanoid robots in dynamic real-world environments, which demand tightly coupled perception, locomotion, manipulation, and interaction capabilities under egocentric, partially observable conditions. Traditional approaches to robot task planning either rely on modular controllers or restrict themselves to pre-defined skill libraries, limiting adaptability and the ability to generalize outside structured tasks. EgoActor confronts this by directly grounding natural language instructions into low-level, egocentric action sequences, using vision-language models (VLMs) and a newly defined EgoActing task. This formulation emphasizes spatial reasoning and cohesive control for movement, perception, manipulation, and human interaction, laying the groundwork for more robust, autonomous humanoid agency.

Figure 1: Visualization of EgoActor's working procedure for a given task: "Approach and pick up the orange on the desk". The grey blocks represent structured language actions (SLAs) and the green blocks represent natural language actions (NLAs).

Framework and Task Formulation

The EgoActing task is formalized as predicting the next action $a_t$ from the set of capabilities $\mathcal{A}$, conditioned on the instruction $I$, egocentric observation history $O_{1:t}$, action history $a_{1:t-1}$, and policy set $\Pi$. EgoActor decomposes actions into two classes:

Structured Language Actions (SLAs): Concise, template-based primitives specifying movement or perception (e.g., "Turn left 30.5 degrees", "Move forward 0.6 meters"). These provide interpretable, spatially-grounded control for navigation and orientation.
Natural Language Actions (NLAs): Open-ended commands for manipulation and interaction (e.g., "Pick up the cup", "Say hi to the person"), supporting task flexibility and intent expression beyond fixed skill sets.

This dual representation supports robust, generalizable grounding from instruction to action.
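As a rough illustration of how the two action classes could be told apart downstream, the sketch below parses a predicted action string into a structured record when it fits an SLA-like template and otherwise treats it as an NLA. Only the example strings come from the text above; the template grammar, the vocabularies, and the names `SLA_PATTERN`, `StructuredAction`, `NaturalAction`, and `parse_action` are assumptions made for illustration, not the authors' implementation.

```python
import re
from dataclasses import dataclass
from typing import Union

# Hypothetical SLA template: "<verb> <direction> <magnitude> <unit>".
# The verb, direction, and unit vocabularies are assumptions; the pattern
# does not check that a verb is paired with a sensible unit.
SLA_PATTERN = re.compile(
    r"^(?P<verb>turn|move)\s+(?P<direction>left|right|forward|backward)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s+(?P<unit>degrees|meters)$",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class StructuredAction:
    """A template-based movement primitive (SLA)."""
    verb: str
    direction: str
    value: float
    unit: str


@dataclass(frozen=True)
class NaturalAction:
    """An open-ended manipulation or interaction command (NLA)."""
    text: str


def parse_action(text: str) -> Union[StructuredAction, NaturalAction]:
    """Return a StructuredAction if the text fits the template, else a NaturalAction."""
    cleaned = text.strip().rstrip(".")
    match = SLA_PATTERN.match(cleaned)
    if match is None:
        # Anything outside the template is passed through as free-form text.
        return NaturalAction(text=cleaned)
    return StructuredAction(
        verb=match["verb"].lower(),
        direction=match["direction"].lower(),
        value=float(match["value"]),
        unit=match["unit"].lower(),
    )
```

Under these assumptions, `parse_action("Turn left 30.5 degrees")` yields a `StructuredAction` that a locomotion controller could consume, while `parse_action("Pick up the cup")` falls through to a `NaturalAction` that would be handed to a manipulation or interaction skill.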

Figure 2: Example natural language actions (NLA) in EgoActing. EgoActor is trained to predict the corresponding actions based on obtained RGB observations.

Model Architecture and Training Pipeline

EgoActor is instantiated as a transformer-based VLM (specifically, Qwen3-VL) fine-tuned with LoRA (low-rank adaptation). Training is performed over diverse data modalities to enhance embodied spatial intelligence and action grounding:

Egocentric real-world videos (over 150,000 samples): Capturing authentic perceptual inputs and action transitions under varied layouts.
Internet-scale datasets (EgoTaskQA): Enriching the training distribution with diverse tasks and complex instructions.
Virtual environments (VLN-CE, Habitat-Sim): Providing structured spatial navigation trajectories for controlled supervision.
Spatial reasoning resources (MindCube): Augmenting the model’s understanding of spatial relations.
Visual-language understanding/planning datasets (GQA, RoboVQA, EgoPlan, ALFRED): Ensuring cross-task transfer and planning synergy.
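For readers unfamiliar with the LoRA fine-tuning mentioned above, the standard formulation keeps each pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ frozen and trains only a low-rank update. The rank $r$ and scale $\alpha$ below are placeholders; this summary does not state the values used for EgoActor.

```latex
% Standard LoRA reparameterization; only A and B are trained, W_0 stays frozen.
% r and \alpha are placeholders, not values reported for EgoActor.
\[
  W = W_0 + \frac{\alpha}{r}\, B A,
  \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
\]
```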

Actions are predicted using uniform sampling over history and recent observations, allowing context-aware discrimination of next actions and facilitating sub-second inference latency on both 4B and 8B parameter scales.
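One plausible reading of this sampling scheme is sketched below: keep the last few frames in full and pick evenly spaced frames from the older history, so the context stays bounded as the episode grows. The function name `select_frame_indices` and the parameters `num_history` and `num_recent` are hypothetical, and the summary does not specify the actual frame counts or selection rule.

```python
from typing import List


def select_frame_indices(num_frames: int, num_history: int, num_recent: int) -> List[int]:
    """Pick frame indices: evenly spaced older frames plus the most recent ones.

    This is an assumed scheme for illustration, not the paper's exact procedure.
    """
    if num_frames < 0 or num_history < 0 or num_recent < 0:
        raise ValueError("all arguments must be non-negative")

    # The most recent frames are always kept in full.
    recent_start = max(0, num_frames - num_recent)
    recent = list(range(recent_start, num_frames))

    # Everything before the recent window counts as history.
    history_len = recent_start
    if history_len == 0 or num_history == 0:
        return recent
    if history_len <= num_history:
        # Short history: keep every frame.
        history = list(range(history_len))
    elif num_history == 1:
        history = [0]
    else:
        # Evenly spaced indices from the first to the last history frame.
        history = [
            (i * (history_len - 1)) // (num_history - 1)
            for i in range(num_history)
        ]
    return history + recent
```

For example, with 20 frames, 4 history samples, and 3 recent frames, the function returns `[0, 5, 10, 16, 17, 18, 19]`.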

Benchmarking and Empirical Results

Human-Robot Interaction

EgoActor demonstrates the capacity to resolve attribute-based referential ambiguity and perform seamless approach–interaction pipelines in both single- and multi-person scenarios, outperforming representative navigation-only VLMs (NaVid, Uni-NaVid, NaVILA) in success rates for both navigation and the subsequent interaction. The 8B variant shows significantly enhanced attribute disambiguation and dialogue act generation.

Mobile Manipulation

Under unseen object/layout conditions, EgoActor (especially the 8B model) achieves high success in both approach-and-pick and approach-and-place tasks, for both seen and out-of-distribution objects, confirming robust integration between navigation and fine manipulation cues.

Figure 3: An illustration of our model conducting the mobile manipulation task: "Approach and grab the pink cup".

Traversability and Obstacle Avoidance

Traversability experiments in narrow passages (room entry/exit, doorways) highlight EgoActor’s marked superiority in avoiding collisions relative to baseline VLN models. The model generalizes to novel obstacles and layouts, exploiting spatial reasoning learned from first-person data and adjusting behavioral primitives adaptively.

Figure 4: Multi-step illustration of obstacle avoidance generalization of our model, when faced with an unseen string obstacle.

Active Perception and Human-Like Behavior

EgoActor exhibits context-dependent active perception—modulating gaze, posture, and trajectory for task-relevant exploration or to maximize future manipulation success. These behaviors, captured in both real-world and simulation, include nuanced repositioning, joint control sequences, and human-like combined turning/strafe maneuvers.

Figure 5: First-person view of an EgoActor active perception trace. Colored description blocks highlight the model's behaviors.

Figure 6: First-person view of an EgoActor traversability trace, showing the robot walking through a doorway.

Figure 7: First-person view of an EgoActor height-change trace in virtual environments. Colored description blocks highlight the model's behaviors.

Strong Numerical Results and Claims

Across benchmarks, EgoActor (8B, real-world/virtual) achieves:

Traversability success rates >85% (doorways, unseen layouts), significantly surpassing all baselines (often by more than 30 to 40 percentage points).
Mobile manipulation (unseen objects/scenes): 100% success in approach-and-pick tasks on some targets, with failures largely attributable to downstream skill execution rather than intent prediction.
Human–robot interaction: Perfect success in single-person approach + interaction, and robust performance in out-of-distribution multi-person scenarios.
Virtual environment (<1.0 m to goal): >70% success rate; >87% within 3 m, substantially higher than baseline VLM navigation models, which plateau below 60% under lenient thresholds.

The results underscore superior transfer of abstract task specifications into actionable, egocentric motor sequences, with sharp improvements in spatial precision, attribute resolution, and qualitative smoothness of behavior.
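The distance-thresholded navigation figures above can be read as the fraction of episodes that end within a given distance of the goal. A minimal sketch of that metric follows; the function name `success_rate` is hypothetical, the strict `<` comparison is an assumption, and the distances in the usage note are made-up placeholders rather than results from the paper.

```python
from typing import Sequence


def success_rate(final_distances_m: Sequence[float], threshold_m: float) -> float:
    """Fraction of episodes whose final distance to the goal is below the threshold."""
    if not final_distances_m:
        raise ValueError("at least one episode is required")
    # An episode counts as a success if it stops strictly inside the threshold.
    hits = sum(1 for distance in final_distances_m if distance < threshold_m)
    return hits / len(final_distances_m)
```

With placeholder final distances `[0.4, 0.9, 2.5, 3.5]`, the rate is `0.5` at a 1.0 m threshold and `0.75` at 3.0 m, which shows why the lenient threshold always reports an equal or higher success rate than the strict one.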

Discussion: Theoretical and Practical Implications

EgoActor demonstrates a practical pathway for integrating high-level language planning and low-level whole-body control exclusively via egocentric RGB, without reliance on additional sensors or highly engineered skill libraries. The unified action representation enables seamless multi-modal transitions and reduces design/annotation overhead. The architecture’s scalability with data and model size, as well as its generalization across environments and instructions, indicates promise for future generalist humanoid agents.

Notably, existing VLN and classical navigation models—when retrofitted to direct robot control—struggle with spatial miscalibration, over-navigation, and insufficient action intent grounding, underscoring the necessity of architectures like EgoActor.

Limitations and Prospects

EgoActor depends heavily on the integrity and capability of downstream control policies (locomotion, manipulation, speech), and is not end-to-end from perception to physical actuation. Long-horizon compositional tasks and tightly coupled multi-agent scenarios remain challenging. Future extensions could integrate multi-modal sensory inputs, recurrent memory, or direct policy learning for closed-loop, fully unified control.

Ongoing research may enhance the framework by incorporating more diverse skills, real-world feedback (on-policy reinforcement or human-in-the-loop correction), and expanded action ontologies, ultimately facilitating more adaptive, safe, and autonomous humanoid behavior.

Conclusion

EgoActor establishes a new paradigm for grounding task planning into egocentric, spatially-aware whole-body actions in humanoid robots via vision-language models. Through a combination of unified action representation, diverse data curation, and rigorous benchmarking, it generalizes robustly in dynamic, real-world settings. This approach bridges the gap between high-level task abstraction and low-level execution, setting a new technical baseline for scalable, instruction-driven humanoid autonomy.

[The full details and benchmarks are available at (2602.04515).]
