Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

Published 13 Feb 2026 in cs.RO | (2602.13193v1)

Abstract: Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io


Summary

The paper introduces Steerable Policies that use diverse synthetic steering commands to enhance low-level robotic control and embodied reasoning.
It presents a scalable synthetic annotation pipeline that boosts training data diversity, enabling robust hierarchical control under varying command abstractions.
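In real-world manipulation experiments, Steerable Policies driven by either a fine-tuned embodied reasoner or an off-the-shelf VLM using in-context learning outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on generalization and long-horizon tasks.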

Steerable Vision-Language-Action Policies: Enhancing Embodied Reasoning and Hierarchical Control

Motivation and Background

Robotic policies capable of following open-ended commands for diverse tasks are constrained by the practical limitations of dataset curation and annotation. Pretrained vision-language models (VLMs) provide semantic and perceptual priors, yet existing hierarchical robotic control architectures typically interface VLMs with low-level controllers using sparse task-level language. This fundamentally limits the transfer of generalization, reasoning, and scene understanding capabilities from VLMs to robot actions. The paper "Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control" (2602.13193) addresses this bottleneck by introducing Steerable Policies, a new class of Vision-Language-Action models (VLAs) trained to accept a wide spectrum of synthetic steering commands, ranging from subtasks to grounded pixel coordinates.

Steerable Policy Architecture and Command Abstractions

The core contribution is to make the low-level policy more controllable by training it on diverse steering command modalities, which mitigates dataset bias and linguistic underspecification. By augmenting robot trajectory data with synthetically generated commands, Steerable Policies learn to respond to:

Task-level language (standard robotic instructions)
Subtask decomposition (semantic decomposition for behavior composition)
Atomic motion commands (fine-grained control without scene semantics)
Gripper traces (sequences of visual coordinates for trajectory following)
Pointing commands (pixel position-based directives for object localization)
Combined hybrid commands (composition of modalities to resolve ambiguity and induce complex behaviors)
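
As a rough illustration of how the command types above could share one interface, the following sketch represents each command as a small record that is serialized to a prompt string. The class names, field names, and the text format of pixel coordinates are all assumptions made for this example; the summary does not specify how the paper encodes commands.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple

Pixel = Tuple[int, int]  # (x, y) image coordinates; this convention is an assumption


class Abstraction(Enum):
    TASK = "task"
    SUBTASK = "subtask"
    MOTION = "motion"
    TRACE = "trace"    # gripper trace: a sequence of pixels
    POINT = "point"    # pointing command: a single pixel


@dataclass
class SteeringCommand:
    abstraction: Abstraction
    text: str
    pixels: List[Pixel] = field(default_factory=list)

    def render(self) -> str:
        """Serialize the command into one prompt string (hypothetical format)."""
        if self.abstraction in (Abstraction.TRACE, Abstraction.POINT):
            if not self.pixels:
                raise ValueError("trace and point commands need pixel coordinates")
            coords = " ".join(f"({x}, {y})" for x, y in self.pixels)
            return f"{self.text} {coords}"
        return self.text


def combine(*commands: SteeringCommand) -> str:
    """Build a hybrid command by joining several rendered commands."""
    return "; ".join(c.render() for c in commands)


point = SteeringCommand(Abstraction.POINT, "grasp the object at", [(112, 87)])
motion = SteeringCommand(Abstraction.MOTION, "move left")
print(combine(motion, point))
```

Keeping every abstraction behind a single `render` method means the low-level policy only ever sees a string, so a high-level model can mix abstractions without any change to the policy's input format.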

This broad interface lets a high-level controller choose the abstraction level best suited to the current situation, making fuller use of pretrained VLM capabilities. Control runs as a hierarchical inference loop in which the high-level model sends commands of varying abstraction to the low-level Steerable Policy.

Figure 1: The hierarchical policy inference loop, where a high-level model sends commands to the low-level Steerable Policy.
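
The loop in Figure 1 can be sketched as follows. This is a minimal outline under stated assumptions, not the paper's implementation: the callables `high_level`, `low_level`, and `step_env` are hypothetical stand-ins for the VLM, the Steerable Policy, and the robot, and replanning every fixed number of steps is an assumed schedule.

```python
from typing import Callable, List, Sequence, Tuple

Observation = dict            # placeholder for camera images and robot state
Action = Sequence[float]      # placeholder for a low-level robot action


def run_hierarchy(
    high_level: Callable[[Observation, List[str]], str],
    low_level: Callable[[Observation, str], Action],
    step_env: Callable[[Action], Tuple[Observation, bool]],
    first_obs: Observation,
    replan_every: int = 10,   # assumed replanning interval
    max_steps: int = 50,
) -> List[str]:
    """Alternate between high-level command selection and low-level control.

    Returns the list of steering commands that were issued.
    """
    obs = first_obs
    history: List[str] = []
    command = ""
    for t in range(max_steps):
        if t % replan_every == 0:
            # The high-level model sees the scene and its own past commands.
            command = high_level(obs, history)
            history.append(command)
        action = low_level(obs, command)
        obs, done = step_env(action)
        if done:
            break
    return history
```

Passing the command history back to the high-level model is what would let it notice that an earlier command did not work and try a different abstraction.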

Scalable Synthetic Command Annotation Pipeline

To scale the training of Steerable Policies, the paper describes an automated pipeline that uses foundation models to annotate trajectories. Grounded features (bounding boxes, object traces, motion language) are programmatically extracted and combined with VLM queries (e.g., Gemini) to generate synthetic steering commands and post-hoc rationalizations, expanding the diversity of training prompts. The result is more labeled data across a wider range of abstractions, which supplies the supervision needed for hierarchical embodied reasoning.

Figure 2: Automated pipeline for annotating robot data with synthetic steering commands, extracting features and generating diverse prompts for VLA training.
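
One of the programmatic extraction steps, deriving motion language from a trajectory, might look like the sketch below. The axis convention, the 1 cm threshold, and the phrase templates are assumptions chosen for illustration; the summary does not say how the paper computes its motion labels.

```python
from typing import List, Sequence

# Assumed axis convention: +x is forward, +y is left, +z is up.
AXES = [("backward", "forward"), ("right", "left"), ("down", "up")]


def motion_label(delta: Sequence[float], threshold: float = 0.01) -> str:
    """Map an (dx, dy, dz) gripper displacement to a coarse motion phrase."""
    parts = []
    for value, (negative, positive) in zip(delta, AXES):
        if abs(value) > threshold:
            parts.append(positive if value > 0 else negative)
    return ("move " + " and ".join(parts)) if parts else "stay still"


def annotate(positions: List[Sequence[float]]) -> List[str]:
    """Label consecutive displacements and merge repeated labels into segments."""
    labels: List[str] = []
    for prev, cur in zip(positions, positions[1:]):
        delta = [c - p for p, c in zip(prev, cur)]
        label = motion_label(delta)
        if not labels or labels[-1] != label:
            labels.append(label)
    return labels


trajectory = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (0.10, 0.0, 0.0), (0.10, 0.0, 0.05)]
print(annotate(trajectory))
```

Labels like these need no scene semantics, which fits the atomic motion command type; in the described pipeline, a VLM would then combine such features with bounding boxes and traces to write richer commands.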

Hierarchical Policy Instantiations with VLMs

Two high-level control regimes are considered:

Fine-tuned embodied reasoning VLM: A lightweight open-source VLM is trained to decompose tasks into steering commands with explicit rationales, aligning the training objective closely with foundation model pretraining objectives and leveraging reasoning traces for improved generalization.
In-context learning with off-the-shelf VLMs: API-based VLMs (e.g., Gemini 3.0) are prompted with explicit explanations and history of command abstraction affordances. They adaptively select steering modalities via in-context learning, dynamically adjusting abstraction to optimize task progress.

Figure 3: Two novel high-level policies: (a) fine-tuned embodied reasoning VLM, (b) off-the-shelf VLM using in-context learning to select optimal steering command abstraction.
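
To make the in-context variant more concrete, here is a minimal sketch of how a prompt with abstraction explanations and command history could be assembled. The wording of the prompt, the example phrases, and the function name `build_prompt` are invented for this example and are not taken from the paper; no VLM API is called.

```python
from typing import List

# Hypothetical one-line explanations of each command abstraction.
ABSTRACTIONS = {
    "task": "a full task instruction, e.g. 'put the carrot on the plate'",
    "subtask": "one step of the task, e.g. 'pick up the carrot'",
    "motion": "a scene-agnostic movement, e.g. 'move left'",
    "trace": "a sequence of pixel coordinates for the gripper to follow",
    "point": "a single pixel coordinate marking the target object",
}


def build_prompt(task: str, history: List[str]) -> str:
    """Assemble a text prompt asking the VLM for the next steering command."""
    lines = [
        f"Task: {task}",
        "",
        "You control a robot by issuing one steering command per turn.",
        "Available command abstractions:",
    ]
    lines += [f"- {name}: {desc}" for name, desc in ABSTRACTIONS.items()]
    lines += ["", "Commands issued so far:"]
    lines += [f"{i}. {cmd}" for i, cmd in enumerate(history, start=1)] or ["(none)"]
    lines += ["", "Reply with the next command and the abstraction it uses."]
    return "\n".join(lines)


print(build_prompt("put the carrot on the plate", ["point: (112, 87)", "move left"]))
```

In a real system the prompt would be sent together with the current camera image, and the reply would be parsed into a command for the low-level policy.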

Experimental Evaluation and Results

The evaluation spans both standard and challenging generalization splits (motion, spatial, semantic, and long-horizon tasks) using the Bridge WidowX dataset and OpenVLA codebase. Key findings:

Oracle human experiments: Allowing unrestricted steering modalities nearly saturates task performance. Individual command styles perform best in different settings; their combination is crucial for robustness, especially for semantic generalization and spatial reasoning.

Figure 4: Interactive interface for querying humans for oracle steering commands with support for pixel-based input.

Embodied reasoning VLM: When integrated with Steerable Policies, fine-tuned high-level reasoners outperform prior VLAs, including the embodied reasoning models ECoT and ECoT-Lite as well as OpenVLA, with the largest gains under motion and semantic distribution shift.

Figure 5: Controlling the Steerable Policy with a learned high-level embodied reasoning model outperforms four baselines.

In-context learning VLMs: High-level VLMs adaptively adjust command abstraction during rollouts, employing corrective strategies when task progress stalls. This leads to strong improvements over baselines restricted to task/subtask-level prompts, directly leveraging scene understanding and reasoning capacities for long-horizon problem solving.

Figure 6: The VLM uses in-context learning to “discover” the optimal steering abstraction, switching from pointing to motion commands as needed.

Policy Behavior: Manifold of Reasonable Actions

Steerable Policies, when conditioned on underspecified (atomic motion) commands, exhibit intelligent behavior by implicitly marginalizing over all possible tasks enabled by the scene and commanded movement. This “manifold of reasonable actions” enables robust task completion, but highlights the need for abstraction-flexible steering in ambiguous contexts.

Figure 7: Examples of the manifold of “reasonable” actions; underspecified atomic motion commands induce task-aligned intelligent behaviors.
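
One way to read "implicitly marginalizing" is as a mixture over the tasks that are consistent with both the scene and the motion command. The notation below is an assumption introduced here for illustration (observation $o$, motion command $m$, candidate task $\tau$, action $a$) and is not a formula from the paper:

```latex
\pi(a \mid o, m) \;\approx\; \sum_{\tau \in \mathcal{T}(o, m)} p(\tau \mid o, m)\, \pi(a \mid o, \tau)
```

Here $\mathcal{T}(o, m)$ is the set of tasks that the scene affords and that are compatible with the commanded movement. When this set contains several tasks, the command is ambiguous, which is the case where a more specific abstraction such as a pointing command can help.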

Implications and Future Directions

The work argues that increasing low-level policy steerability is key to unlocking the potential of pretrained VLMs in robotics. Diversity of steering command modalities leads to improved compositionality, generalization, and robustness, particularly for open-world or heavily randomized scenarios. The combination of hierarchical control and abstraction-flexible commands enables high-level reasoners (whether fine-tuned or in-context) to dynamically sequence problem-solving strategies, leveraging scene understanding and adaptive reasoning.

Practical implications include:

Dataset Design: Robotic datasets should prioritize behavioral diversity and richly annotated trajectories to maximize steerability and affordance learning.
Hierarchical Architectures: The Steerable Policy recipe is agnostic to VLA architecture and can be applied broadly to foundation model-based controllers.
Learning Frameworks: Future research may explore reinforcement learning to dynamically optimize abstraction selection policies, further improving cross-task transfer and adaptability.

Theoretical implications involve the necessity of flexible interfaces for hierarchical reasoning systems and the limits of subtask-only or single-modality conditioning. Steerable Policies are a step toward operationalizing VLMs’ “disembodied” intelligence in real-world domains.

Conclusion

Steerable Policies, via synthetic command augmentation and hierarchical VLM integration, improve embodied policy controllability and generalization. By expanding the interface between high-level reasoners and low-level controllers to span multiple abstraction modalities, they enable improved utilization of foundation model competencies for robotic manipulation, task composition, and generalization under distribution shift. The paradigm informs future dataset construction, policy architecture, and hierarchical control strategies in embodied AI.