
InterNav-VLM: Skill-Aware Navigation

Updated 15 January 2026
  • The paper highlights that InterNav-VLM improves navigation success by up to 17 percentage points using causal and skill-aware reasoning.
  • InterNav-VLM is a vision-language model that integrates metric-scale depth mapping and counterfactual analysis to select physical intervention skills.
  • The system’s hierarchical design and RL-trained low-level controllers enable robust performance in both simulated and real-world robotic navigation tasks.

InterNav-VLM is a fine-tuned vision-language model (VLM) adapted for hierarchical interactive navigation tasks in robotics, with a focus on skill-aware, causally grounded decision making. Integrated as the high-level planner in the CoINS (Counterfactual Interactive Navigation via Skill-Aware VLM) framework, InterNav-VLM enables robots to reason not only about semantic scene understanding but also about the feasibility and necessity of physical interventions, such as pushing obstacles or opening doors, to create traversable paths in cluttered indoor environments. This advances prior VLM-based navigation systems, which have been limited by passive, semantics-driven strategies and their inability to assess robot-specific physical affordances and the consequences of scene modifications (Zhou et al., 7 Jan 2026).

1. Hierarchical Framework and Role of InterNav-VLM

InterNav-VLM operates as the high-level policy $\pi_H$ in a two-tiered hierarchical architecture. At each time step $t$, $\pi_H$ receives as input:

  • the current egocentric RGB observation $o_t^{rgb}$,
  • the goal location $X_{goal}$,
  • the robot’s discrete skill set $S$,
  • and capability parameters $C$ (e.g., arm reach, traversal height).

It outputs a skill $s_t \in S$ and continuous parameters $q_t$, formalized as

$$(s_t, q_t) = \pi_H(X_{goal}, o_t^{rgb}, S, C).$$

The selected skill, parameterized by $q_t$, is then executed by a corresponding low-level controller $\pi_L$, drawn from a library of Proximal Policy Optimization (PPO)-trained policies. This library spans:

  • an 18-DOF whole-body base+arm controller,
  • navigation-only control,
  • a traversability-oriented push skill,
  • and a door-opening skill.

This structured decomposition allows InterNav-VLM to guide physical interventions by actively selecting and parameterizing skills based on both scene semantics and robot-specific feasibility.
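The high-level/low-level decomposition above can be sketched as a simple dispatch loop. This is a minimal illustration, not the paper's implementation: `high_level_step` stands in for the fine-tuned InterNav-VLM, and all names (`SkillCommand`, `run_episode`, the `controllers` mapping) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SkillCommand:
    skill: str     # element of the discrete skill set S (e.g., "navigate")
    params: tuple  # continuous parameters q_t (e.g., a push target)

def high_level_step(goal_xy, rgb_obs, skills, capabilities):
    """Stand-in for pi_H(X_goal, o_t^rgb, S, C): in CoINS this is the
    fine-tuned InterNav-VLM; here we return a fixed navigation command."""
    return SkillCommand(skill="navigate", params=goal_xy)

def run_episode(goal_xy, observe, controllers, skills, capabilities, max_steps=100):
    """Hierarchical loop: pi_H picks (s_t, q_t); the matching low-level
    controller pi_L from `controllers` executes the parameterized skill."""
    for _ in range(max_steps):
        cmd = high_level_step(goal_xy, observe(), skills, capabilities)
        done = controllers[cmd.skill](cmd.params)
        if done:
            return True
    return False
```

In the real system each entry of `controllers` would wrap a PPO policy from the skill library described below.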

2. Skill-Aware Grounding and Metric-Scale Embedding

To transcend purely semantic reasoning, InterNav-VLM conditions its decisions on abstract skill symbols and concrete, metric constraints. Each skill type is defined with explicit physical affordances, such as:

  • Locomotion skills (“climb”) associated with a maximum traversal height $h_{max}$ and clearance width $w_{clear}$,
  • Manipulation skills (“push_box”) tied to object categories $c_{obj}$ and maximum reach $r_{reach}$.

A metric-scale 3D environmental representation is constructed by fusing high-resolution relative depth from VGGT with coarse metric depth from Map-Anything. Scale normalization is achieved via

$$s = \mathrm{median}_{(u,v)\in\Omega} \frac{D_{met}(u,v)}{D_{rel}(u,v)}, \quad Z = s \cdot D_{rel},$$

yielding a point cloud $P$ in a canonical frame. Ground-plane alignment and 2D projections generate:

  • Height map $H(u, v)$,
  • Occupancy map $M_{occ}(u, v) = \mathbf{1}(H(u, v) > h_{max})$,
  • Traversability map $M_{trav}(u, v)$ after obstacle inflation by the robot's half-width $r_{clear}$.
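The scale normalization and occupancy thresholding can be sketched with NumPy. This is an illustrative fragment under assumed array shapes; the function name and arguments are not the paper's API, and ground-plane alignment is taken as already done (the `heights` input).

```python
import numpy as np

def metric_scale_maps(d_rel, d_met, valid, heights, h_max):
    """Align relative depth to metric scale and threshold a height map.
    d_rel / d_met : relative (VGGT-style) and coarse metric depth images.
    valid         : boolean mask Omega of pixels used for the median ratio.
    heights       : ground-aligned height map H(u, v).
    h_max         : maximum traversal height of the robot."""
    # s = median over Omega of D_met / D_rel; Z = s * D_rel
    s = np.median(d_met[valid] / d_rel[valid])
    z = s * d_rel
    # M_occ(u, v) = 1(H(u, v) > h_max)
    m_occ = (heights > h_max).astype(np.uint8)
    return z, m_occ
```

Inflating `m_occ` by the robot half-width $r_{clear}$ (e.g., with a morphological dilation) would then yield the traversability map.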

Manipulation affordance grounding involves detecting candidate objects with Grounding DINO, localizing them in 3D, and testing reachability $\mathcal{F}_{manip}(k)$ given the arm workspace $\mathcal{W}(p)$ when the base is positioned at $p$. This representation enables fine-grained reasoning about which objects the robot can physically interact with and under which constraints.

3. Counterfactual Reasoning and Dataset Construction

A core innovation is the distillation of counterfactual reasoning within InterNav-VLM. Training data is generated by labeling navigation tasks with the causal effect of removing candidate objects on path length. For each object $o$, the framework computes:

  • Baseline A* path length $l(M_{trav}, x_g)$,
  • Counterfactual path length after object removal, $l(M_{trav}^{-o}, x_g)$,
  • Counterfactual gain $G(o) = 1 - \frac{l(M_{trav}^{-o}, x_g)}{l(M_{trav}, x_g)}$,
  • The object $o^*$ that maximizes $G(o)$, with action taken only if $G(o^*) > \epsilon$.
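The gain computation above reduces to a small selection routine. A minimal sketch, assuming hypothetical path-length callables in place of A* on the traversability map; the function name and `eps` default are illustrative.

```python
def counterfactual_gain(path_len, path_len_without, objects, eps=0.1):
    """Select the object whose removal most shortens the path.
    path_len            : baseline length l(M_trav, x_g).
    path_len_without(o) : length l(M_trav^{-o}, x_g) with object o removed.
    Returns (o*, G(o*)) if G(o*) > eps, else (None, 0.0)."""
    # G(o) = 1 - l(M_trav^{-o}, x_g) / l(M_trav, x_g)
    gains = {o: 1.0 - path_len_without(o) / path_len for o in objects}
    best = max(gains, key=gains.get)
    return (best, gains[best]) if gains[best] > eps else (None, 0.0)
```

In the dataset-construction pipeline, both lengths come from A* planning on the occupancy-derived traversability map before and after deleting the object's footprint.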

This annotation process produces a chain-of-thought visual question answering dataset (≈20K samples), encoding both skill feasibility (w.r.t. $C$) and causal necessity. Fine-tuning uses the standard next-token cross-entropy loss:

$$\mathcal{L} = -\sum_{i=1}^{L} \log P(y_i \mid y_{<i}, o^{rgb}, X_{goal}, S, C).$$

As a result, InterNav-VLM internalizes decision criteria for when and how to act upon the environment for navigation enhancement, enabling implicit counterfactual evaluation at inference time (Zhou et al., 7 Jan 2026).
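The fine-tuning objective is the standard token-level cross-entropy, which can be written out in NumPy for clarity. This is a generic sketch: the conditioning on $o^{rgb}$, $X_{goal}$, $S$, $C$ is implicit in the logits, and shapes are assumed.

```python
import numpy as np

def next_token_ce(logits, targets):
    """Next-token cross-entropy: L = -sum_i log P(y_i | y_<i, ...).
    logits  : (L, V) per-position scores over the vocabulary.
    targets : (L,) ground-truth token ids y_1..y_L."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()
```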

4. Low-Level Skill Library and RL-Based Execution

Skill execution is decentralized to a library of PPO-trained controllers, each addressing a distinct primitive:

  • Whole-Body Controller: Tracks a commanded base velocity $\mathbf{v}^*_{base}$ and end-effector pose $x^*_{ee}$, with rewards for trajectory tracking and stability and penalties for collisions and joint stress. Comprises 12 leg and 6 arm DOFs.
  • Navigation Skill: Operates with local A* on the generated occupancy map; utilizes pure-pursuit tracking and a neutral arm posture.
  • Traversability-Oriented Push Skill: Maximizes post-push navigability rather than precise object placement. The reward decomposes as $r = r_{nav} + r_{safe} + r_{eff}$, combining navigation success, safety (collision/instability penalties), and control costs.
  • Door-Opening Skill: Optimizes for door-angle agreement and end-effector proximity to handles, penalizing collisions.

These skills are invoked directly according to the outputs of $\pi_H$, enabling robust, real-world execution of high-level intervention policies.
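The push skill's decomposed reward can be sketched as follows. The weights and boolean inputs are illustrative choices, not the paper's tuned values; the paper only specifies the decomposition $r = r_{nav} + r_{safe} + r_{eff}$.

```python
def push_reward(nav_improved, collided, unstable, ctrl_effort,
                w_nav=1.0, w_safe=1.0, w_eff=0.01):
    """Decomposed push reward r = r_nav + r_safe + r_eff:
    r_nav  rewards post-push navigability gains,
    r_safe penalizes collisions and instability,
    r_eff  penalizes control effort (actuation cost)."""
    r_nav = w_nav * float(nav_improved)
    r_safe = -w_safe * (float(collided) + float(unstable))
    r_eff = -w_eff * ctrl_effort
    return r_nav + r_safe + r_eff
```

Rewarding navigability rather than object placement is what distinguishes this skill from a conventional relocate-then-go push, which the ablation below shows to be less effective.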

5. Systematic Evaluation and Key Results

CoINS, incorporating InterNav-VLM, is benchmarked in Isaac Sim with 15 Matterport3D-derived scenes stratified by complexity: small room, large room, and room-to-room (door manipulation required). Each has 10 start–goal pairs (total 150 episodes), with ≈50 interactive assets of diverse categories. The simulator leverages GPU PhysX and RTX rendering to facilitate sim-to-real transfer.

Metrics:

  • Success Rate (SR): Proportion of episodes with $dist(robot, goal)$ below a threshold.
  • Path Length (PL): Total traversed distance.
  • Distance to Goal (DTG): Residual distance after failures.
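The three metrics can be aggregated as in the following sketch. The episode representation and the success threshold value are assumed placeholders, not specified by the paper.

```python
def episode_metrics(episodes, threshold=0.5):
    """Aggregate SR, PL, DTG over a benchmark run.
    episodes  : list of (final_dist_to_goal, path_length) per episode.
    threshold : success radius around the goal (assumed value).
    SR  = fraction of episodes ending within the threshold,
    PL  = mean traversed distance,
    DTG = mean residual distance over the failed episodes."""
    failures = [d for d, _ in episodes if d >= threshold]
    sr = 1.0 - len(failures) / len(episodes)
    pl = sum(p for _, p in episodes) / len(episodes)
    dtg = sum(failures) / len(failures) if failures else 0.0
    return sr, pl, dtg
```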

Performance Table:

Method        SR     PL (m)   DTG (m)
CoINS         0.75   10.32    1.19
IN-Sight      0.64   10.26    1.96
IN-ArmPush    0.55   —        —
Art-planner   0.18   —        —

In long-horizon, room-to-room settings, CoINS achieves SR=0.58 vs 0.32 for the next-best baseline (+81%). Fine-tuning ablation reduces SR from 0.68 to 0.17, and replacing the traversability push skill with a standard relocate-then-go scheme reduces SR to 0.53 and increases PL to 13.03 m.

On cross-embodiment tests with a TurtleBot3 (wheeled, non-manipulating), CoINS matches Art-planner’s SR=0.93 but with shorter PL (8.67 m vs 8.96 m), evidencing InterNav-VLM’s skill-set adaptability.

Real-world validation with a Unitree Go2 + ARX5 system in a cluttered classroom and a corridor-to-room scenario shows successful transfer: the VLM selects between pushing and opening depending on context, and the RL skills generalize to new object sizes and door types.

6. Failure Analysis and Limitations

A breakdown of failure cases reveals 42% due to VLM reasoning errors (unnecessary or omitted interventions, skill misselection) and 58% attributable to skill execution (collisions or insufficient displacement). Noted limitations include:

  • Occasional hallucinations in spatial reasoning, motivating the integration of richer 3D abstractions,
  • Absence of temporal memory for extended tasks, motivating memory module inclusion,
  • Limited embodiment generality; extension to humanoids and other robot types is proposed (Zhou et al., 7 Jan 2026).

7. Implications and Extensions

By systematically grounding a VLM in both metric-scale physical affordances and causal reasoning sourced from counterfactual annotations, InterNav-VLM advances the state of vision-language navigation from passive obstacle avoidance to active, context-sensitive environment reconfiguration. The integration of differentiated RL skill controllers allows robust execution in simulation and deployment. Experimental results support substantial improvements in success rate (+17 pp overall; +80% in complex trajectories) relative to strong VLM and non-VLM baselines, both in simulation and on real robotic hardware (Zhou et al., 7 Jan 2026). Potential directions include integrating richer scene abstraction, temporal memory for long-horizon planning, and broader embodiment generalization.
