Counterfactual Interactive Navigation with CoINS
- The paper introduces CoINS, a hierarchical robotics framework that unifies skill-aware vision-language reasoning with RL-trained low-level skills to actively clear obstacles in cluttered environments.
- It leverages counterfactual reasoning to assess object manipulability and feasibility, guiding mobile manipulators to make causal decisions for effective navigation.
- Empirical evaluations demonstrate substantial improvements in success rates and cross-embodiment performance, highlighting robust real-world transferability and dynamic obstacle manipulation.
Counterfactual Interactive Navigation via Skill-aware VLM (CoINS) is a hierarchical robotics framework that unifies skill-aware vision-language reasoning with robust, RL-trained low-level skills for interactive navigation in cluttered environments. Unlike canonical VLM-based navigation agents, which act as semantic reasoners and default to passive obstacle avoidance, CoINS enables mobile manipulators to actively assess the causal effects of object interactions and autonomously clear obstacles—extending capabilities beyond traditional static planning. The core innovation lies in counterfactual reasoning about environment manipulability, skill feasibility, and execution constraints, synthesized atop a metric- and affordance-grounded representation tied to the robot’s embodied capabilities (Zhou et al., 7 Jan 2026).
1. Hierarchical Framework and Policy Structure
CoINS implements a two-tiered hierarchy:
- High-level Reasoning (InterNav-VLM): A fine-tuned vision-LLM forms the high-level policy:
$\pi_{\text{high}}: (o_t, g, \mathcal{S}, \mathcal{C}) \mapsto (s_t, \theta_t)$, where $o_t$ is the egocentric RGB observation, $g$ is the navigation goal, $\mathcal{S}$ denotes the symbolic skill set, and $\mathcal{C}$ the corresponding parametric capability descriptions (height limit $h_{\max}$, clearance width $w_{\min}$, reach $r_{\max}$, object categories $\mathcal{O}$). The policy emits a discrete skill $s_t \in \mathcal{S}$ and continuous execution parameters $\theta_t$.
- Low-level Execution: The low-level policy translates the high-level command into robot-specific control, $\pi_{\text{low}}: (s_t, \theta_t, q_t) \mapsto a_t$, using proprioceptive geometry $q_t$ and producing joint-level servo targets $a_t$ via a learned whole-body controller.
This architectural split preserves the abstraction of task-level planning at the high level while leveraging robust, adaptive, low-level controllers that accommodate the dynamics and constraints of specific embodiments (e.g., quadrupeds with manipulators).
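The two-tier interface can be sketched as follows; the policy bodies, function names, and parameter values are illustrative stand-ins for this sketch, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class SkillCommand:
    skill: str     # discrete skill s_t from the symbolic set S
    params: tuple  # continuous execution parameters theta_t (e.g., a target pose)

def high_level_policy(rgb_obs, goal, skill_set, capabilities):
    """Stand-in for InterNav-VLM: maps (o_t, g, S, C) to a SkillCommand."""
    # In CoINS this is a fine-tuned VLM; here we return a fixed choice.
    return SkillCommand(skill="navigate", params=(goal,))

def low_level_policy(cmd, proprio_state):
    """Stand-in whole-body controller: maps a SkillCommand to joint targets."""
    # In CoINS this is an RL-trained controller emitting 18 joint targets.
    return [0.0] * 18

cmd = high_level_policy(
    rgb_obs=None, goal=(3.0, 1.5),
    skill_set=("navigate", "climb", "push", "open_door"),
    capabilities={"h_max": 0.25, "w_min": 0.6, "r_max": 0.8},  # illustrative values
)
joint_targets = low_level_policy(cmd, proprio_state=None)
```

The split mirrors the framework: the high level never emits joint commands, only a symbolic skill plus continuous parameters that the embodiment-specific controller interprets.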
2. Skill-Aware Representation and Environmental Grounding
InterNav-VLM constructs a multi-modal input sequence:
- Visual tokens extracted from a ViT backbone on the egocentric image $o_t$.
- Textual embeddings encoding the navigation goal $g$, the explicit skill set $\mathcal{S}$ (e.g., “navigate”, “climb”, “push”, “open_door”), and the robot’s precise affordance parameters ($h_{\max}$, $w_{\min}$, $r_{\max}$, $\mathcal{O}$).
Metric grounding is achieved via a dense 3D reconstruction pipeline: a single RGB image is processed by VGGT (structural depth) and Map-Anything (absolute metric scale), followed by RANSAC-based ground-plane canonicalization. The resulting point cloud $\mathcal{P}$ is projected into a 2D grid map $\mathcal{M}$:
- $H(x, y) = \max_{p \in \mathcal{P}_{xy}} p_z$ (cell height)
- $O(x, y) = \mathbb{1}[H(x, y) > h_{\max}]$ (occupancy based on the robot's height limit)
- $T(x, y) = 1 - O(x, y)$ (traversability)
For object manipulation, detections from Grounding DINO are mapped to 3D positions $p_o$. Manipulability is binary, $m(o) = \mathbb{1}[p_o \in \mathcal{W}(q_b)]$, with $\mathcal{W}(q_b)$ denoting the workspace reachable from base pose $q_b$.
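A minimal sketch of the point-cloud-to-grid projection described above, assuming a ground-canonicalized cloud; the cell size and height limit are illustrative values, not the paper's:

```python
import numpy as np

def pointcloud_to_gridmap(points, cell=0.05, h_max=0.25):
    """Project a ground-canonicalized (N, 3) point cloud into 2D grid layers:
    per-cell max height, occupancy (height above robot limit), traversability."""
    xy = np.floor(points[:, :2] / cell).astype(int)
    xy -= xy.min(axis=0)                      # shift indices to start at zero
    n_x, n_y = xy.max(axis=0) + 1
    height = np.zeros((n_x, n_y))
    for (i, j), z in zip(xy, points[:, 2]):   # per-cell maximum height
        height[i, j] = max(height[i, j], z)
    occupancy = (height > h_max).astype(int)  # blocked if taller than the robot limit
    traversable = 1 - occupancy
    return height, occupancy, traversable

pts = np.array([[0.0, 0.0, 0.02],   # floor point  -> traversable cell
                [0.3, 0.0, 0.60]])  # tall obstacle -> occupied cell
H, O, T = pointcloud_to_gridmap(pts, cell=0.1, h_max=0.25)
```

The real pipeline operates on dense reconstructed clouds; the logic per cell (max height, threshold against $h_{\max}$, complement for traversability) is the same.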
3. Distilling Counterfactual Reasoning into VLM
The distinguishing methodological contribution is embedding the logic of Navigation-Among-Movable-Obstacles (NAMO) directly into the VLM via supervised counterfactual VQA. For each start–goal pair $(x_s, x_g)$:
- Compute $L^*$, the optimal (A*) path length from $x_s$ to $x_g$ on the map $\mathcal{M}$.
- Let $L^*_{-o}$ be the corresponding path length on $\mathcal{M}_{-o}$, where $\mathcal{M}_{-o}$ is the map with object $o$ removed.
- Declare the optimal target $o^* = \arg\min_o L^*_{-o}$.
- If $L^*_{-o^*} < L^*$, label as “interact (skill $s$, object $o^*$)”; else as “navigate.”
From these, chain-of-thought-labeled VQA samples are generated, explicitly enumerating feasibility checks (height, reachability) and causal impact. InterNav-VLM is fine-tuned via standard token-prediction loss on Qwen3-VL. This enables implicit, efficient inference-time counterfactual reasoning—eliminating online map search for each hypothetical object removal.
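The labeling procedure above can be sketched as follows, with BFS standing in for the A* planner on a toy map; the helper names (`path_len`, `counterfactual_label`) are hypothetical:

```python
from collections import deque

def path_len(grid, start, goal):
    """BFS shortest-path length on a 4-connected grid (1 = traversable).
    Stands in for the A* planner; returns inf if the goal is unreachable."""
    q, seen = deque([(start, 0)]), {start}
    while q:
        (x, y), d = q.popleft()
        if (x, y) == goal:
            return d
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < len(grid) and 0 <= ny < len(grid[0])
                    and grid[nx][ny] == 1 and (nx, ny) not in seen):
                seen.add((nx, ny))
                q.append(((nx, ny), d + 1))
    return float("inf")

def counterfactual_label(grid, objects, start, goal):
    """Label 'interact' if removing some movable object strictly shortens
    (or enables) the optimal path, else 'navigate'.
    `objects` maps an object id to the set of cells it occupies."""
    best = path_len(grid, start, goal)        # L* on the original map
    best_obj = None
    for obj, cells in objects.items():
        cleared = [row[:] for row in grid]
        for (x, y) in cells:                  # counterfactual: object removed
            cleared[x][y] = 1
        if path_len(cleared, start, goal) < best:
            best_obj = obj
            best = path_len(cleared, start, goal)
    return ("interact", best_obj) if best_obj else ("navigate", None)

# A corridor fully blocked by a movable box: interaction is required.
grid = [[1, 0, 1],
        [1, 0, 1],
        [1, 0, 1]]
label = counterfactual_label(grid, {"box": {(0, 1), (1, 1), (2, 1)}}, (0, 0), (0, 2))
```

In CoINS this search is run only offline to generate training labels; at inference the fine-tuned VLM answers the counterfactual directly, without replanning per hypothetical removal.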
4. RL Skill Library: Traversability-Oriented Manipulation
The low-level skill stack is a two-stage reinforcement learning (PPO) hierarchy:
- Whole-body Controller: Receives a commanded base velocity $v^{\text{cmd}}_t$, a desired end-effector pose $p^{\text{ee}}_t$, and proprioceptive signals. Output: 18 joint targets. Reward: velocity/EE tracking, smoothness, collision avoidance.
- High-level skills:
- Navigation: Local traversability map $\mathcal{M}$ → A* path → pure-pursuit controller.
- Traversability-Oriented Manipulation: Reward maximizes $r = r_{\text{task}} + r_{\text{safe}} + r_{\text{effort}}$, with $r_{\text{task}}$ favoring minimization of base–goal distance and deviation from the optimal line; $r_{\text{safe}}$ penalizing collisions and dynamic instability; $r_{\text{effort}}$ minimizing energy/effort, ensuring minimal intervention for gap clearance.
- Door Opening: Reward based on the squared error to the target door angle, EE-handle distance, collisions, and energy.
Aggressive domain randomization is used during Isaac Lab training (object mass/geometry/friction/sensor noise), yielding robust category-agnostic policies. All skills generalize across diverse movable assets (boxes, barrels, chairs, doors).
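The three-term reward decomposition for traversability-oriented manipulation might be shaped as below; the weights and signal names are assumptions for illustration, not the trained configuration:

```python
def traversability_reward(base_goal_dist, line_dev, collided, unstable, effort,
                          w_task=1.0, w_safe=1.0, w_effort=0.1):
    """Illustrative shaping of r = r_task + r_safe + r_effort."""
    r_task = -w_task * (base_goal_dist + line_dev)           # approach goal, stay on optimal line
    r_safe = -w_safe * (float(collided) + float(unstable))   # penalize collisions / instability
    r_effort = -w_effort * effort                            # minimal-intervention bias
    return r_task + r_safe + r_effort
```

The effort term is what biases the policy toward displacing an obstacle just enough to clear a traversable gap, rather than pushing it arbitrarily far.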
5. Benchmarking and Empirical Performance
Benchmarking is conducted in Isaac Sim using 15 Matterport3D-based scenes spanning three difficulty levels (Small Room, Large Room, Room-to-Room) with five variants each, yielding 150 episodes per evaluation.
- Movable assets: 7 objects per benchmark, each with high-fidelity physics and collision meshes.
- Metrics: Success Rate (SR), Path Length (PL), Distance to Goal (DTG), and Interaction Count.
Performance results include:
| Method | Overall SR | Room-to-Room SR | PL (m) | DTG (m) |
|---|---|---|---|---|
| CoINS | 0.75 | 0.58 | 10.32 | 1.19 |
| IN-Sight (baseline) | 0.64 | 0.32 | 10.26 | 1.96 |
- InterNav-VLM achieves 78.35% VQA accuracy, compared to 58.34% (Gemini-2.5-Pro) and 33.56% (pre-trained Qwen3-VL). By platform: 80.21% (wheeled), 76.45% (legged).
- Relative gain for CoINS over IN-Sight: +17% overall SR; +81% SR in long-horizon Room-to-Room settings.
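The reported relative gains follow directly from the success rates in the table:

```python
# Relative SR gains of CoINS over IN-Sight, from the reported success rates.
coins_overall, insight_overall = 0.75, 0.64
coins_r2r, insight_r2r = 0.58, 0.32

gain_overall = (coins_overall - insight_overall) / insight_overall  # ~0.17
gain_r2r = (coins_r2r - insight_r2r) / insight_r2r                  # ~0.81
```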
- Ablations show critical dependence on fine-tuning (w/o: 0.17 SR) and traversability-oriented manipulation (w/o: 0.53 SR).
- Cross-embodiment: On TurtleBot3 (no manipulator), CoINS matches Art-planner (0.93 SR).
- Real-world deployment in cluttered settings demonstrates successful transfer (obstacle removal, door opening) without re-tuning. Skills generalize to physical boxes (60×40×50 cm, 55×45×20 cm) and left/right-hinged doors.
6. Failure Analysis and Limitations
Systematic failure analysis reveals:
- 35% of failures originate from VLM reasoning errors, e.g., unnecessary or incorrect object interaction, or suboptimal skill selection.
- 65% of failures are attributable to execution (collisions, inadequate displacement).
Proposed directions:
- Augment InterNav-VLM with explicit 3D spatial encoders to mitigate mis-reasoning in narrow or complex geometries.
- Integrate memory modules for superior long-horizon planning.
- Expand the skill library to encompass more complex manipulations (e.g., grasp-and-lift) and to support more diverse morphologies (such as humanoids).
A plausible implication is that improved metric-spatial representation and longer-term temporal reasoning may further reduce error rates and enable even broader generalization.
7. Significance and Broader Impact
CoINS establishes a new paradigm for embodied vision-language planning by merging counterfactual, affordance-aware reasoning with traversability-centric reinforcement learning skills. The framework tightly integrates high-level causal inference about environmental interventions with low-level policy robustness and real-world transferability. Empirical results demonstrate substantial gains in long-horizon interactive navigation, broad semantic and physical generalization, and competitive cross-embodiment fallback performance (Zhou et al., 7 Jan 2026). As research continues toward richer skill sets, improved spatial embeddings, and lifelong adaptation, the methodological trajectory exemplified by CoINS is likely to influence both embodied AI and the next generation of interactive robotic systems.