
CoINS: Counterfactual Interactive Navigation via Skill-Aware VLM

Published 7 Jan 2026 in cs.RO (arXiv:2601.03956v1)

Abstract: Recent Vision-Language Models (VLMs) have demonstrated significant potential in robotic planning. However, they typically function as semantic reasoners, lacking an intrinsic understanding of the specific robot's physical capabilities. This limitation is particularly critical in interactive navigation, where robots must actively modify cluttered environments to create traversable paths. Existing VLM-based navigators are predominantly confined to passive obstacle avoidance, failing to reason about when and how to interact with objects to clear blocked paths. To bridge this gap, we propose Counterfactual Interactive Navigation via Skill-aware VLM (CoINS), a hierarchical framework that integrates skill-aware reasoning and robust low-level execution. Specifically, we fine-tune a VLM, named InterNav-VLM, which incorporates skill affordance and concrete constraint parameters into the input context and grounds them into a metric-scale environmental representation. By internalizing the logic of counterfactual reasoning through fine-tuning on the proposed InterNav dataset, the model learns to implicitly evaluate the causal effects of object removal on navigation connectivity, thereby determining interaction necessity and target selection. To execute the generated high-level plans, we develop a comprehensive skill library through reinforcement learning, specifically introducing traversability-oriented strategies to manipulate diverse objects for path clearance. A systematic benchmark in Isaac Sim is proposed to evaluate both the reasoning and execution aspects of interactive navigation. Extensive simulations and real-world experiments demonstrate that CoINS significantly outperforms representative baselines, achieving a 17% higher overall success rate and over 80% improvement in complex long-horizon scenarios compared to the best-performing baseline.

Summary

  • The paper introduces a hierarchical framework (CoINS) that combines skill-aware VLM reasoning with reinforcement learning to facilitate counterfactual interactive navigation.
  • It employs counterfactual logic within the InterNav-VLM to assess object removal and uses an RL-based skill library for precise robot control.
  • Experiments in both simulation and real-world settings demonstrate that CoINS outperforms traditional navigation methods in cluttered, dynamic environments.

Summary of "CoINS: Counterfactual Interactive Navigation via Skill-Aware VLM"

Introduction

The paper "CoINS: Counterfactual Interactive Navigation via Skill-Aware VLM" (2601.03956) introduces a hierarchical framework designed to enhance robotic navigation in cluttered environments by integrating skill-aware Vision-Language Models (VLMs) with a reinforcement learning-based skill library. The framework, CoINS, aims to move beyond passive obstacle avoidance by allowing robots to physically manipulate their surroundings, thereby creating traversable paths. Unlike conventional methods, CoINS employs a VLM, dubbed InterNav-VLM, that combines skill awareness with counterfactual reasoning to determine when interaction is necessary and which object to target.

Figure 1: InterNav-VLM in CoINS framework. The VLM reasoning module takes the robot's egocentric RGB observations and embodiment constraints as input to produce high-level interaction and navigation decisions, which are then translated by the skill execution module into precise motion controls for diverse interaction primitives.

Problem Formulation

The CoINS framework addresses interactive navigation as a task and motion planning problem in environments populated with both movable and static obstacles. It consists of two main components: a skill-aware reasoning policy (InterNav-VLM) for high-level decision-making and an RL-based skill library for low-level control execution. Using onboard sensors, the robot perceives its environment and decides whether interaction is necessary to clear obstructions on the way to its goal, navigating efficiently toward the target.
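The two-level decomposition can be sketched as a minimal control loop. Every name below (`ReasoningPolicy`, `SkillLibrary`, `Decision`, the observation keys) is an illustrative placeholder, not the paper's actual API; a real system would query the fine-tuned VLM and the RL controllers at the marked points.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    skill: str    # e.g. "navigate", "push"
    target: str   # object or waypoint identifier

class ReasoningPolicy:
    """Stand-in for the high-level InterNav-VLM reasoner."""
    def decide(self, observation: dict, constraints: dict) -> Decision:
        # A real system would prompt the fine-tuned VLM with the egocentric
        # image plus the skill-affordance and constraint context here.
        if observation.get("path_blocked"):
            return Decision(skill="push", target=observation["blocking_object"])
        return Decision(skill="navigate", target="goal")

class SkillLibrary:
    """Stand-in for the RL-trained low-level skill library."""
    def execute(self, decision: Decision, observation: dict) -> dict:
        # Each skill would run its own closed-loop whole-body controller.
        return {"skill": decision.skill, "target": decision.target, "done": True}

def run_episode(policy: ReasoningPolicy, skills: SkillLibrary,
                observations: list) -> list:
    """High-level decisions are translated into low-level skill executions."""
    return [skills.execute(policy.decide(obs, {"max_payload_kg": 3.0}), obs)
            for obs in observations]
```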

InterNav-VLM for Navigation Reasoning

InterNav-VLM is fine-tuned to incorporate skill affordances and physical constraints into its reasoning context, enabling physically grounded navigation decisions. The model applies counterfactual logic to evaluate the causal effect of removing an object on path connectivity, thereby deciding when interaction is necessary to reach the goal.

Figure 2: VQA generation process.
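The counterfactual question ("would removing this object connect start and goal?") can be made concrete with a small grid-connectivity sketch. This is a didactic stand-in: per the paper, InterNav-VLM internalizes this logic implicitly through fine-tuning rather than running an explicit search.

```python
from collections import deque

def reachable(grid, start, goal):
    """BFS on a 2D occupancy grid; True cells are blocked."""
    rows, cols = len(grid), len(grid[0])
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and not grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False

def removal_unblocks(grid, obstacle_cells, start, goal):
    """Counterfactual test: does removing this obstacle create a path?"""
    if reachable(grid, start, goal):
        return False  # already connected: no interaction needed
    edited = [row[:] for row in grid]  # hypothetically remove the object
    for r, c in obstacle_cells:
        edited[r][c] = False
    return reachable(edited, start, goal)
```

Only objects whose removal flips reachability from blocked to connected are worth interacting with, which is exactly the interaction-necessity and target-selection question the reasoner answers.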

RL-Based Skill Library

CoINS employs a quadruped robot equipped with a manipulator for diverse loco-manipulation tasks. This robot utilizes a low-level whole-body controller trained via PPO to maintain stability while executing high-level navigation and interaction skills. The high-level skills include traversability-oriented manipulation, which focuses on efficient path clearance, and door opening, among others.
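Since the whole-body controller is trained with PPO, its core is the standard clipped surrogate objective. Below is a minimal pure-Python version of that loss, illustrative only; the paper's training code would operate on batched tensors of torque-level actions.

```python
def ppo_clip_loss(ratios, advantages, eps=0.2):
    """PPO clipped surrogate loss (to be minimized).

    ratios:     pi_new(a|s) / pi_old(a|s) per sample
    advantages: advantage estimates per sample
    """
    losses = []
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, 1 - eps), 1 + eps)
        # Pessimistic (min) of the unclipped and clipped objectives,
        # negated so that gradient descent maximizes the surrogate.
        losses.append(-min(r * a, clipped_r * a))
    return sum(losses) / len(losses)
```

Clipping the probability ratio to [1 − ε, 1 + ε] keeps each policy update close to the data-collecting policy, which matters for stability in contact-rich loco-manipulation.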

Interactive Navigation Dataset

The InterNav dataset was created in Isaac Sim to support training and evaluation of navigation frameworks in environments where interaction is required. It contains diverse scenes with realistic obstacles at varying complexity levels, challenging models in both simulated and real-world experiments.

Figure 3: Diverse scenes and objects in the dataset.
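A single counterfactual VQA record in such a dataset might be organized along these lines; the schema and every field name here are assumptions for illustration, not the released dataset format.

```python
from dataclasses import dataclass

@dataclass
class InterNavSample:
    """Hypothetical structure of one counterfactual VQA training record."""
    image_path: str          # egocentric RGB observation
    question: str            # e.g. "Which object must be moved to reach the goal?"
    skill_affordances: dict  # object -> skills feasible under robot constraints
    answer: str              # interaction decision and target

sample = InterNavSample(
    image_path="scene_0001/rgb.png",
    question="Which object must be moved to reach the goal?",
    skill_affordances={"box": ["push"], "sofa": []},
    answer="push the box",
)
```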

Simulation and Real-World Evaluation

In simulation, CoINS significantly outperformed both traditional navigation methods and contemporary interactive systems such as IN-Sight and IN-ArmPush. Successful real-world deployment in cluttered environments further demonstrated its practical applicability: the robot applied its learned skills to interact with a variety of objects and clear paths efficiently.

Conclusion

CoINS presents a significant advance in interactive robotic navigation, combining vision-language reasoning with robust skill execution. Future work may focus on enhancing the model's 3D spatial reasoning and on deployment across diverse robot platforms, including humanoids. The framework paves the way for more adaptive and intelligent robotic systems capable of operating autonomously in complex real-world environments.

Figure 4: Real-world experimental setup, including the robot platform and the two test environments.
