
InterNav: A Benchmark for Interactive Navigation

Updated 15 January 2026
  • InterNav Dataset is a comprehensive, multi-modal benchmark that enables embodied agents to reason causally about interactive navigation through counterfactual 'what if' scenarios.
  • It provides detailed sensor data including RGB images, depth maps, 3D point clouds, and semantic annotations to support skill-aware training.
  • The dataset facilitates the evaluation of interaction strategies using physics-based simulations with quantifiable metrics on navigation success and path efficiency.

The InterNav dataset is a large-scale, multi-modal dataset specifically designed to enable causal, skill-aware reasoning in interactive navigation tasks for embodied agents. It is constructed to address the core limitation of existing navigation datasets, which generally presume the existence of a collision-free path, and thus neglect the requirement for agents to actively interact with and manipulate obstacles to create traversable paths. InterNav's primary goals are to (1) provide a corpus of egocentric training examples annotated with counterfactual "what if I removed that object?" information and (2) serve as a standardized, physics-based benchmark for evaluating interactive navigation systems in realistic, cluttered environments (Zhou et al., 7 Jan 2026).

1. Motivation and Purpose

The InterNav dataset targets two central objectives. First, it provides a large and diverse set of training samples for fine-tuning vision-language models (VLMs), endowing them with the ability to reason causally about when specific manipulation actions are necessary to achieve navigation goals given embodiment and skill constraints. Second, it establishes a reproducible, quantitative benchmark for interactive navigation systems, supporting fair comparison and ablation across reasoning and execution capabilities. InterNav explicitly bridges the gap between legacy datasets—which assume fixed traversability—and the real-world need for agents to evaluate and execute interactive behaviors, such as pushing or removing objects to create a feasible path.

2. Construction Methodology and Data Collection

The dataset is built upon 15 base layouts imported from Matterport3D, stratified into three complexity categories: Small Room, Large Room, and Room-to-Room, each with five independent obstacle configurations. For every scene, 10 start-goal pairs are sampled, yielding 150 benchmark episodes. Between 50 and 80 movable assets (e.g., boxes, barrels, chairs, doors) are randomly placed within each layout, with asset diversity exceeding 50 unique categories; size, pose, and texture are randomized to promote domain generalization.

Hundreds of camera viewpoints (with varied yaw, pitch, and mounting height) are captured per episode, resulting in approximately $10^5$ raw RGB and depth frame pairs. These modalities support the derivation of reconstructed 3D point clouds, semantic detections, occupancy maps, and traversability maps, which collectively underpin the counterfactual visual question–answering (VQA) samples at the heart of the dataset.
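The point-cloud derivation described above is standard pinhole back-projection of a metric depth map through the camera intrinsics. The sketch below illustrates this step; the function name and toy intrinsics are assumptions for illustration, not the dataset's actual tooling.

```python
import numpy as np

def backproject_depth(Z, K):
    """Back-project a metric depth map Z(u, v) into a 3D point cloud P
    using pinhole intrinsics K (standard formulation, not the paper's code)."""
    h, w = Z.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pixel grids: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * Z / fx
    y = (v - cy) * Z / fy
    # Stack into an (H*W, 3) point cloud in the camera frame.
    return np.stack([x, y, Z], axis=-1).reshape(-1, 3)

# Toy example: a flat 2x2 depth map at 1 m with illustrative intrinsics.
K = np.array([[100.0,   0.0, 1.0],
              [  0.0, 100.0, 1.0],
              [  0.0,   0.0, 1.0]])
Z = np.ones((2, 2))
P = backproject_depth(Z, K)  # four points, all at depth 1 m
```

In the dataset pipeline this map is produced per viewpoint, then aggregated across views into the reconstructed scene cloud.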

3. Data Modalities, Annotations, and Instance Structure

Each InterNav instance comprises:

  • An egocentric RGB image $o^{rgb}$ and its metric-scale depth map $Z(u,v)$.
  • A reconstructed 3D point cloud $P$, generated by back-projecting $Z$ using the camera intrinsics $K$.
  • Semantic detections of all candidate movable objects, provided as Grounding DINO bounding boxes and SAM segmentation masks, each labeled with a category $c_{obj}$.
  • A 2D occupancy map $M_{occ}$ and a robot-skill-conditioned traversability map $M_{trav}$, computed from the maximum climbable height $h_{\max}$ and clearance radius $r_{clear}$.
  • For each object $k$, a 3D center $P_{obj}^k$ and a manipulability flag $\mathcal{F}_{manip}(k)$ indicating reachability from some traversable base pose.
  • Counterfactual reasoning labels: for each manipulable object $o$, the path-length gain $G(o) = 1 - \frac{l(M_{trav}^{-o},\, x_g)}{l(M_{trav},\, x_g)}$, where $M_{trav}^{-o}$ is the traversability map with object $o$ removed. The object $o^*$ maximizing $G(o)$ (with $G(o^*) > \epsilon$) is identified as the removal target; if no such object exists, the correct action is direct navigation.
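The path-length gain can be made concrete with a small sketch. The paper computes $l(\cdot)$ with A*; on a uniform-cost grid, BFS yields the same shortest path length, so the toy below uses BFS for brevity. The grids, start/goal cells, and the convention that a newly enabled path gets gain 1 are illustrative assumptions.

```python
from collections import deque

def path_length(trav, start, goal):
    """Shortest 4-connected path length on a boolean traversability grid
    (BFS; equivalent to A* with uniform step cost). Returns None if blocked."""
    if not trav[start[0]][start[1]] or not trav[goal[0]][goal[1]]:
        return None
    q, seen = deque([(start, 0)]), {start}
    while q:
        (r, c), d = q.popleft()
        if (r, c) == goal:
            return d
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < len(trav) and 0 <= nc < len(trav[0]) \
               and trav[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                q.append(((nr, nc), d + 1))
    return None

def gain(trav, trav_wo_o, start, goal):
    """G(o) = 1 - l(M_trav^{-o}, x_g) / l(M_trav, x_g).
    Convention here (an assumption): gain 1 if removal newly enables the path,
    gain 0 if the goal stays unreachable either way."""
    l_with = path_length(trav, start, goal)
    l_wo = path_length(trav_wo_o, start, goal)
    if l_wo is None:
        return 0.0
    if l_with is None:
        return 1.0
    return 1.0 - l_wo / l_with

# Toy scene: a partial wall forces a detour; removing one object opens a shortcut.
trav = [[True, False, True],
        [True, False, True],
        [True, True,  True]]
trav_wo = [[True, True,  True],
           [True, False, True],
           [True, True,  True]]
G = gain(trav, trav_wo, (0, 0), (0, 2))  # detour of 6 steps shrinks to 2
```

Objects whose gain exceeds the threshold $\epsilon$ become removal targets; here the removal cuts the path from 6 steps to 2, giving $G = 2/3$.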

A typical VQA instance consists of inputs (RGB, goal pose $x_g$, skill set $\mathcal{S}$, constraints $\mathcal{C}$, and the skill-aware traversability map $M_{trav}$), a Chain-of-Thought (CoT) reasoning trace (skill feasibility, interaction necessity via $G(o)$), and a ground-truth answer $y$ specifying either direct navigation or a concrete skill–object action.
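The instance structure above can be mirrored as a simple record type. All field names and the answer encoding below are illustrative assumptions, not the dataset's actual serialization format.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class InterNavVQAInstance:
    """Illustrative schema for one VQA sample (field names are assumptions)."""
    rgb: Any           # egocentric RGB image o^{rgb}
    goal_pose: tuple   # goal pose x_g
    skills: list       # skill set S, e.g. ["navigate", "push", "pick"]
    constraints: dict  # embodiment constraints C (h_max, r_clear, ...)
    trav_map: Any      # skill-aware traversability map M_trav
    cot_trace: str     # Chain-of-Thought reasoning text
    answer: dict       # y: {"action": "navigate"} or
                       #    {"action": "use_skill", "skill": s, "object": o*}

sample = InterNavVQAInstance(
    rgb=None, goal_pose=(3.0, 1.5), skills=["navigate", "push"],
    constraints={"h_max": 0.1, "r_clear": 0.3}, trav_map=None,
    cot_trace="The box blocks the doorway; pushing it yields a shorter path.",
    answer={"action": "use_skill", "skill": "push", "object": "box_07"},
)
```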

4. Scene Generation and Counterfactual Label Computation

Scene creation proceeds via procedural placement of movable assets, with randomization ensuring distributional variety. For each episode and viewpoint, the pipeline executes the following steps:

  • Scene Creation: place 50–80 movable assets into each layout, randomizing size, pose, and texture (Matterport3D, random seed).
  • Viewpoint Sampling: capture hundreds of camera poses per start–goal pair, with varied orientations, yielding $o^{rgb}$ and $Z(u,v)$.
  • Metric Reconstruction: produce the point cloud $P$ from $Z$ with known intrinsics $K$ (VGGT, Map-Anything).
  • Map Generation: compute the maximum-height map $H(u,v)$, the occupancy map $M_{occ}$, and the skill-aware traversability map $M_{trav}$ ($h_{\max}$, $r_{clear}$).
  • Counterfactual Sampling: for each manipulable object $k$, remove $k$ to obtain $M_{trav}^{-k}$, compute the A* path length and $G(k)$, and assign the target/non-target label based on gain (A* search, $\epsilon$).

This process ensures that every VQA sample is informed by explicit causal impact: whether removing a given object will meaningfully improve or enable the path to the navigation goal under the robot's embodied skill set and constraints.
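The map-generation step can be sketched as follows: a cell is traversable if its obstacle height is climbable ($H \le h_{\max}$) and no blocked cell lies within the clearance radius $r_{clear}$. This is a simplified version under assumed conventions (square clearance footprint, grid in map cells), not the dataset's exact procedure.

```python
import numpy as np

def traversability_map(H, h_max, r_clear, cell_size=1.0):
    """Skill-aware traversability from a max-height map H(u, v).
    Blocked cells (H > h_max) are inflated by the clearance radius.
    A minimal sketch, assuming a square clearance footprint."""
    passable = H <= h_max
    h, w = passable.shape
    r = int(np.ceil(r_clear / cell_size))
    # Pad with True so cells beyond the map do not block clearance.
    padded = np.pad(passable, r, constant_values=True)
    trav = np.ones_like(passable)
    for du in range(-r, r + 1):
        for dv in range(-r, r + 1):
            # AND in the passability of every neighbor within the radius.
            trav &= padded[r + du : r + du + h, r + dv : r + dv + w]
    return trav

# Toy 5x5 scene with one 0.5 m obstacle at the center.
H = np.zeros((5, 5))
H[2, 2] = 0.5
trav = traversability_map(H, h_max=0.1, r_clear=1.0)
```

Raising $h_{\max}$ (e.g. for a legged robot that can step over low obstacles) marks more cells passable, which is how the same scene yields different $M_{trav}$ per embodiment.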

5. Dataset Statistics and Training Regimen

The dataset contains approximately 20,000 labeled VQA instances, split 90/10 between training and test partitions. Samples are derived from a pool of roughly $10^5$ raw frames. Each test instance comprises complete scene context, semantic annotations, skill-conditioned maps, manipulability masks, and counterfactual answer labels. Fine-tuning of Qwen3-VL on this curated corpus yields the InterNav-VLM model, where the optimization target is the token-level cross-entropy loss applied to the autoregressive output answering the VQA query ("navigate to $x_g$" or "use skill $s$ on object $o^*$"). The data supports both passive (plan-only) and active (plan+interact) navigation regimes through skill conditioning.
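The stated objective is the standard token-level cross-entropy over the autoregressive answer sequence, $\mathcal{L} = -\frac{1}{T}\sum_t \log p(y_t \mid y_{<t}, \text{inputs})$. A generic NumPy sketch of that loss (not the actual training code, which would run inside a VLM fine-tuning framework):

```python
import numpy as np

def token_cross_entropy(logits, targets):
    """Mean token-level cross-entropy for an answer sequence.
    logits: (T, V) per-token vocabulary scores; targets: (T,) token ids."""
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Pick out the log-probability of each ground-truth token.
    return -log_probs[np.arange(len(targets)), targets].mean()

# Uniform logits over a 3-token vocabulary give loss log(3) per token.
loss = token_cross_entropy(np.zeros((2, 3)), np.array([0, 1]))
```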

6. Evaluation Metrics and Benchmarking

InterNav furnishes a systematic benchmark for both reasoning and embodied navigation execution. VQA reasoning accuracy is assessed on the held-out 10% test set ($\sim$2,000 instances), with per-instance skill and object selection marked correct/incorrect. The InterNav-VLM model achieves 78.35% overall accuracy, compared to 48–58% for LLM baselines such as GPT-4o and Gemini; accuracy stratified by embodiment is 80.21% for wheeled robots (without interaction skills) and 76.45% for legged manipulators (with skills).

For interactive navigation benchmarking, success rate (SR), path length (PL), and distance-to-goal (DTG) are reported over the 150 evaluation episodes. Metrics are provided per scene category (Small Room, Large Room, Room-to-Room) and as overall averages. By integrating richly annotated, skill-aware counterfactual samples with physics-realistic simulation, InterNav enables rigorous evaluation of "what if I moved that object?" reasoning and its translation to embodied execution (Zhou et al., 7 Jan 2026).
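Aggregating the three execution metrics over episodes is straightforward; the sketch below shows one plausible convention (per-episode records with success flags and distances), though the benchmark's exact field definitions may differ.

```python
def benchmark_metrics(episodes):
    """Aggregate success rate (SR), mean path length (PL), and mean
    distance-to-goal (DTG) over evaluation episodes.
    Episode fields here are illustrative assumptions."""
    n = len(episodes)
    sr = sum(e["success"] for e in episodes) / n
    pl = sum(e["path_length"] for e in episodes) / n
    dtg = sum(e["dist_to_goal"] for e in episodes) / n
    return {"SR": sr, "PL": pl, "DTG": dtg}

# Two toy episodes: one success at the goal, one failure 2 m short.
episodes = [
    {"success": 1, "path_length": 10.0, "dist_to_goal": 0.0},
    {"success": 0, "path_length": 4.0,  "dist_to_goal": 2.0},
]
m = benchmark_metrics(episodes)
```

In the benchmark this aggregation would be run once per scene category and once over all 150 episodes for the overall averages.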

7. Significance and Applications

The InterNav dataset uniquely enables the development and benchmarking of vision-language models endowed with explicit skill awareness and counterfactual reasoning capabilities for real-world navigation tasks. It directly stimulates research in hierarchical frameworks, causal inference, and interactive planning within the context of embodied AI, as exemplified by its central role in training and evaluating the InterNav-VLM and CoINS framework. By internalizing causal chains—mapping embodiment, skills, and environmental context to actionable navigation policies—InterNav addresses a critical need for standardized, skill-conditioned, and manipulation-aware evaluation environments in the field of robotics and embodied vision-language reasoning (Zhou et al., 7 Jan 2026).
