InterNav: A Benchmark for Interactive Navigation
- The InterNav dataset is a comprehensive, multi-modal benchmark that enables embodied agents to reason causally about interactive navigation through counterfactual 'what if' scenarios.
- It provides detailed sensor data including RGB images, depth maps, 3D point clouds, and semantic annotations to support skill-aware training.
- The dataset facilitates the evaluation of interaction strategies using physics-based simulations with quantifiable metrics on navigation success and path efficiency.
The InterNav dataset is a large-scale, multi-modal dataset specifically designed to enable causal, skill-aware reasoning in interactive navigation tasks for embodied agents. It is constructed to address the core limitation of existing navigation datasets, which generally presume the existence of a collision-free path, and thus neglect the requirement for agents to actively interact with and manipulate obstacles to create traversable paths. InterNav's primary goals are to (1) provide a corpus of egocentric training examples annotated with counterfactual "what if I removed that object?" information and (2) serve as a standardized, physics-based benchmark for evaluating interactive navigation systems in realistic, cluttered environments (Zhou et al., 7 Jan 2026).
1. Motivation and Purpose
The InterNav dataset targets two central objectives. First, it provides a large and diverse set of training samples for fine-tuning vision-language models (VLMs), endowing them with the ability to reason causally about when specific manipulation actions are necessary to achieve navigation goals given embodiment and skill constraints. Second, it establishes a reproducible, quantitative benchmark for interactive navigation systems, supporting fair comparison and ablation across reasoning and execution capabilities. InterNav explicitly bridges the gap between legacy datasets, which assume fixed traversability, and the real-world need for agents to evaluate and execute interactive behaviors, such as pushing or removing objects to create a feasible path.
2. Construction Methodology and Data Collection
The dataset is built upon 15 base layouts imported from Matterport3D, stratified into three complexity categories: Small Room, Large Room, and Room-to-Room, each with five independent obstacle configurations. For every scene, 10 start-goal pairs are sampled, yielding 150 benchmark episodes. Between 50 and 80 movable assets (e.g., boxes, barrels, chairs, doors) are randomly placed within each layout, with asset diversity exceeding 50 unique categories; size, pose, and texture are randomized to promote domain generalization.
Hundreds of camera viewpoints (with varied yaw, pitch, and mounting height) are captured per episode, resulting in a large pool of raw RGB and depth frame pairs. These modalities support the derivation of reconstructed 3D point clouds, semantic detections, occupancy maps, and traversability maps, which collectively underpin the counterfactual visual question–answering (VQA) samples at the heart of the dataset.
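The skill-conditioned traversability map described above can be sketched as a simple grid operation: cells whose obstacle height exceeds the robot's maximum climbable height are blocked, and the remaining free space is eroded by the clearance radius so the robot footprint fits. This is an illustrative sketch, not the dataset's actual pipeline; the function name, the 2D height-map representation, and the square-footprint erosion are assumptions.

```python
import numpy as np

def traversability_map(height_map, h_max, clearance_cells):
    """Skill-conditioned traversability from a 2D obstacle-height map.

    A cell is traversable when its obstacle height is within the robot's
    maximum climbable height h_max; the free region is then eroded by the
    clearance radius so the robot footprint fits everywhere it may stand.
    (Illustrative sketch; names and map conventions are assumptions.)
    """
    free = height_map <= h_max  # climbable-or-empty cells
    r = clearance_cells
    # Erode free space by the clearance radius (square footprint for brevity);
    # padding with False treats the map border as non-traversable.
    padded = np.pad(free, r, mode="constant", constant_values=False)
    out = np.ones_like(free, dtype=bool)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out &= padded[r + dy : r + dy + free.shape[0],
                          r + dx : r + dx + free.shape[1]]
    return out
```

A wheeled base and a legged manipulator would call this with different `h_max` and `clearance_cells`, yielding different traversable regions from the same occupancy data, which is what makes the map "skill-aware".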
3. Data Modalities, Annotations, and Instance Structure
Each InterNav instance comprises:
- An egocentric RGB image and its metric-scale depth map.
- A reconstructed 3D point cloud, generated by back-projecting the depth map with the camera intrinsics.
- Semantic detections of all candidate movable objects, provided as Grounding DINO bounding boxes and SAM segmentation masks, each labeled with an object category.
- A 2D occupancy map and a robot-skill-conditioned traversability map, computed from the robot's maximum climbable height and clearance radius.
- For each object, a 3D center and a manipulability flag indicating reachability from some traversable base pose.
- Counterfactual reasoning labels: for each manipulable object, a path-length gain, defined as the reduction in shortest-path length when that object is removed from the traversability map. The object with the largest positive gain is labeled the removal target; if no object yields a positive gain, the correct action is direct navigation.
A typical VQA instance consists of inputs (the RGB image, goal pose, available skill set, embodiment constraints, and skill-aware traversability map), a Chain-of-Thought (CoT) reasoning trace covering skill feasibility and interaction necessity (via the counterfactual path-length gains), and a ground-truth answer specifying either direct navigation or a concrete skill–object action.
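Gathering the fields above, one instance might be represented roughly as follows. This is a hypothetical schema for illustration only; the field names, units, and answer string format are assumptions, not the dataset's actual keys.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one InterNav VQA instance; field names are
# illustrative, not the dataset's actual keys.
@dataclass
class InterNavInstance:
    rgb_path: str                # egocentric RGB frame
    depth_path: str              # metric-scale depth map
    goal_pose: tuple             # (x, y, yaw) navigation goal
    skills: list                 # e.g. ["navigate", "push", "pick"]
    constraints: dict            # e.g. {"max_climb_m": 0.1, "clearance_m": 0.3}
    objects: list = field(default_factory=list)  # boxes, masks, centers, flags
    cot_trace: str = ""          # Chain-of-Thought reasoning text
    answer: str = ""             # "navigate" or "use <skill> on <object>"

# Example instance (all values invented for illustration)
inst = InterNavInstance(
    rgb_path="ep003/view042_rgb.png",
    depth_path="ep003/view042_depth.png",
    goal_pose=(4.2, 1.0, 0.0),
    skills=["navigate", "push"],
    constraints={"max_climb_m": 0.1, "clearance_m": 0.3},
    answer="use push on box_07",
)
```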
4. Scene Generation and Counterfactual Label Computation
Scene creation proceeds via procedural placement of movable assets, leveraging randomization to ensure distributional variety. For each episode and viewpoint, the methodology comprises the following phases:
| Phase | Description | Tools/Parameters |
|---|---|---|
| Scene Creation | Place 50–80 movable assets into each layout, randomizing size, pose, and texture | Matterport3D, random seed |
| Viewpoint Sampling | Capture hundreds of camera poses per start-goal pair with varied orientations | randomized yaw, pitch, mounting height |
| Metric Reconstruction | Back-project depth into a metric point cloud using known camera intrinsics | VGGT, Map-Anything |
| Map Generation | Derive the occupancy map and skill-conditioned traversability map | climbable-height and clearance-radius thresholds |
| Counterfactual Sampling | For each manipulable object: remove it, recompute the shortest path, and assign a target/non-target label from the path-length gain | A* search |
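The counterfactual-sampling phase can be sketched as follows: remove each manipulable object's cells from the traversability grid, recompute the shortest path, and record the gain. In this minimal sketch BFS stands in for A* (equivalent on a uniform-cost grid), and the function names and the "blocked path counts as a very long path" convention are assumptions.

```python
from collections import deque

def path_length(grid, start, goal):
    """Shortest 4-connected path length on a boolean traversability grid
    (BFS stands in for A* since all step costs are equal). Returns None
    when the goal is unreachable."""
    if not grid[start[0]][start[1]] or not grid[goal[0]][goal[1]]:
        return None
    q, seen = deque([(start, 0)]), {start}
    while q:
        (r, c), d = q.popleft()
        if (r, c) == goal:
            return d
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                q.append(((nr, nc), d + 1))
    return None

def counterfactual_gains(grid, objects, start, goal, inf=10**9):
    """Path-length gain from removing each object.
    `objects` maps object id -> list of grid cells it occupies."""
    base = path_length(grid, start, goal)
    base = inf if base is None else base    # blocked path = very long path
    gains = {}
    for oid, cells in objects.items():
        g = [row[:] for row in grid]
        for r, c in cells:
            g[r][c] = True                  # counterfactually remove object
        new = path_length(g, start, goal)
        gains[oid] = base - (inf if new is None else new)
    return gains
```

An object whose removal unblocks the goal receives a large positive gain and becomes the removal target; objects whose removal changes nothing receive zero gain, matching the target/non-target labeling rule above.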
This process ensures that every VQA sample is informed by explicit causal impact: whether removing a given object will meaningfully improve or enable the path to the navigation goal under the robot's embodied skill set and constraints.
5. Dataset Statistics and Training Regimen
The dataset contains approximately 20,000 labeled VQA instances, split 90/10 between training and test partitions. Samples are derived from the pool of raw RGB-depth frames described above. Each test instance comprises complete scene context, semantic annotations, skill-conditioned maps, manipulability masks, and counterfactual answer labels. Fine-tuning Qwen3-VL on this curated corpus yields the InterNav-VLM model; the optimization target is a token-level cross-entropy loss applied to the autoregressive output answering the VQA query ("navigate to the goal" or "use a given skill on a given object"). The data supports both passive (plan-only) and active (plan+interact) navigation regimes through skill conditioning.
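The token-level objective can be illustrated with a small NumPy sketch: cross-entropy is computed per output token but averaged only over the answer span, with prompt tokens masked out. This is a generic causal-LM loss for illustration, not the actual Qwen3-VL training code; the function name and masking convention are assumptions.

```python
import numpy as np

def answer_ce_loss(logits, targets, answer_mask):
    """Token-level cross-entropy restricted to the answer span
    (illustrative sketch of the fine-tuning objective; the actual
    Qwen3-VL training code is not shown in the source).

    logits:      (T, V) next-token logits
    targets:     (T,)   ground-truth token ids
    answer_mask: (T,)   1 for answer tokens, 0 for prompt tokens
    """
    z = logits - logits.max(axis=-1, keepdims=True)      # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]   # per-token NLL
    return (nll * answer_mask).sum() / answer_mask.sum()
```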
6. Evaluation Metrics and Benchmarking
InterNav furnishes a systematic benchmark for both reasoning and embodied navigation execution. VQA reasoning accuracy is assessed on the held-out 10% test set (2,000 instances), with per-instance skill and object selection marked correct/incorrect. The InterNav-VLM model achieves 78.35% overall accuracy, compared to 48–58% for general-purpose VLM baselines such as GPT-4o and Gemini; accuracy stratified by embodiment is 80.21% for wheeled robots (without interaction skills) and 76.45% for legged manipulators (with skills).
For interactive navigation benchmarking, success rate (SR), path length (PL), and distance-to-goal (DTG) are reported over the 150 evaluation episodes. Metrics are provided per scene category (Small Room, Large Room, Room-to-Room) and as overall averages. By combining rich annotations, skill-aware maps, and counterfactual sampling with physics-realistic simulation, InterNav facilitates rigorous evaluation of "what if I moved that object?" reasoning and its translation to embodied execution (Zhou et al., 7 Jan 2026).
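The per-category and overall aggregation of SR, PL, and DTG can be sketched as a small reporting helper. The episode record layout (`category`, `success`, `path_length`, `dist_to_goal`) is hypothetical; the benchmark's actual file format is not specified in the source.

```python
def summarize_episodes(episodes):
    """Aggregate success rate (SR), mean path length (PL), and mean
    distance-to-goal (DTG) per scene category and overall. Each episode
    is a dict with 'category', 'success', 'path_length', 'dist_to_goal'
    (hypothetical record layout for illustration)."""
    groups = {}
    for ep in episodes:
        groups.setdefault(ep["category"], []).append(ep)
    groups["Overall"] = list(episodes)
    report = {}
    for cat, eps in groups.items():
        n = len(eps)
        report[cat] = {
            "SR": sum(e["success"] for e in eps) / n,
            "PL": sum(e["path_length"] for e in eps) / n,
            "DTG": sum(e["dist_to_goal"] for e in eps) / n,
        }
    return report
```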
7. Significance and Applications
The InterNav dataset uniquely enables the development and benchmarking of vision-language models endowed with explicit skill awareness and counterfactual reasoning capabilities for real-world navigation tasks. It directly stimulates research in hierarchical frameworks, causal inference, and interactive planning within the context of embodied AI, as exemplified by its central role in training and evaluating the InterNav-VLM and CoINS framework. By internalizing causal chains, mapping embodiment, skills, and environmental context to actionable navigation policies, InterNav addresses a critical need for standardized, skill-conditioned, and manipulation-aware evaluation environments in the field of robotics and embodied vision-language reasoning (Zhou et al., 7 Jan 2026).