InterNav: A Benchmark for Interactive Navigation
- The InterNav dataset is a comprehensive, multi-modal benchmark that enables embodied agents to reason causally about interactive navigation through counterfactual 'what if' scenarios.
- It provides detailed sensor data including RGB images, depth maps, 3D point clouds, and semantic annotations to support skill-aware training.
- The dataset facilitates the evaluation of interaction strategies using physics-based simulations with quantifiable metrics on navigation success and path efficiency.
The InterNav dataset is a large-scale, multi-modal dataset specifically designed to enable causal, skill-aware reasoning in interactive navigation tasks for embodied agents. It is constructed to address the core limitation of existing navigation datasets, which generally presume the existence of a collision-free path, and thus neglect the requirement for agents to actively interact with and manipulate obstacles to create traversable paths. InterNav's primary goals are to (1) provide a corpus of egocentric training examples annotated with counterfactual "what if I removed that object?" information and (2) serve as a standardized, physics-based benchmark for evaluating interactive navigation systems in realistic, cluttered environments (Zhou et al., 7 Jan 2026).
1. Motivation and Purpose
The InterNav dataset targets two central objectives. First, it provides a large and diverse set of training samples for fine-tuning vision-language models (VLMs), endowing them with the ability to reason causally about when specific manipulation actions are necessary to achieve navigation goals given embodiment and skill constraints. Second, it establishes a reproducible, quantitative benchmark for interactive navigation systems, supporting fair comparison and ablation across reasoning and execution capabilities. InterNav explicitly bridges the gap between legacy datasets, which assume fixed traversability, and the real-world need for agents to evaluate and execute interactive behaviors, such as pushing or removing objects to create a feasible path.
2. Construction Methodology and Data Collection
The dataset is built upon 15 base layouts imported from Matterport3D, stratified into three complexity categories: Small Room, Large Room, and Room-to-Room, each with five independent obstacle configurations. For every scene, 10 start-goal pairs are sampled, yielding 150 benchmark episodes. Between 50 and 80 movable assets (e.g., boxes, barrels, chairs, doors) are randomly placed within each layout, with asset diversity exceeding 50 unique categories; size, pose, and texture are randomized to promote domain generalization.
Hundreds of camera viewpoints (with varied yaw, pitch, and mounting height) are captured per episode, resulting in a large pool of raw RGB and depth frame pairs. These modalities support the derivation of reconstructed 3D point clouds, semantic detections, occupancy maps, and traversability maps, which collectively underpin the counterfactual visual question–answering (VQA) samples at the heart of the dataset.
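The skill-conditioned traversability map described above can be sketched as a simple grid operation: cells whose obstacle height exceeds the robot's maximum climbable height are blocked, and the remaining free space is eroded by the clearance radius so the robot footprint fits. This is an illustrative sketch, not the dataset's actual pipeline; the function name, the 2D height-map representation, and the square-footprint erosion are assumptions.

```python
import numpy as np

def traversability_map(height_map, h_max, clearance_cells):
    """Skill-conditioned traversability from a 2D obstacle-height map.

    A cell is traversable when its obstacle height is within the robot's
    maximum climbable height h_max; the free region is then eroded by the
    clearance radius so the robot footprint fits everywhere it may stand.
    (Illustrative sketch; names and map conventions are assumptions.)
    """
    free = height_map <= h_max  # climbable-or-empty cells
    r = clearance_cells
    # Erode free space by the clearance radius (square footprint for brevity);
    # padding with False treats the map border as non-traversable.
    padded = np.pad(free, r, mode="constant", constant_values=False)
    out = np.ones_like(free, dtype=bool)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out &= padded[r + dy : r + dy + free.shape[0],
                          r + dx : r + dx + free.shape[1]]
    return out
```

A wheeled base and a legged manipulator would call this with different `h_max` and `clearance_cells`, yielding different traversable regions from the same occupancy data, which is what makes the map "skill-aware".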
3. Data Modalities, Annotations, and Instance Structure
Each InterNav instance comprises:
- An egocentric RGB image and its metric-scale depth map.
- A reconstructed 3D point cloud, generated by back-projecting the depth map with the camera intrinsics.
- Semantic detections of all candidate movable objects, provided as Grounding DINO bounding boxes and SAM segmentation masks, each labeled with an object category.
- A 2D occupancy map and a robot-skill-conditioned traversability map, computed from the robot's maximum climbable height and clearance radius.
- For each object, a 3D center and a manipulability flag indicating reachability from some traversable base pose.
- Counterfactual reasoning labels: for each manipulable object, a path-length gain, defined as the reduction in shortest-path length when that object is removed from the traversability map. The object with the largest positive gain is labeled the removal target; if no object yields a positive gain, the correct action is direct navigation.
A typical VQA instance consists of inputs (the RGB image, goal pose, available skill set, embodiment constraints, and skill-aware traversability map), a Chain-of-Thought (CoT) reasoning trace covering skill feasibility and interaction necessity (via the counterfactual path-length gains), and a ground-truth answer specifying either direct navigation or a concrete skill–object action.
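Gathering the fields above, one instance might be represented roughly as follows. This is a hypothetical schema for illustration only; the field names, units, and answer string format are assumptions, not the dataset's actual keys.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one InterNav VQA instance; field names are
# illustrative, not the dataset's actual keys.
@dataclass
class InterNavInstance:
    rgb_path: str                # egocentric RGB frame
    depth_path: str              # metric-scale depth map
    goal_pose: tuple             # (x, y, yaw) navigation goal
    skills: list                 # e.g. ["navigate", "push", "pick"]
    constraints: dict            # e.g. {"max_climb_m": 0.1, "clearance_m": 0.3}
    objects: list = field(default_factory=list)  # boxes, masks, centers, flags
    cot_trace: str = ""          # Chain-of-Thought reasoning text
    answer: str = ""             # "navigate" or "use <skill> on <object>"

# Example instance (all values invented for illustration)
inst = InterNavInstance(
    rgb_path="ep003/view042_rgb.png",
    depth_path="ep003/view042_depth.png",
    goal_pose=(4.2, 1.0, 0.0),
    skills=["navigate", "push"],
    constraints={"max_climb_m": 0.1, "clearance_m": 0.3},
    answer="use push on box_07",
)
```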
4. Scene Generation and Counterfactual Label Computation
Scene creation proceeds via procedural placement of movable assets, leveraging randomization to ensure distributional variety. For each episode and viewpoint, the methodology comprises the following phases:
| Phase | Description | Tools/Parameters |
|---|---|---|
| Scene Creation | Place 50–80 movable assets into each layout, randomizing size, pose, and texture | Matterport3D, random seed |
| Viewpoint Sampling | Capture hundreds of camera poses per start-goal pair with varied orientations | randomized yaw, pitch, mounting height |
| Metric Reconstruction | Back-project depth into a metric point cloud using known camera intrinsics | VGGT, Map-Anything |
| Map Generation | Derive the occupancy map and skill-conditioned traversability map | climbable-height and clearance-radius thresholds |
| Counterfactual Sampling | For each manipulable object: remove it, recompute the shortest path, and assign a target/non-target label from the path-length gain | A* search |
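The counterfactual-sampling phase can be sketched as follows: remove each manipulable object's cells from the traversability grid, recompute the shortest path, and record the gain. In this minimal sketch BFS stands in for A* (equivalent on a uniform-cost grid), and the function names and the "blocked path counts as a very long path" convention are assumptions.

```python
from collections import deque

def path_length(grid, start, goal):
    """Shortest 4-connected path length on a boolean traversability grid
    (BFS stands in for A* since all step costs are equal). Returns None
    when the goal is unreachable."""
    if not grid[start[0]][start[1]] or not grid[goal[0]][goal[1]]:
        return None
    q, seen = deque([(start, 0)]), {start}
    while q:
        (r, c), d = q.popleft()
        if (r, c) == goal:
            return d
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                q.append(((nr, nc), d + 1))
    return None

def counterfactual_gains(grid, objects, start, goal, inf=10**9):
    """Path-length gain from removing each object.
    `objects` maps object id -> list of grid cells it occupies."""
    base = path_length(grid, start, goal)
    base = inf if base is None else base    # blocked path = very long path
    gains = {}
    for oid, cells in objects.items():
        g = [row[:] for row in grid]
        for r, c in cells:
            g[r][c] = True                  # counterfactually remove object
        new = path_length(g, start, goal)
        gains[oid] = base - (inf if new is None else new)
    return gains
```

An object whose removal unblocks the goal receives a large positive gain and becomes the removal target; objects whose removal changes nothing receive zero gain, matching the target/non-target labeling rule above.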
This process ensures that every VQA sample is informed by explicit causal impact: whether removing a given object will meaningfully improve or enable the path to the navigation goal under the robot's embodied skill set and constraints.
5. Dataset Statistics and Training Regimen
The dataset contains approximately 20,000 labeled VQA instances, split 90/10 between training and test partitions. Samples are derived from the pool of raw RGB-depth frames described above. Each test instance comprises complete scene context, semantic annotations, skill-conditioned maps, manipulability masks, and counterfactual answer labels. Fine-tuning Qwen3-VL on this curated corpus yields the InterNav-VLM model; the optimization target is a token-level cross-entropy loss applied to the autoregressive output answering the VQA query ("navigate to the goal" or "use a given skill on a given object"). The data supports both passive (plan-only) and active (plan+interact) navigation regimes through skill conditioning.
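The token-level objective can be illustrated with a small NumPy sketch: cross-entropy is computed per output token but averaged only over the answer span, with prompt tokens masked out. This is a generic causal-LM loss for illustration, not the actual Qwen3-VL training code; the function name and masking convention are assumptions.

```python
import numpy as np

def answer_ce_loss(logits, targets, answer_mask):
    """Token-level cross-entropy restricted to the answer span
    (illustrative sketch of the fine-tuning objective; the actual
    Qwen3-VL training code is not shown in the source).

    logits:      (T, V) next-token logits
    targets:     (T,)   ground-truth token ids
    answer_mask: (T,)   1 for answer tokens, 0 for prompt tokens
    """
    z = logits - logits.max(axis=-1, keepdims=True)      # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]   # per-token NLL
    return (nll * answer_mask).sum() / answer_mask.sum()
```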
6. Evaluation Metrics and Benchmarking
InterNav furnishes a systematic benchmark for both reasoning and embodied navigation execution. VQA reasoning accuracy is assessed on the held-out 10% test set (2,000 instances), with per-instance skill and object selection marked correct/incorrect. The InterNav-VLM model achieves 78.35% overall accuracy, compared to 48–58% for general-purpose VLM baselines such as GPT-4o and Gemini; accuracy stratified by embodiment is 80.21% for wheeled robots (without interaction skills) and 76.45% for legged manipulators (with skills).
For interactive navigation benchmarking, success rate (SR), path length (PL), and distance-to-goal (DTG) are reported over the 150 evaluation episodes. Metrics are provided per scene category (Small Room, Large Room, Room-to-Room) and as overall averages. By combining rich annotations, skill-aware maps, and counterfactual sampling with physics-realistic simulation, InterNav facilitates rigorous evaluation of "what if I moved that object?" reasoning and its translation to embodied execution (Zhou et al., 7 Jan 2026).
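The per-category and overall aggregation of SR, PL, and DTG can be sketched as a small reporting helper. The episode record layout (`category`, `success`, `path_length`, `dist_to_goal`) is hypothetical; the benchmark's actual file format is not specified in the source.

```python
def summarize_episodes(episodes):
    """Aggregate success rate (SR), mean path length (PL), and mean
    distance-to-goal (DTG) per scene category and overall. Each episode
    is a dict with 'category', 'success', 'path_length', 'dist_to_goal'
    (hypothetical record layout for illustration)."""
    groups = {}
    for ep in episodes:
        groups.setdefault(ep["category"], []).append(ep)
    groups["Overall"] = list(episodes)
    report = {}
    for cat, eps in groups.items():
        n = len(eps)
        report[cat] = {
            "SR": sum(e["success"] for e in eps) / n,
            "PL": sum(e["path_length"] for e in eps) / n,
            "DTG": sum(e["dist_to_goal"] for e in eps) / n,
        }
    return report
```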
7. Significance and Applications
The InterNav dataset uniquely enables the development and benchmarking of vision-language models endowed with explicit skill awareness and counterfactual reasoning capabilities for real-world navigation tasks. It directly stimulates research in hierarchical frameworks, causal inference, and interactive planning within the context of embodied AI, as exemplified by its central role in training and evaluating the InterNav-VLM and CoINS framework. By internalizing causal chains, mapping embodiment, skills, and environmental context to actionable navigation policies, InterNav addresses a critical need for standardized, skill-conditioned, and manipulation-aware evaluation environments in the field of robotics and embodied vision-language reasoning (Zhou et al., 7 Jan 2026).