SD-OVON: A Semantics-aware Dataset and Benchmark Generation Pipeline for Open-Vocabulary Object Navigation in Dynamic Scenes

Published 24 May 2025 in cs.CV, cs.AI, and cs.RO | (2505.18881v1)

Abstract: We present the Semantics-aware Dataset and Benchmark Generation Pipeline for Open-vocabulary Object Navigation in Dynamic Scenes (SD-OVON). It utilizes pretrained multimodal foundation models to generate unlimited unique photo-realistic scene variants that adhere to real-world semantics and daily commonsense for the training and evaluation of navigation agents, accompanied by a plugin for generating object navigation task episodes compatible with the Habitat simulator. In addition, we offer two pre-generated object navigation task datasets, SD-OVON-3k and SD-OVON-10k, comprising about 3k and 10k episodes, respectively, of the open-vocabulary object navigation task. They are derived from the SD-OVON-Scenes dataset, with 2.5k photo-realistic scans of real-world environments, and the SD-OVON-Objects dataset, with 0.9k manually inspected scanned and artist-created manipulatable object models. Unlike prior datasets limited to static environments, SD-OVON covers dynamic scenes and manipulatable objects, facilitating both real-to-sim and sim-to-real robotic applications. This approach enhances the realism of navigation tasks and the training and evaluation of open-vocabulary object navigation agents in complex settings. To demonstrate the effectiveness of our pipeline and datasets, we propose two baselines and evaluate them, along with state-of-the-art baselines, on SD-OVON-3k. The datasets, benchmark, and source code are publicly available.

Summary

Semantics-aware Dataset and Benchmark Generation for Open-Vocabulary Object Navigation

The study presents SD-OVON, a pipeline designed to support the development and evaluation of open-vocabulary object navigation (OVON) agents operating in dynamic real-world settings. By integrating state-of-the-art vision-language models (VLMs) and large language models (LLMs), SD-OVON generates an unlimited number of unique, photo-realistic scene variants that adhere to real-world semantics and commonsense. This addresses a limitation of static datasets, which fail to account for the ever-changing configurations of real environments, where everyday objects are regularly repositioned. SD-OVON thereby enables the generation of diverse training data and realistic benchmarks for open-vocabulary navigation in complex, dynamic environments.

The authors provide two datasets, SD-OVON-3k and SD-OVON-10k, with approximately 3,000 and 10,000 episodes respectively. These are derived from real-world scene scans combined with 0.9k manually inspected scanned and artist-created manipulatable object models, and are intended for use in simulation via a plugin compatible with the Habitat simulator. The datasets increase the realism of the generated environments and strengthen the capabilities of navigation agents in real-world applications, bridging real-to-sim and sim-to-real use cases.
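Habitat-compatible ObjectNav episodes are typically serialized as JSON records. As a rough sketch of what one SD-OVON-style episode might contain (field names follow Habitat's ObjectNav episode convention; the scene path, pose values, and goal label below are hypothetical, not taken from the dataset):

```python
import json

# Hypothetical episode record following Habitat's ObjectNav episode schema.
# All concrete values here are illustrative placeholders.
episode = {
    "episode_id": "0",
    "scene_id": "scans/example_scene.glb",    # photo-realistic scan (hypothetical path)
    "start_position": [1.5, 0.0, -2.3],       # agent start, scene coordinates (metres)
    "start_rotation": [0.0, 0.0, 0.0, 1.0],   # start orientation as quaternion (x, y, z, w)
    "object_category": "coffee mug",          # open-vocabulary goal label
}

print(json.dumps(episode, indent=2))
```

A dataset of episodes is then just a list of such records, which the simulator plugin loads to instantiate each navigation task.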

Two baseline methods, Random Receptacle Navigation A* and Semantic Navigation A*, are also proposed and evaluated, demonstrating how semantic awareness can improve performance relative to state-of-the-art baselines. The results underscore the effectiveness of SD-OVON and its datasets for training and evaluating OVON agents in dynamic environments.
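Both baselines plan paths with A* search. As a generic illustration of the underlying algorithm (on a simplified 2-D occupancy grid, not the paper's actual planner or map representation):

```python
import heapq

def astar(grid, start, goal):
    """A* on a 2-D occupancy grid (0 = free, 1 = blocked).

    Uses an admissible Manhattan-distance heuristic; returns a list of
    cells from start to goal, or None if the goal is unreachable.
    Generic illustration only, not the paper's implementation.
    """
    rows, cols = len(grid), len(grid[0])

    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_heap = [(h(start), 0, start)]   # (f = g + h, g, cell)
    came_from = {start: None}
    g_cost = {start: 0}
    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = []                    # reconstruct by walking parents
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                ng = g + 1
                if ng < g_cost.get(nxt, float("inf")):
                    g_cost[nxt] = ng
                    came_from[nxt] = cur
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None

grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = astar(grid, (0, 0), (2, 0))
print(path)  # the single route around the blocked row, 7 cells long
```

The two baselines differ in how the goal cell is chosen: the random variant heads to an arbitrary receptacle, while the semantic variant exploits object-receptacle semantics to pick likely goal locations before running the same search.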

Several notable contributions define this work:

  • Dynamic Scene Generation: SD-OVON moves beyond static dataset paradigms by incorporating realistic, dynamic scenes that reflect the everyday rearrangement of objects in daily environments.
  • Extensive Dataset Offering: The two meticulously crafted datasets, offering thousands of episodes, stem from carefully selected object models, providing researchers with a high degree of realism and variability.
  • Semantic Relevance Infusion: The pipeline incorporates regional and object-receptacle semantics, so that target objects appear in locations consistent with real-world usage and environmental context.
  • Cross-platform Applicability: By supporting real-to-sim and sim-to-real transfer, the pipeline paves the way for broader applications in robotic systems that interact with changing environments in a more realistic manner.

The findings highlight critical directions for advancing OVON technologies, notably the importance of incorporating environmental context and semantics in dynamic settings that closely reflect real-life scenarios. Future work could further enhance the diversity and realism of dynamically generated environments, potentially by incorporating generative models for 3D asset creation, and improve object detection within these contexts.
