Open-set Semantic Mapping in 3D Environments
- Open-set semantic mapping is a process that builds 3D maps with flexible, language-aligned annotations beyond fixed label sets, enabling zero-shot queries.
- It leverages ensembles of foundation models such as CLIP, DINO, and SAM for instance detection, segmentation, and embedding fusion, integrating novel objects into geometric representations.
- This approach supports dynamic applications such as natural language-guided navigation and real-time mapping in changing environments with robust, queryable semantic information.
Open-set semantic mapping is the process of constructing geometric maps of 3D environments that are densely or instance-wise annotated with semantic information, where the set of object or concept labels is not restricted to a pre-defined closed set but can extend to novel or arbitrary classes as encountered online. The resulting maps are designed to be queryable by natural language, images, or other modalities, enabling zero-shot or open-ended retrieval and spatial reasoning on objects, parts, and regions never seen during system training. Recent advances leverage foundation models in vision and vision-language domains—including segmenters, object detectors, and text-image encoders such as CLIP and DINO—to provide these maps with robust, transferable, and flexible representations for navigation, manipulation, and general robotic intelligence.
1. Problem Definition and Motivation
Open-set semantic mapping seeks to build, from a time-ordered sequence of posed RGB-D (or RGB-only, for monocular SLAM) observations, a 3D scene representation that encodes:
- The explicit geometry of the environment as points, surfels, voxels, or other primitives.
- Instance or region-level groupings reflecting discrete objects or semantically significant regions.
- For each object or region, one or more continuous vision-language embeddings and/or open-set labels, not limited to a pre-trained class vocabulary.
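As an illustration of these three components, a minimal container for a single map element might look like the following sketch (the class name and fields are hypothetical, not taken from any cited system):

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class MapInstance:
    """One open-set map element: geometry plus open-vocabulary semantics."""
    points: np.ndarray                            # (N, 3) world-frame geometry
    feat: np.ndarray                              # (D,) language-aligned embedding
    labels: list = field(default_factory=list)    # open-set text tags, may grow online
    count: int = 0                                # number of observations fused so far
```

The key departure from closed-set mapping is that `labels` is an unbounded list of text strings and `feat` lives in a continuous vision-language space, so neither is constrained to a fixed vocabulary.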
This is in contrast to traditional closed-set mapping, where categories are fixed to a finite set (e.g., COCO or NYUv2 classes) and only these can be attached to map elements. In the open-set regime, the system cannot assume that all potential categories are known a priori; it must be able to detect, differentiate, and index novel classes or entities at run time (Nanwani et al., 2024).
Motivations for open-set semantic mapping include:
- Robust language-guided navigation and planning for instructions targeting rare or unseen objects.
- Real-time adaptation and mapping in environments domain-shifted from training data.
- Instance-specific spatial reasoning needed for human-robot interaction, e.g., "go to the blue chair left of the printer."
- Generalization across tasks, agents, and deployment settings, breaking the constraint of closed-set recognition.
2. Core Methodological Approaches
2.1 Detection and Aggregation of Open-set Instances
Modern open-set pipelines typically employ an ensemble of foundation models for segmentation, detection, and feature extraction:
- Instance proposal: class-agnostic segmenters (e.g., SAM), open-vocabulary taggers (e.g., RAM), and open-vocabulary detection transformers (e.g., Grounding DINO, YOLO-World) are combined in sequence or in parallel. Class-agnostic masks from SAM provide the grouping, while tags from RAM and bounding boxes from detectors are used for associating (potentially open-set) text labels.
- Feature extraction: Instance or region image crops are embedded via powerful vision or vision-language encoders such as CLIP or DINOv2 (Nanwani et al., 2024).
- Back-projection: Masked regions are associated with depth and camera pose to generate point clouds or voxel sets for geometric integration.
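The back-projection step above amounts to inverting the pinhole camera model for every masked pixel with valid depth. A minimal sketch, assuming metric depth and a camera-to-world pose matrix (the function name and argument layout are illustrative):

```python
import numpy as np


def backproject_mask(depth, mask, K, T_wc):
    """Lift masked depth pixels into a world-frame point cloud.

    depth : (H, W) float array of metric depth values
    mask  : (H, W) bool array from the 2D segmenter
    K     : (3, 3) camera intrinsics
    T_wc  : (4, 4) camera-to-world pose
    """
    v, u = np.nonzero(mask & (depth > 0))      # pixel coordinates with valid depth
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]            # pinhole model: X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    pts_world = (T_wc @ pts_cam.T).T[:, :3]    # homogeneous transform to world frame
    return pts_world
```

The resulting points (or a voxelization of them) become the geometric footprint of the detected instance during fusion.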
2.2 Incremental Data Fusion and Instance Management
Open-set semantic instance maps incrementally associate new detections with existing map objects by measuring both geometric overlap and similarity in the joint vision-language embedding space:
- Embedding fusion and averaging: When a new observation is merged into an existing instance, the semantic embedding is updated via a weighted average determined by pixel/point count.
- Similarity metrics: Cosine similarity in the joint embedding space, geometric overlap (e.g., 3D IoU between point sets), or a weighted combination of the two are used for detection-instance association and for thresholding new-instance decisions.
- Novelty detection: If no existing instance in the map is close enough under the chosen similarity score, the detection is instantiated as a new open-set class without retraining (Nanwani et al., 2024).
2.3 Querying and Language Alignment
Because all map entities carry language-aligned embeddings, queries—free-form text, image, or even audio—are embedded into the same vector space and compared against map entities by cosine similarity to retrieve corresponding spatial regions or object instances. This mechanism underpins zero-shot and open-ended retrieval (Nanwani et al., 2024).
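The retrieval mechanism reduces to a nearest-neighbor lookup in the shared embedding space. A minimal sketch, assuming the query has already been embedded by a CLIP-style text encoder and each instance stores a unit-norm feature (names and fields are illustrative):

```python
import numpy as np


def query_map(query_feat, instances, top_k=3):
    """Rank map instances by cosine similarity to a query embedding.

    query_feat : (D,) embedding of the query (e.g., from a CLIP text encoder)
    instances  : list of dicts with 'name' and unit-norm 'feat' (D,)
    """
    q = query_feat / np.linalg.norm(query_feat)             # normalize the query
    scored = [(float(q @ inst["feat"]), inst["name"]) for inst in instances]
    scored.sort(reverse=True)                               # highest similarity first
    return scored[:top_k]
```

Because the comparison is modality-agnostic, the same function serves text, image, or audio queries as long as the corresponding encoder maps them into the shared space.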
2.4 Geometric Representations
Representations include:
- 3D point clouds/TSDF/voxels: Each entity stores a union of all 3D points or voxels observed so far, enabling rigorous geometry-based association.
- Gaussian splats: High-fidelity 3DGS methods maintain sets of 3D Gaussians for dense, photorealistic, and memory-efficient reconstructions with semantic fields (Yoo et al., 9 Dec 2025, Maggio et al., 7 Mar 2025).
- Sparse instance graphs/scene graphs: Graphs with nodes as objects or regions, and edges denoting spatial or topological relationships, are especially amenable to integration with LLM-based reasoning (Loo et al., 2024).
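As a toy illustration of the scene-graph representation, the sketch below builds nodes from object instances and adds edges for a single "near" relation computed from centroids; real systems derive richer predicates (left_of, on_top_of, contains) from relative geometry in the same way. All names and the distance threshold are hypothetical:

```python
import numpy as np


def build_scene_graph(objects, near_thresh=1.5):
    """Toy instance graph: nodes are objects, edges link nearby pairs.

    objects : list of dicts with 'name' and 'centroid' (3,)
    Returns (nodes, edges) with edges as (i, j, relation, distance) tuples.
    """
    nodes = {i: obj["name"] for i, obj in enumerate(objects)}
    edges = []
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            d = np.linalg.norm(objects[i]["centroid"] - objects[j]["centroid"])
            if d < near_thresh:
                edges.append((i, j, "near", float(d)))
    return nodes, edges
```

Such a graph can be serialized to text and handed to an LLM for commonsense spatial reasoning, which is the integration path noted above.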
3. Example Architectures and Pipelines
| System | Core Representation | Instance Grouping | Semantic Embedding | Novelty Handling/Query |
|---|---|---|---|---|
| O3D-SIM | 3D point clouds | Mask fusion, DBSCAN | CLIP + DINO, fused | Cosine sim, threshold |
| ConceptFusion | Surfels/TSDF | Mask2Former/SAM | CLIP/DINO, per-pixel | Multimodal queries |
| OpenMonoGS-SLAM | 3D Gaussians | SAM + CLIP, cluster | Attn-mem CLIP codes | Cosine sim, memory |
| OpenGS-SLAM | 3D Gaussians | Voting splat, label consensus | Explicit 1D id | Online label table |
| DualMap | Anchor/Volatile object set | YOLO/MobileSAM + FastSAM | CLIP-image/text mix | Cos + region grouping |
| OSG | Scene graph (rooms/objs) | VFM mask, LLM | VFM + LLM text attrs | LLM-aided traversal |
Details for each can be found in (Nanwani et al., 2024, Jatavallabhula et al., 2023, Yoo et al., 9 Dec 2025, Yang et al., 3 Mar 2025, Jiang et al., 2 Jun 2025, Loo et al., 2024).
4. Key Innovations and Technical Components
- Open-set 2D segmentation and detection: Class-agnostic instance proposal pipelines (e.g., [RAM → Grounding DINO → SAM] in O3D-SIM (Nanwani et al., 2024)) allow observation of novel entities not seen at training time.
- Vision-language feature fusion: Combining CLIP and DINO (or similar) embeddings at the instance level ensures robustness to appearance changes and semantic breadth.
- Instance-level embedding and merging: Semantic grouping based on the joint vision-language feature space, together with geometric fusion, stabilizes mapping, supports instance discrimination, and enables seamless addition of previously unseen categories.
- Language-aligned queryability: Each instance or region can be indexed by textual description, enhancing natural-language interfaces and arbitrary open-vocabulary tasks.
- Self-supervised and probabilistic update rules: Bayesian fusion, memory-efficient aggregation, and explicit modeling of feature or detection uncertainty (e.g., Normal-Inverse-Gamma tracking in (Sheppard et al., 15 Dec 2025); semantic uncertainty-aware affinity in (Singh et al., 2024)) deliver robust online operation.
- Hierarchical/topological structuring: Scene/region graphs and anchor/volatile object dual-maps enhance efficiency, robustness to dynamic scenes, and LLM-based commonsense reasoning for task planning.
5. Evaluation Protocols and Benchmark Results
Evaluation centers on both geometric/semantic map fidelity and the success rate of downstream tasks (e.g., language-guided navigation):
3D Segmentation Metrics:
- Mean Intersection-over-Union (mIoU): voxel/point-level overlap with ground-truth classes (as in Replica, ScanNet).
- Instance-level mean average precision (mAP): for instance segmentation benchmarks (e.g., mAP₅₀ in SceneNN, ScanNet (Liu et al., 2024)).
- Success Rate (SR): fraction of vision-language navigation tasks completed successfully (robot reaches the correct instance given a language or semantic query).
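For concreteness, the mIoU metric above can be computed per class and averaged over classes present in the ground truth, as in this minimal sketch over flattened point/voxel label arrays:

```python
import numpy as np


def mean_iou(pred, gt, num_classes):
    """Point/voxel-level mean IoU over classes present in the ground truth.

    pred, gt : 1D integer label arrays of equal length
    """
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        if g.sum() == 0:                       # skip classes absent from ground truth
            continue
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

In the open-set setting, `num_classes` is determined by the evaluation vocabulary (e.g., the Replica or ScanNet label set), with predictions obtained by matching each map embedding to its nearest class-name embedding.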
Example closed/open-set results (from (Nanwani et al., 2024)):
| Method | Human SR | Auto SR |
|---|---|---|
| VLMaps | 0.40 | 0.46 |
| SI-Maps (K=5) | 0.70 | 0.74 |
| O3D-SIM | 0.82 | 0.84 |
Qualitative evaluations confirm that open-set methods segment and retrieve instances ("wheelchairs," "mannequins," "mobile robots") inaccessible to closed-set models. In online mapping, memory footprint and mapping rate are also tracked, with recent systems (e.g., SLIM-VDB (Sheppard et al., 15 Dec 2025)) demonstrating order-of-magnitude reductions in resource use for dense open-set maps.
6. Representative Applications and Experimental Insights
- Language-Guided Navigation/ObjectNav: Agents use natural language queries to navigate to specific objects or regions, with open-set mapping permitting specification of arbitrary or compositional goals (e.g., "find the blue mannequin left of the printer") (Nanwani et al., 2024).
- Active Exploration: Systems exploit open-set semantic maps to inform exploration drives (frontier selection, information-driven region queries), outperforming fixed-class or geometric-only planners (Kuang et al., 2024, Laina et al., 11 Apr 2025).
- Interaction with Dynamic Environments: Dual-map and sparse scene-graph approaches (e.g., DualMap (Jiang et al., 2 Jun 2025), osmAG-LLM (Xie et al., 17 Jul 2025)) allow for reasoning and recovery under object re-localization, scene rearrangement, and unobserved objects.
- Cross-Modality and Multilingual Retrieval: ConceptFusion-style models accept image, audio, or text queries in multiple languages using aligned foundation model embeddings (Jatavallabhula et al., 2023).
- Memory- and Compute-Efficient Embedded Applications: Submap or Gaussian-splat methods (e.g., FindAnything (Laina et al., 11 Apr 2025), OpenMonoGS-SLAM (Yoo et al., 9 Dec 2025)) deliver real-time performance on resource-constrained platforms by minimizing per-region feature storage and leveraging efficient fusion.
7. Limitations, Challenges, and Directions
- Feature/embedding drift: Long-term drift in vision-language model features or their statistics may impact open-set discrimination and map stability; uncertainty estimation and online adaptation are ongoing research topics (Singh et al., 2024, Sheppard et al., 15 Dec 2025).
- Spatial Consistency: Over-segmentation (due to inconsistent mask fragments from different views) is mitigated by merging via semantic/3D overlap but remains a challenge for fine-grained objects (Liu et al., 2024).
- Memory Scalability: Storing high-dimensional features per voxel or surfel is prohibitive in large maps; memory bank, object-centric fusion, and sparse/instance-level representations mitigate this issue (Yoo et al., 9 Dec 2025, Laina et al., 11 Apr 2025).
- Dynamic and Unmapped Object Handling: Methods such as osmAG-LLM focus on robust reasoning under dynamic movement and unseen instances via text-centric and LLM-based probabilistic reasoning (Xie et al., 17 Jul 2025).
- Evaluation under Hard Conditions: OSMa-Bench evaluates open-set mapping under varying illumination and dynamic scenes, highlighting degradation points and informing directions such as adaptive normalization and "unknown embedding" learning (Popov et al., 13 Mar 2025).
In sum, open-set semantic mapping fundamentally expands the capabilities of robot perception and interaction by eliminating the closed-category constraint, enabling semantic generalization via integration of strong vision-language foundation models, and supporting robust, instance-aware task execution in complex, real-world environments (Nanwani et al., 2024).