Open Set Semantic Mapping
- Open set semantic mapping is a method that constructs spatial maps with open-vocabulary labels instead of fixed classes, allowing zero-shot recognition of novel objects.
- It integrates multi-view semantic fusion with vision–language models and foundation embeddings to accurately segment and label both known and unseen entities.
- The approach supports efficient spatial reasoning in robotics by enabling runtime, query-based mapping and adaptive scene interpretation without retraining.
Open set semantic mapping denotes the construction of a spatial or object-level world representation in which semantic labels are not restricted to a fixed category set defined at training time. Instead, the mapping process leverages open-vocabulary segmentation, vision–language models, or foundation model embeddings to recognize, segment, and encode both previously seen and truly novel, user-specified categories without retraining. This enables robot agents and interactive systems to operate in previously unseen environments and to handle concepts beyond closed-set limits (Yang et al., 3 Mar 2025, Popov et al., 13 Mar 2025, Jatavallabhula et al., 2023, Sheppard et al., 15 Dec 2025, Maggio et al., 7 Mar 2025, Loo et al., 2024, Günther et al., 3 Feb 2026, Xie et al., 17 Jul 2025, Yoo et al., 9 Dec 2025, Alama et al., 9 Apr 2025, Singh et al., 2024, Singh et al., 2024). Unlike closed-set approaches, which hard-code class vocabularies into network weights and map structures, open-set semantic mapping exposes explicit, extensible semantics in map elements (voxels, Gaussians, surfels, graph nodes), enabling run-time, zero-shot queries and high-level symbolic reasoning for robotics and spatial AI.
1. Problem Definition, Motivation, and Challenges
Open set semantic mapping aims to endow autonomous agents with the ability to construct spatial representations where the set of semantic concepts is not restricted to a pre-specified label set but is effectively unbounded. Formally, each map element may be tagged with an open-vocabulary label or a high-dimensional embedding, supporting arbitrary zero-shot queries via similarity search, prompt-based segmentation, or LLM-driven reasoning (Jatavallabhula et al., 2023, Maggio et al., 7 Mar 2025, Popov et al., 13 Mar 2025, Sheppard et al., 15 Dec 2025, Yoo et al., 9 Dec 2025, Xie et al., 17 Jul 2025, Singh et al., 2024). This setting is motivated by fundamental limitations in closed-set mapping: robots cannot handle novel objects, long-tail classes, or rapidly changing environments without incurring costly retraining or manual annotation.
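The zero-shot query mechanism described above reduces, in its simplest form, to cosine similarity between per-element embeddings and the embedding of an arbitrary text prompt. The following is a minimal sketch (illustrative only; function names and thresholds are assumptions, not drawn from any cited system), where embeddings would in practice come from a model such as CLIP:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between each row of a (N, D) and a vector b (D,)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b)
    return a @ b

def query_map(element_embeddings, text_embedding, threshold=0.5):
    """Return indices of map elements matching an open-vocabulary query.

    element_embeddings: (N, D) array, one embedding per map element
    (voxel, surfel, Gaussian, or graph node).
    text_embedding: (D,) embedding of an arbitrary text prompt.
    """
    sims = cosine_sim(element_embeddings, text_embedding)
    return np.flatnonzero(sims >= threshold), sims

# Toy example: three map elements in a 4-D embedding space.
elements = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])   # stand-in for a text embedding
hits, sims = query_map(elements, query)   # elements 0 and 2 match
```

Because the query is resolved at run time against stored embeddings, no retraining is needed when the vocabulary of prompts changes.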
Key challenges in open-set semantic mapping include:
- Discovery and association of arbitrary novel categories: No prior enumeration of all possible object types is possible; new semantic identities must be instantiated adaptively (Maggio et al., 7 Mar 2025, Singh et al., 2024, Blum et al., 2022).
- Semantic label storage and retrieval: Map elements require explicit, mutable semantic tags or embeddings that support run-time extension (Yang et al., 3 Mar 2025, Sheppard et al., 15 Dec 2025, Günther et al., 3 Feb 2026, Loo et al., 2024).
- Efficient fusion of multi-view, multi-modal semantic cues: Fusion across 2D foundation model outputs or high-dimensional embeddings must be computationally tractable and avoid memory bloat (Sheppard et al., 15 Dec 2025, Yang et al., 3 Mar 2025, Singh et al., 2024).
- Uncertainty and outlier rejection: Spurious detections, open-set errors, or misalignments must be robustly handled via probabilistic or consensus mechanisms (Singh et al., 2024, Yang et al., 3 Mar 2025, Blum et al., 2022).
- Compatibility with downstream planning and reasoning: Generated maps must support explicit reasoning over unknown classes and object-level scene graphs for high-level task execution (Loo et al., 2024, Günther et al., 3 Feb 2026, Xie et al., 17 Jul 2025).
2. Fundamental Representations and Semantic Storage
Open-set semantic mapping systems employ a range of representations to encode geometry and semantics:
| Method/Framework | Map Element | Semantic Representation | Reference |
|---|---|---|---|
| ConceptFusion | point/surfel | CLIP/embedding vector | (Jatavallabhula et al., 2023) |
| OpenGS-SLAM, Bayesian Fields, OpenMonoGS-SLAM | 3D Gaussians | explicit open-vocab label or feature | (Yang et al., 3 Mar 2025, Maggio et al., 7 Mar 2025, Yoo et al., 9 Dec 2025) |
| SLIM-VDB | Voxel | Dirichlet (closed)/NIG (open) prior | (Sheppard et al., 15 Dec 2025) |
| RayFronts | Voxel + ray | language-aligned feature vector | (Alama et al., 9 Apr 2025) |
| Scene Graph Backed | Graph node/edge | CLIP/DINO feature, open label text | (Günther et al., 3 Feb 2026) |
| LOSS-SLAM, Open-Set Loop Closure | Sparse object node | DINO/MLP descriptor, uncertainty | (Singh et al., 2024, Singh et al., 2024) |
| OSG, osmAG-LLM | Graph node | open-vocab label, natural lang. desc | (Loo et al., 2024, Xie et al., 17 Jul 2025) |
Representations are grouped into dense volumetric/fusion (voxel, surfel, Gaussian splat) and sparse, relational (scene graph, object node) structures. Geometry is recovered via standard volumetric fusion, point-based mapping, or recent 3D Gaussian Splatting (3DGS) techniques (Yang et al., 3 Mar 2025, Maggio et al., 7 Mar 2025, Yoo et al., 9 Dec 2025). Semantic attributes are either explicit label IDs from open-vocabulary detectors and segmenters (e.g., YOLO-World, SAM), free-form language embeddings (e.g., CLIP, DINO), or probabilistic densities whose support grows at run time (Sheppard et al., 15 Dec 2025, Alama et al., 9 Apr 2025, Jatavallabhula et al., 2023).
3. Core Algorithmic Building Blocks
Common algorithmic modules across the open-set semantic mapping literature include:
- Semantic Extraction: 2D foundation models—SAM, YOLO-World, RAM, CLIP, DINO—produce from each RGB frame a set of object masks, label proposals, and/or feature embeddings (Yang et al., 3 Mar 2025, Yoo et al., 9 Dec 2025, Maggio et al., 7 Mar 2025). Masks, regions, or image crops are encoded as feature embeddings and compared to open-vocabulary text prompts via cosine similarity for zero-shot recognition (Jatavallabhula et al., 2023, Sheppard et al., 15 Dec 2025, Popov et al., 13 Mar 2025).
- Geometric Association: Masked pixels are reprojected into 3D using pose and depth or monocular estimates and fused with map elements through spatial proximity or voxelization (Yang et al., 3 Mar 2025, Yoo et al., 9 Dec 2025, Sheppard et al., 15 Dec 2025, Alama et al., 9 Apr 2025, Singh et al., 2024).
- Multi-View Semantic Fusion: Multi-view or multi-instance fusion is performed via weighted averaging (as in surfel fusion), Bayesian updating (SLIM-VDB, Bayesian Fields), or voting mechanisms (OpenGS-SLAM's Gaussian Voting Splatting) (Yang et al., 3 Mar 2025, Maggio et al., 7 Mar 2025, Sheppard et al., 15 Dec 2025). Confidence-based consensus aligns and propagates label assignments to maintain consistency under occlusions and view changes (Yang et al., 3 Mar 2025, Maggio et al., 7 Mar 2025).
- Open-Dictionary and Label Propagation: New label strings or embeddings introduced at any time are seamlessly appended to the semantic state, permitting unbounded vocabulary growth and per-element mutability (Sheppard et al., 15 Dec 2025, Yang et al., 3 Mar 2025, Yoo et al., 9 Dec 2025).
- Task-Driven/Clustered Object Extraction: Clustering of atomic map primitives into task-relevant objects utilizes data-driven or task-conditioned algorithms (e.g., Information Bottleneck, agglomerative clustering, community detection) and enables maps to reflect variable semantic granularity (Maggio et al., 7 Mar 2025, Nanwani et al., 2023).
- Efficient Storage and Query: Hierarchical data structures (OpenVDB for voxels, B+-trees, memory banks) and memory-efficient fusion mechanisms maintain computational and storage tractability even with high-dimensional semantic state (Sheppard et al., 15 Dec 2025, Yang et al., 3 Mar 2025, Yoo et al., 9 Dec 2025).
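Two of the fusion modules above admit a compact numerical sketch: running weighted-average fusion of per-frame embeddings (surfel-style), and Bayesian label fusion via pseudo-counts of a Dirichlet posterior whose support grows when a new label appears. This is an illustrative simplification, not the exact update used by SLIM-VDB or Bayesian Fields:

```python
import numpy as np

def fuse_embedding(mean_emb, weight, new_emb, new_weight=1.0):
    """Surfel-style fusion: incrementally merge a new per-frame embedding
    into a map element's running weighted average."""
    total = weight + new_weight
    mean_emb = (weight * mean_emb + new_weight * new_emb) / total
    return mean_emb, total

def dirichlet_update(alpha, label_id, weight=1.0):
    """Bayesian label fusion: add a pseudo-count to a Dirichlet posterior
    over the label set, extending the support if the label is new."""
    if label_id >= len(alpha):
        # Unit prior pseudo-counts for labels not seen before.
        alpha = np.concatenate([alpha, np.ones(label_id + 1 - len(alpha))])
    alpha[label_id] += weight
    return alpha

alpha = np.ones(2)                 # two labels known so far, uniform prior
alpha = dirichlet_update(alpha, 0)  # one view votes for label 0
alpha = dirichlet_update(alpha, 2)  # another view introduces label 2
posterior = alpha / alpha.sum()     # per-voxel label distribution
```

Growing the Dirichlet support at run time is what lets the per-element posterior cover an unbounded vocabulary without re-initializing the map.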
4. Scene Graphs, Relational Structure, and Symbolic Interfaces
Open-set semantic maps increasingly expose their internal state as explicitly structured, symbolic data—scene graphs—for downstream spatial reasoning:
- 3D Semantic Scene Graphs (3DSSGs): Serve as the live backend, fusing geometric, semantic, and relational data. Nodes represent objects, places, frames; edges encode adjacency, containment, or semantic relations (Günther et al., 3 Feb 2026, Loo et al., 2024, Xie et al., 17 Jul 2025). Open-vocabulary labels and high-dimensional features are maintained per node for extension.
- Incremental Refinement and Data Association: Each observation can yield new nodes, merges, or updated relations, with data association carried out via spatial proximity, IoU, and feature-space similarity (e.g., DINO, CLIP) and Bayesian models (Günther et al., 3 Feb 2026, Yang et al., 3 Mar 2025).
- Hierarchical and Layered Graphs: Multi-level graphs represent not just "things" but region, place, or abstraction layers (room, floor, building), linked by configurable edge types (is_near, contains, connects_to) (Loo et al., 2024).
- Semantic Uncertainty and Open-Set Graph Matching: Graph-based loop closure and object association explicitly incorporate feature uncertainty and support both underwater and terrestrial deployment (Singh et al., 2024).
- Symbolic/LLM Integration: High-level reasoning modules (e.g., LLM planners or VQA systems) interface directly with open set semantic graphs, enabling spatial question answering, navigation, and zero-shot object retrieval by reasoning over open-vocabulary entities (Loo et al., 2024, Xie et al., 17 Jul 2025, Popov et al., 13 Mar 2025).
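The data-association step for graph nodes (spatial proximity plus feature-space similarity) can be sketched as follows. The thresholds, box representation, and function names are illustrative assumptions, not the exact criteria of any cited system:

```python
import numpy as np

def iou_3d(box_a, box_b):
    """Axis-aligned 3-D IoU; each box is a (min_xyz, max_xyz) pair."""
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return inter / (vol_a + vol_b - inter)

def associate(detection, nodes, iou_thresh=0.25, feat_thresh=0.7):
    """Match a new detection to an existing scene graph node by spatial
    overlap and embedding similarity; None means 'instantiate new node'."""
    best, best_score = None, 0.0
    for i, node in enumerate(nodes):
        iou = iou_3d(detection["box"], node["box"])
        f_det = detection["feat"] / np.linalg.norm(detection["feat"])
        f_node = node["feat"] / np.linalg.norm(node["feat"])
        sim = float(f_det @ f_node)
        if iou >= iou_thresh and sim >= feat_thresh and iou * sim > best_score:
            best, best_score = i, iou * sim
    return best

nodes = [{"box": (np.zeros(3), np.ones(3)), "feat": np.array([1.0, 0.0])}]
det_match = {"box": (np.zeros(3), np.ones(3)), "feat": np.array([0.9, 0.1])}
det_new = {"box": (np.full(3, 5.0), np.full(3, 6.0)),
           "feat": np.array([1.0, 0.0])}
```

A detection that fails both gates spawns a new node, which is exactly how the graph's vocabulary of object instances grows incrementally.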
5. Quantitative Evaluation and Benchmarking
Open-set semantic mapping performance is quantitatively measured via:
- 3D Semantic Segmentation:
- Mean Accuracy (mAcc): $\mathrm{mAcc} = \frac{1}{K}\sum_{k=1}^{K}\frac{TP_k}{TP_k + FN_k}$ over the $K$ evaluated classes.
- Frequency-weighted mIoU: $\mathrm{fwIoU} = \sum_{k=1}^{K}\frac{n_k}{\sum_{j} n_j}\,\mathrm{IoU}_k$, where $n_k$ is the ground-truth count of class $k$.
- Open-set or zero-shot IoU, specifically evaluating ability to handle true novel classes (Yang et al., 3 Mar 2025, Popov et al., 13 Mar 2025, Jatavallabhula et al., 2023, Sheppard et al., 15 Dec 2025, Maggio et al., 7 Mar 2025, Alama et al., 9 Apr 2025).
- Scene Graph and QA Metrics:
- Precision/Recall over node/edge presence vs. ground truth graphs.
- VQA accuracy on binary, attribute, relational, and spatial queries via scene graph (Popov et al., 13 Mar 2025, Loo et al., 2024).
- Navigation Metrics:
- Success Rate (SR), Success-path-length (SPL), and average path length (APL) for task-driven object search and navigation using open-set maps (Xie et al., 17 Jul 2025, Loo et al., 2024).
- System Performance:
- Storage: total memory footprint (e.g., OpenGS-SLAM achieves 2x reduction over baselines; SLIM-VDB consumes 0.5–3.5 GB vs 27.7 GB) (Yang et al., 3 Mar 2025, Sheppard et al., 15 Dec 2025).
- Throughput: mapping/rendering FPS (e.g., OpenGS-SLAM >10x speedup; RayFronts: 8.84 Hz on embedded hardware) (Yang et al., 3 Mar 2025, Alama et al., 9 Apr 2025).
- Benchmark Datasets:
- OSMa-Bench: systematic indoor sequence, lighting, and semantic QA evaluation (Popov et al., 13 Mar 2025).
- Real-world/Virtual: Replica, ScanNet, TUM, HM3D (Yang et al., 3 Mar 2025, Yoo et al., 9 Dec 2025, Popov et al., 13 Mar 2025, Xie et al., 17 Jul 2025).
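The segmentation metrics above all derive from a class confusion matrix. A minimal reference implementation (standard metric definitions, not tied to any particular benchmark's evaluation code):

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute mAcc, mIoU, and frequency-weighted IoU from a KxK
    confusion matrix (rows = ground truth, cols = prediction)."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                 # per-class true positives
    gt = conf.sum(axis=1)              # ground-truth count per class
    pred = conf.sum(axis=0)            # predicted count per class
    acc = tp / np.maximum(gt, 1e-9)    # per-class recall
    iou = tp / np.maximum(gt + pred - tp, 1e-9)
    freq = gt / gt.sum()               # class frequency weights
    return {"mAcc": float(acc.mean()),
            "mIoU": float(iou.mean()),
            "fwIoU": float(freq @ iou)}

# Toy 2-class example: class 1 is confused with class 0 ten times.
conf = [[50, 0],
        [10, 40]]
m = segmentation_metrics(conf)
```

Open-set/zero-shot IoU is computed the same way, but with the confusion matrix restricted to classes held out from any training or prompt set.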
6. Empirical Results, Limitations, and Future Research
Empirical studies consistently show that open-set semantic mapping leads to:
- Superior generalization: Substantial gains in mIoU, object retrieval, and VQA accuracy for long-tail and previously unseen categories over closed-set baselines (e.g., >40% margin in 3D mIoU (Jatavallabhula et al., 2023); +13% mIoU (Yang et al., 3 Mar 2025)).
- Zero-shot and cross-modal spatial reasoning: Ability to localize, segment, and interact with novel and compositional queries (text, image, audio) (Jatavallabhula et al., 2023, Maggio et al., 7 Mar 2025, Xie et al., 17 Jul 2025).
- Efficient, scalable operation: State-of-the-art memory and runtime performance by exploiting sparse, hierarchical data structures and on-the-fly label extension (Sheppard et al., 15 Dec 2025, Yang et al., 3 Mar 2025, Alama et al., 9 Apr 2025).
- Integration with downstream symbolic reasoning: LLM-driven planners or QA systems leveraging open-set scene graphs support zero-shot spatial tasks (Loo et al., 2024, Xie et al., 17 Jul 2025, Popov et al., 13 Mar 2025, Günther et al., 3 Feb 2026).
Key limitations and failure modes include:
- Feature-space and prompt sensitivity: Performance hinges on the coverage and alignment of foundation model embedding spaces (e.g., CLIP, DINO) and may be affected by object occlusion, view angle, or prompt ambiguity (Maggio et al., 7 Mar 2025, Jatavallabhula et al., 2023).
- Clustering and segmentation issues: Over/under-segmentation, hyperparameter sensitivity, and spurious splitting may degrade object-level granularity (Maggio et al., 7 Mar 2025, Nanwani et al., 2023, Singh et al., 2024).
- Memory and compute trade-offs: Though substantial improvements are achieved, dense embedding maps remain memory intensive for large environments unless appropriately pruned or compressed (Sheppard et al., 15 Dec 2025, Alama et al., 9 Apr 2025).
- Lighting/Environmental robustness: Scene segmentation can be impacted by difficult photometric conditions or dynamic contents, motivating development of photometric-invariant and temporally consistent mapping approaches (Popov et al., 13 Mar 2025).
Future research directions highlighted include hybrid 2D–3D fusion architectures, explicit "unknown" class detection and quantification, scene graph-based planning, open-set object discovery and loop-closure, and end-to-end co-training of geometry and semantics under variable sensor and environmental conditions (Popov et al., 13 Mar 2025, Maggio et al., 7 Mar 2025, Loo et al., 2024, Günther et al., 3 Feb 2026, Yoo et al., 9 Dec 2025).
7. Principal Frameworks and Systematic Taxonomy
Major recent frameworks exemplifying state-of-the-art open set semantic mapping methodologies include:
- OpenGS-SLAM: Dense semantic SLAM based on 3D Gaussian Splatting, open-vocabulary 2D foundation model integration, explicit label voting and consensus, and segmentation pruning for high efficiency and accuracy (Yang et al., 3 Mar 2025).
- ConceptFusion: Pixel-aligned, multimodal zero-shot feature fusion with open-vocabulary querying for complex environments (Jatavallabhula et al., 2023).
- SLIM-VDB: Probabilistic Bayesian fusion over sparse volumetric OpenVDB grids with Dirichlet and Normal–Inverse-Gamma semantic priors, supporting unbounded label insertion (Sheppard et al., 15 Dec 2025).
- Bayesian Fields: Task-driven semantic mapping using probabilistic multi-view fusion and information bottleneck-based clustering to yield adaptive object granularity (Maggio et al., 7 Mar 2025).
- RayFronts: Cooperative in-range/out-of-range semantic mapping via fused voxels and "semantic ray frontiers", permitting dense local and exploratory global inference (Alama et al., 9 Apr 2025).
- Scene Graph Backed 3DSSG: Online, persistent 3D semantic scene graphs as the live backend for efficient symbolic reasoning and hierarchical place-object semantics (Günther et al., 3 Feb 2026).
- LOSS-SLAM and Open-Set Loop Closure: Lightweight, factor graph-based approaches for object-level SLAM, open-set data association, and uncertainty-aware object discovery (Singh et al., 2024, Singh et al., 2024).
- Open Scene Graphs (OSG) and osmAG-LLM: Topo-semantic graphs enabling integration with LLMs for open-world object-goal navigation and zero-shot spatial reasoning (Loo et al., 2024, Xie et al., 17 Jul 2025).
These systems illustrate the current convergence of geometric mapping, open vocabulary semantic fusion, real-time symbolic knowledge graphs, and LLM-enabled reasoning in scalable, robust open-set semantic mapping.