
Open-Vocabulary 3D Segmentation

Updated 9 February 2026
  • Open-vocabulary 3D segmentation is a novel approach that allows arbitrary text queries to segment complex 3D data, overcoming the fixed-class limitations of traditional methods.
  • It integrates multi-view feature transfer, geometry-based fusion, and vision–language models like CLIP and SAM to achieve detailed instance, part, and region segmentation.
  • Applications span AR/VR, robotics, simulation, and human parsing, with benchmarks showing significant improvements in mIoU and prompt inference efficiency.

Open-vocabulary 3D segmentation refers to methods that can segment 3D data—such as point clouds, meshes, or implicit representations—according to arbitrary user-specified natural-language queries, rather than limiting predictions to a fixed, pre-defined set of classes. The recent surge of research in this domain is driven by advances in vision–language models (VLMs), multi-view transfer pipelines, and 3D foundation representations. Together these enable unprecedented generalization and flexibility for 3D understanding tasks, affecting core areas in AR/VR, robotics, and simulation.

1. Problem Definition and Historical Context

Open-vocabulary 3D segmentation requires predicting masks or semantic labels for 3D entities (points, regions, instances, parts) based on arbitrary text prompts. Unlike classical closed-set segmentation, models here must generalize beyond fixed taxonomies—handling unseen classes, object parts, materials, and functional/affordance queries. The field originated from 2D open-vocabulary approaches (CLIP, ODISE, Grounded-SAM), but faces unique challenges:

  • Rich geometric variation and lack of large 3D-language datasets
  • Multi-view alignment, occlusion, and depth ambiguities
  • Need for instance, part, and region-level search—spanning hierarchies

Pioneering systems adapted 2D methods by projecting features or masks onto 3D (Takmaz et al., 2023), but suffered from poor instance separability and limited granularity. This motivated pipelines that explicitly fuse multi-view 2D vision–language priors with geometry-aware 3D learning.

2. Methodological Frameworks and Core Pipelines

Modern open-vocabulary 3D segmentation approaches fall into several dominant paradigms, differing by the underlying 3D representation, feature transfer protocol, and level of annotation required.

a) Multi-View Feature Transfer Pipelines

  • Render the 3D scene from multiple virtual camera views.
  • Use class-agnostic 2D segmentation (often SAM or Grounded-SAM) to propose segment masks per view.
  • Extract mask-level VLM features (e.g., CLIP or diffusion-model embeddings).
  • Unproject 2D masks (using depth and camera intrinsics/extrinsics) into the 3D domain.
  • Fuse multi-view masks/features via clustering, voting, or learned fusion operators (Takmaz et al., 2023, Nguyen et al., 2023, Boudjoghra et al., 2024, Takmaz et al., 2024).
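The unprojection step in this pipeline is a standard pinhole back-projection. The sketch below is a minimal illustration, assuming a per-pixel depth map and a camera-to-world matrix are available; the function name and array shapes are hypothetical, not any specific paper's API:

```python
import numpy as np

def unproject_mask(mask, depth, K, cam_to_world):
    """Lift the pixels of a 2D segment mask into 3D world coordinates.

    mask:         (H, W) boolean mask from a 2D segmenter (e.g., SAM).
    depth:        (H, W) per-pixel depth in metres.
    K:            (3, 3) camera intrinsics.
    cam_to_world: (4, 4) camera extrinsics (camera-to-world transform).
    Returns an (N, 3) array of world-space points for the masked pixels.
    """
    vs, us = np.nonzero(mask)                 # pixel rows (v) and columns (u)
    z = depth[vs, us]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # back-project through the pinhole model into camera space
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous
    pts_world = (cam_to_world @ pts_cam.T).T
    return pts_world[:, :3]
```

Fusing then amounts to accumulating these per-view point sets (with their mask-level VLM features) and merging them by clustering or voting.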

b) 3D Gaussian Splatting and Neural Scene Representations

  • Represent the scene as a set of explicit or structured 3D Gaussians, each carrying geometric, appearance, and language/instance embeddings.
  • Supervise 3D features with projected 2D VLM mask features using contrastive or InfoNCE-style losses.
  • For instance/part segmentation, cluster Gaussians in feature space or via codebook quantization (Piekenbrinck et al., 9 Jun 2025, Liang et al., 2024, Lu et al., 28 Mar 2025, Huang et al., 21 Oct 2025).
  • Querying proceeds by encoding a prompt or image into joint space and finding best-aligned 3D entities.
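The last two bullets can be sketched with an InfoNCE-style alignment loss and a cosine-similarity query. This is a minimal illustration assuming per-Gaussian language embeddings have already been distilled from projected 2D VLM mask features; all names and the threshold value are illustrative:

```python
import numpy as np

def info_nce(feats_3d, feats_2d, tau=0.07):
    """InfoNCE-style loss aligning 3D features with their projected 2D
    VLM targets; row i of feats_3d should match row i of feats_2d."""
    a = feats_3d / np.linalg.norm(feats_3d, axis=1, keepdims=True)
    b = feats_2d / np.linalg.norm(feats_2d, axis=1, keepdims=True)
    logits = a @ b.T / tau
    # cross-entropy against the diagonal (matching pairs)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def query_gaussians(gauss_feats, text_feat, threshold=0.25):
    """Select Gaussians whose language embedding aligns with a prompt.

    gauss_feats: (G, D) per-Gaussian language embeddings.
    text_feat:   (D,) prompt embedding from the VLM text encoder.
    Returns a boolean selection mask over the G Gaussians.
    """
    g = gauss_feats / np.linalg.norm(gauss_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    return (g @ t) >= threshold        # cosine similarity per Gaussian
```

In practice the threshold is often replaced by a softmax over several prompts or by clustering in the shared feature space.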

c) Mesh/Point-Cloud-Only and Direct 3D Segmentation

  • When only a point cloud or mesh is available (or in large urban scenes), perform multi-view, multi-granularity rendering for 2D VLM feature extraction, then aggregate and distill to a 3D backbone (Wang et al., 13 Sep 2025).
  • Use sample-balanced fusion, geometric clustering, or self-supervised distillation to bridge the domain gap.
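The aggregation step can be sketched as visibility-weighted averaging of per-view features onto points. This is a simplified stand-in for the sample-balanced fusion cited above, with hypothetical names and shapes:

```python
import numpy as np

def fuse_point_features(view_feats, view_vis):
    """Average per-view 2D VLM features onto 3D points.

    view_feats: (V, P, D) feature of each point as seen in each view
                (zeros where the point is not visible).
    view_vis:   (V, P) boolean visibility of each point per view.
    Returns (P, D) fused features; points seen in no view get zeros.
    """
    counts = view_vis.sum(axis=0)[:, None]                  # (P, 1)
    summed = (view_feats * view_vis[:, :, None]).sum(axis=0)
    # avoid division by zero for never-visible points
    return np.divide(summed, counts, out=np.zeros_like(summed),
                     where=counts > 0)
```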

d) Online/Tracking and Language-Reasoning Approaches

  • Employ 2D open-vocabulary detection or segmentation and online tracking to build instance proposals in mesh-free settings.
  • For proposal classification, replace the CLIP-style classification head with a multimodal LLM (MLLM) or similar module for improved compositional and functional query understanding (Zhou et al., 3 Dec 2025).

e) Diffusion Backbone and Mask-Based Distillation

  • Extract frozen features from text-to-image diffusion models (e.g., the Stable Diffusion U-Net) for saliency-aware 2D mask generation.
  • Fuse the resulting 2D and 3D masks and distill them into the 3D representation, improving tail-class and compositional concept retrieval (Zhu et al., 2024).

3. Key Model Components and Technical Foundations

The following table summarizes major algorithmic components recurrent across state-of-the-art open-vocabulary 3D segmentation methods.

| Component | Description / Equation | Representative Works |
|---|---|---|
| Multi-view rendering | Render V images of the P points/mesh, given camera calibration | Suzuki et al., 27 Feb 2025; Takmaz et al., 2023; Boudjoghra et al., 2024 |
| 2D mask proposal | SAM/Grounded-SAM over rendered images, yielding masks m_{i,j}^{2D} | Suzuki et al., 27 Feb 2025; Huang et al., 21 Oct 2025; Takmaz et al., 2024 |
| Feature lifting | f_{2D,p} = average of per-view features f_k for point p | Wang et al., 13 Sep 2025; Huang et al., 21 Oct 2025 |
| Fusion/mask scoring | Y = M·P (per-point score over masks and prompts) | Suzuki et al., 27 Feb 2025 |
| VLM encoding | CLIP: s = cos(f_v(x), f_t(c)); diffusion: per-mask CLIP head | Suzuki et al., 27 Feb 2025; Liang et al., 2024; Zhu et al., 2024 |
| Instance clustering | HDBSCAN, VQ-VAE quantization, or k-means over mask features | Piekenbrinck et al., 9 Jun 2025; Liang et al., 2024; Huang et al., 21 Oct 2025 |
| Prompt decoupling | Masks/features precomputed; only text encoder + fusion run at query time | Suzuki et al., 27 Feb 2025; Boudjoghra et al., 2024 |
| Mask proposal refinement | Multi-view consensus, iterative merging/removal, superpoint fusion | Jung et al., 30 Jul 2025; Nguyen et al., 2023; Zhou et al., 3 Dec 2025 |
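As one concrete instance of the instance-clustering component, a minimal k-means over fused mask features can assign instance IDs. The cited works also use HDBSCAN or VQ codebooks; this sketch is illustrative only:

```python
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    """Minimal k-means over per-mask (or per-Gaussian) features.

    feats: (M, D) fused features.  Returns (M,) cluster labels,
    interpretable as instance IDs.
    """
    rng = np.random.default_rng(seed)
    # initialize centers from k distinct samples
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        # assign each feature to its nearest center
        d = np.linalg.norm(feats[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers as cluster means
        for j in range(k):
            if np.any(labels == j):
                centers[j] = feats[labels == j].mean(axis=0)
    return labels
```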

4. Benchmarks, Metrics, and Experimental Insights

Rigorous evaluation requires datasets with aligned 3D scans, RGB-D video, and semantic/instance/part annotations.

Metrics include mIoU, mAcc, mean AP (at multiple IoU thresholds), the harmonic mean over base/novel class splits, and prompt-specific region accuracy.

5. Architectural Innovations and Model Variants

Prompt Decoupling and Inference Efficiency: Mask and feature extraction are precomputed, with language prompt handling (text encoder + fusion) deferred to query time. This yields empirical per-prompt inference speeds of ∼1 s/scene (Suzuki et al., 27 Feb 2025) and 20×–70× speedups over traditional SAM+CLIP pipelines (Boudjoghra et al., 2024).
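The decoupling can be sketched as an offline stage that caches mask membership and mask features once per scene, and an online stage that runs only the text encoder plus two matrix products. Here `text_encoder` stands in for a VLM text encoder such as CLIP's; the class and its names are a hypothetical simplification:

```python
import numpy as np

class DecoupledQuerier:
    """Prompt decoupling: 3D mask proposals and their VLM features are
    precomputed offline; query time costs one text encoding + matmuls."""

    def __init__(self, point_masks, mask_feats, text_encoder):
        self.M = point_masks          # (N, K) point-in-mask membership, offline
        self.F = mask_feats / np.linalg.norm(
            mask_feats, axis=1, keepdims=True)  # (K, D) mask features, offline
        self.encode = text_encoder

    def query(self, prompt):
        t = self.encode(prompt)       # only per-query computation
        t = t / np.linalg.norm(t)
        mask_scores = self.F @ t      # (K,) cosine similarity per mask
        return self.M @ mask_scores   # (N,) per-point relevance score
```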

Context-aware Mask Embeddings: OpenInsGaussian fuses local CLIP crop and masked average-pooling of encoder spatial maps, then aggregates multi-view features by similarity-attention. Combined context/local fusion achieves up to +10 mIoU over prior solutions (Huang et al., 21 Oct 2025).

Collaborative Multi-Field Supervision: COS3D maintains two collaborative Gaussian fields—an instance field (discriminative) and a language field (prompt-aligned)—with a learned mapping and adaptive query-time merging, outperforming both pure-language and pure-segmentation variants (Zhu et al., 23 Oct 2025).

Hierarchical and Part-level Search: Search3D builds a multi-level 3D scene graph (objects, parts) with SigLIP-aligned features, enabling queries at arbitrary granularity including materials and compositional attributes (Takmaz et al., 2024).

Diffusion Backbone Features: Diff2Scene leverages frozen features from text-to-image diffusion models (e.g., Stable Diffusion U-Net) for salient-aware 2D mask generation and 3D/2D mask fusion, enabling improved tail-class and compositional concept retrieval (Zhu et al., 2024).

Online 2D–3D Instance Tracking: OpenTrack3D uses an online tracker that unifies appearance (DINO) and voxelized geometric cues, allowing generalization to mesh-free and unstructured environments. Classification is enhanced with a multi-modal LLM head for complex/natural-language prompts (Zhou et al., 3 Dec 2025).

6. Limitations, Open Challenges, and Future Directions

Despite the rapid progress, current open-vocabulary 3D segmentation frameworks have the following limitations:

  • Dependency on 2D foundation models: Performance is bottlenecked by the recall and mask quality of 2D segmenters (e.g., Grounded-SAM, SAM); over- and under-segmentation errors propagate to 3D (Piekenbrinck et al., 9 Jun 2025, Takmaz et al., 2024).
  • Computation and memory: Gaussian splatting and NeRF-based methods, though yielding sharp masks, incur significant per-scene computation (20–45 min typical), although prompt inference is fast (Piekenbrinck et al., 9 Jun 2025, Lu et al., 28 Mar 2025).
  • Small-object and fine-detail segmentation: Consensus-based 3D proposal strategies may miss small or highly occluded entities; merged proposals can harm fragmented instances (Jung et al., 30 Jul 2025).
  • Hierarchical limitations: Existing part-level search is typically restricted to objects or coarse parts; further work is needed for fine or arbitrary hierarchy support (Takmaz et al., 2024).
  • Generalization: Domain shifts, mesh quality, unaligned scans, and depth noise affect performance, especially in urban or in-the-wild settings (Wang et al., 13 Sep 2025, Xu et al., 2024).

Prospective research directions include:

  • Joint fine-tuning for cross-modal 2D–3D consistency
  • Adaptive and uncertainty-aware label fusion
  • Contextualized and region-dependent prompt understanding using MLLMs
  • Real-time or incremental frame streaming for robotic deployment
  • Scalable annotation (e.g., via synthetic RGB-D data or 3D mask–caption pipelines (Lee et al., 4 Feb 2025))
  • End-to-end learning of hierarchical, discrete part ontologies for deep 3D search

7. Specialized Applications and Dataset Resources

3D Human Semantic Part Segmentation: The first open-vocabulary framework for semantic part segmentation of 3D humans leverages SAM for proposal generation, HumanCLIP for visual–text embedding, and a MaskFusion module for efficient mask–prompt fusion. This yields state-of-the-art accuracy (mean IoU: 69.3% across five datasets) with robust generalization to arbitrary user-provided part queries and multiple 3D representations, including point clouds, meshes, and Gaussian splats (Suzuki et al., 27 Feb 2025).

Large-scale Data and Foundation Models: Mosaic3D introduces a 5.6 M mask–caption corpus generated with RAM++ / Grounded-DINO / SAM / Osprey, enabling efficient training of a sparse-conv UNet-based 3D encoder and mask decoder, with extensive validation in both semantic and instance open-vocabulary segmentation (Lee et al., 4 Feb 2025).

Urban-scale Annotation-Free Systems: OpenUrban3D introduces a pipeline that dispenses with aligned RGB-D imagery or mesh, relying on multi-view, multi-granularity 2D projections for VLM-based mask feature extraction and sample-balanced fusion, attaining mIoU up to 75.4% on SUM and 39.6% on SensatUrban in fully zero-shot settings (Wang et al., 13 Sep 2025).


These innovations collectively establish open-vocabulary 3D segmentation as a central component of next-generation 3D scene understanding, with substantial progress in both model quality and practical applicability across indoor, outdoor, synthetic, and human-centered tasks. Continued integration of richer 2D–3D joint supervision, prompt flexibility, and geometric robustness will further expand the field's capabilities and deployment domains.

References (16)
