Zero-Shot Segmentation
- Zero-shot segmentation is a computer vision paradigm that assigns semantic labels to unseen classes by transferring information from labeled seen classes using textual embeddings.
- Core methodologies include generative feature synthesis, context-aware pipelines, and visual-semantic projection techniques, validated through metrics like harmonic mean mIoU.
- Advanced techniques such as bias correction and transductive learning enhance performance across diverse modalities, including 3D point clouds, meshes, and remote sensing imagery.
Zero-shot segmentation (ZSS) is a paradigm in computer vision where a segmentation model must correctly assign semantic labels—including those for categories never seen at training time—to each pixel (or element) of an input image, point cloud, or mesh. Unlike classical segmentation, which presupposes exhaustive pixel-level annotations for all classes, ZSS leverages auxiliary sources such as semantic word embeddings or textual class descriptions to facilitate knowledge transfer from labeled "seen" classes to "unseen" ones. Such settings are motivated by the practical impossibility of acquiring dense supervision for the expanding universe of real-world visual concepts.
1. Problem Formulation and Evaluation Protocols
ZSS partitions the set of labels into disjoint "seen" (S) and "unseen" (U) classes (S ∩ U = ∅), where pixel-level annotations are provided only for S during training. The model is required to predict, at inference, fine-grained segmentations covering S ∪ U. The most common evaluation regimes are:
- Generalized ZSS: The model is evaluated on its ability to segment both S and U simultaneously, with mean intersection-over-union (mIoU) computed separately for each subgroup and their harmonic mean reported as the headline measure. Because the harmonic mean is dominated by the lower of the two scores, it rewards balanced performance and prevents a model from scoring well by excelling only on seen classes (Liu et al., 2020, Bucher et al., 2019).
- Conventional (Vanilla) ZSS: Evaluation is restricted only to unseen classes U, reporting mean IoU or mAP over U (Liu et al., 2020, He et al., 2023).
- Inductive vs. Transductive Settings: In the inductive regime, only labeled S-class data is used at training. The transductive regime, in contrast, permits access to the pool of unlabeled images containing unseen classes during training, exploited via losses on the unlabeled target domain (Liu et al., 2020, Kim et al., 2023).
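Concretely, the metrics above can be computed from a class confusion matrix. The following is a minimal NumPy sketch; the partition of class indices into S and U is supplied by the caller:

```python
import numpy as np

def miou(conf, classes):
    """Mean IoU over a subset of classes.
    conf[i, j] = number of pixels with ground truth i predicted as j."""
    ious = []
    for c in classes:
        tp = conf[c, c]
        fp = conf[:, c].sum() - tp   # predicted c, ground truth elsewhere
        fn = conf[c, :].sum() - tp   # ground truth c, predicted elsewhere
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return float(np.mean(ious))

def harmonic_mean(miou_s, miou_u):
    """Headline GZSS metric: harmonic mean of seen/unseen mIoU."""
    if miou_s + miou_u == 0:
        return 0.0
    return 2 * miou_s * miou_u / (miou_s + miou_u)
```

Note how the harmonic mean collapses toward zero if either subgroup's mIoU is near zero, which is exactly why it is preferred over the arithmetic mean for generalized evaluation.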
2. Core Methodological Families
Several architectural strategies have been developed for ZSS, each addressing the core challenge of transfer across the semantic gap.
2.1. Generative Feature Synthesis
Pioneered by ZS3Net (Bucher et al., 2019), this approach learns a conditional feature generator G(z,a), mapping semantic (word) embeddings a and noise z to synthetic visual features in the backbone's space. These synthetic features for U are used to train the final classifier, and sometimes enhanced with self-training or graph-contextual modeling (Wang et al., 2021, Cheng et al., 2021). Key loss functions are maximum mean discrepancy (MMD) for matching real and fake distributions, and cross-entropy for label prediction.
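The distribution-matching term can be sketched with a standard RBF-kernel MMD estimator. This is an illustrative NumPy version; the kernel choice and bandwidth are assumptions, not ZS3Net's exact configuration:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # Pairwise RBF kernel between rows of x (n, d) and y (m, d).
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(real, fake, gamma=1.0):
    """Biased estimator of squared maximum mean discrepancy between
    real backbone features and generator outputs G(z, a)."""
    return (rbf_kernel(real, real, gamma).mean()
            + rbf_kernel(fake, fake, gamma).mean()
            - 2 * rbf_kernel(real, fake, gamma).mean())
```

Minimizing this term with respect to the generator pulls the synthetic feature distribution toward the real one, so a classifier trained on synthetic U-features transfers to real pixels.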
2.2. Context-aware and Spatially Structured Pipelines
Contextual modeling improves generalization by encoding visual or spatial cues unavailable from class embeddings alone. Methods such as CaGNet (Gu et al., 2020) inject pixel-wise multi-scale context (e.g., through dilated convolutions and contextual selectors) into the feature generator, resulting in more diverse and robust synthesized features for U. SIGN (Cheng et al., 2021) augments feature maps with relative positional encodings, enabling the architecture to incorporate spatial priors and better localize unseen categories, and introduces annealed self-training to assign adaptive confidence weights to pseudo-labels.
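The annealed self-training idea can be illustrated as a confidence-weighted pseudo-label filter whose acceptance threshold relaxes as training proceeds. The schedule below is a hypothetical sketch for intuition, not SIGN's exact formulation:

```python
import numpy as np

def pseudo_label_weights(probs, epoch, total_epochs, floor=0.5):
    """Per-pixel confidence weights for pseudo-labels on unlabeled data.
    probs: (H, W, C) softmax outputs for one image.
    The acceptance threshold anneals from 1.0 down to `floor`
    (a hypothetical schedule), admitting more pixels over time."""
    conf = probs.max(axis=-1)                        # top-1 confidence
    thresh = 1.0 - (1.0 - floor) * epoch / total_epochs
    return np.where(conf >= thresh, conf, 0.0)       # weight or reject
```

Early in training only near-certain pixels contribute to the self-training loss; later, lower-confidence pixels are admitted with proportionally smaller weights.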
2.3. Visual-Semantic Embedding Projections
Another family utilizes learnable projections from visual feature spaces to class embedding spaces (derived from text). Projections are typically realized through 1×1 convolutions or MLPs, and similarity-based scoring yields class probabilities. Standard architectures such as FCN/DeepLabv2 with visual-semantic projection heads have served as the backbone (Liu et al., 2020). To combat seen-class bias, transductive bias-alleviation losses are employed, e.g., enforcing the sum of U-class softmax probabilities to remain significant for pixels in unlabeled target images (Liu et al., 2020).
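A minimal sketch of such a projection head, using a linear map as a stand-in for a 1×1 convolution and cosine-similarity scoring against the class embeddings (the softmax temperature value is an assumption):

```python
import numpy as np

def project_and_score(feats, W, class_embs, temperature=0.07):
    """Project visual features into the class-embedding space and
    score every pixel against every class embedding.
    feats: (N, D_v) pixel features; W: (D_v, D_s) learned projection
    (a stand-in for a 1x1 convolution); class_embs: (C, D_s)."""
    proj = feats @ W
    proj = proj / np.linalg.norm(proj, axis=1, keepdims=True)
    embs = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    logits = proj @ embs.T / temperature       # cosine-similarity logits
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```

Because scoring is purely similarity-based, unseen classes need only a text-side embedding at inference: appending a new row to `class_embs` adds a class without retraining the visual backbone.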
2.4. Vision-Language Model Distillation and Prompt Engineering
Recent ZSS advances leverage pretrained vision-language models such as CLIP. Methods like ZegOT (Kim et al., 2023) utilize frozen CLIP encoders, in conjunction with trainable prompt vectors and optimal transport solvers (MPOT), to align text and image tokens for improved open-vocabulary segmentation. Pseudo-label self-training and prompt-hierarchy design are explored to expand coverage and discrimination. Furthermore, frameworks such as CLIP-ZSS (Chen et al., 2023) show that explicit distillation of CLIP's classification knowledge into dense segmentation backbones, via supervised and pseudo-supervised signals, can achieve high inductive ZSS performance.
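A common ingredient in CLIP-based pipelines is ensembling several prompt templates per class name before matching. The sketch below is illustrative: the template list is hypothetical and `encode_text` stands in for a frozen CLIP text encoder:

```python
import numpy as np

# Hypothetical prompt templates; real pipelines use larger curated sets.
TEMPLATES = ["a photo of a {}", "a cropped photo of a {}"]

def ensemble_text_embedding(class_name, encode_text):
    """Average (then renormalize) text embeddings over prompt templates.
    encode_text: any callable mapping a string to a 1-D feature vector,
    standing in here for a frozen CLIP text encoder."""
    embs = np.stack([encode_text(t.format(class_name)) for t in TEMPLATES])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)   # unit-norm class prototype
```

The resulting unit-norm prototype slots directly into a similarity-based scoring head, so adding templates changes only the text side of the pipeline.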
3. Bias Correction and Transductive Learning
A recurrent issue is the intrinsic bias toward seen categories—models tend to classify ambiguous or novel pixels as one of the classes present during training. Transductive bias-correction mechanisms, such as negative log-mass losses over unseen classes on unlabeled data, have been found effective (Liu et al., 2020). These approaches prefer regularizing probability mass toward the full set of U, rather than relying on potentially noisy pseudo-labels.
Transductive training proceeds as follows: for each minibatch, a segmentation loss over labeled source pixels is combined with a bias-alleviation term computed on target images. After several epochs, (optional) pseudo-labeling followed by further training can be used to refine predictions, provided confidence thresholds are set judiciously. Empirically, harmonic mean mIoU may increase from ~23% to >50% on PASCAL VOC splits when transductive rectification is applied (Liu et al., 2020).
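The bias-alleviation term described above can be sketched as a negative log-mass loss over the unseen classes on unlabeled target pixels. This is a minimal NumPy illustration of the idea, not the exact loss of Liu et al.:

```python
import numpy as np

def bias_alleviation_loss(probs, unseen_idx, eps=1e-8):
    """Negative log of the total probability mass assigned to unseen
    classes, averaged over pixels of an unlabeled target image.
    Minimizing it keeps U plausible without committing to any single
    (possibly wrong) pseudo-label.
    probs: (N, C) softmax outputs over all classes."""
    mass_u = probs[:, unseen_idx].sum(axis=1)
    return float(-np.log(mass_u + eps).mean())
```

In a transductive minibatch this term is added to the usual cross-entropy on labeled source pixels; it decreases as the model shifts probability mass toward U on target pixels, directly counteracting the seen-class bias.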
4. Architectural Variants Across Modalities
ZSS frameworks have also been extended to 3D modalities and remote-sensing imagery:
- 3D Point Cloud Segmentation: Approaches such as See More and Know More (Lu et al., 2023) and 3D-PointZshotS (Yang et al., 2025) integrate point cloud geometry with word vectors, aligning visual and semantic prototypes through contrastive or cross-attention objectives. Multi-modal fusions of LiDAR and image data with attention-driven gating further bridge modality-specific gaps.
- Mesh Segmentation: MeshSegmenter (Zhong et al., 2024) employs foundation models (SAM, GroundingDINO) over multi-view renderings and combines results with a voting scheme to project 2D predictions back onto the mesh, with optional stable diffusion–based texture augmentation.
- Remote Sensing: ZoRI (Huang et al., 2024) adds discriminative channel selection, backbone adaptation, and prototype cache retrieval tailored to domain shifts in aerial imagery, reporting strong results on remote-sensing benchmarks.
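The 2D-to-3D voting step used by mesh pipelines can be illustrated as confidence-weighted accumulation of per-view predictions onto faces. The vote schema below is an assumption for illustration, not MeshSegmenter's exact implementation:

```python
import numpy as np

def vote_face_labels(view_votes, num_faces, num_classes):
    """Fuse 2D predictions from multiple rendered views onto mesh faces.
    view_votes: iterable of (face_id, class_id, confidence) triples,
    one per visible face per view (a hypothetical schema).
    Returns the winning class id per face."""
    scores = np.zeros((num_faces, num_classes))
    for face, cls, conf in view_votes:
        scores[face, cls] += conf     # confidence-weighted vote
    return scores.argmax(axis=1)
```

Aggregating across views makes the final labeling robust to any single viewpoint in which a 2D segmenter mislabels or fails to see a face.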
5. Experimental Protocols and Benchmarks
Standard ZSS benchmarks are derived from semantic segmentation datasets (e.g., PASCAL VOC/Context, COCO-Stuff), partitioning classes into S and U according to fixed splits (e.g., 15/5 for VOC). Evaluation metrics typically include:
- mIoU_s: mean IoU over seen classes S.
- mIoU_u: mean IoU over unseen classes U.
- H: harmonic mean of mIoU_s and mIoU_u.
- mAP@0.5, Recall@100 (instance-level and proposal-based settings).
Transductive methods exploit both labeled and unlabeled images during training, whereas inductive frameworks rely only on labeled S data (Liu et al., 2020, Chen et al., 2023).
Empirically, bias alleviation, contextual modeling, and self-training have each been found to markedly improve unseen-class mIoU, and especially the balanced performance captured by H. Qualitative analysis corroborates that appropriate bias correction yields correct mask boundaries and class assignments for U where inductive or naive approaches fail.
6. Limitations, Open Problems, and Future Research
Several challenges remain unresolved:
- Semantic Gap: The embedding quality and relation between S and U in the word vector space remain critical. Some splits with weak S-U relations remain challenging (Liu et al., 2020).
- Assumptions of Data Access: Transductive approaches depend on access to unlabeled target-domain images, which may not always be practical in true zero-shot settings.
- Pseudo-labeling Risks: Overconfident or misaligned pseudo-labels may degrade model performance, particularly when S-U visual distribution drift is high.
- Embedding Choice: Richer language representations (e.g., contextual encoders such as BERT in place of static embeddings like GloVe) or compositional prompt design may further enhance transfer, but their optimal use in dense segmentation remains an open area.
- Open-set and Panoptic Extensions: Generalization to open-vocabulary panoptic segmentation or scalable hierarchical taxonomies is an ongoing direction (Huang et al., 2024, Ge et al., 2024).
7. Representative Methodological Table
| Method | Key Mechanism | Setting | Notable Result (VOC h) |
|---|---|---|---|
| ZS3Net (Bucher et al., 2019) | Feature generator (MMD) | Inductive/ST | ≈47.5 |
| CaGNet (Gu et al., 2020) | Context-aware feature generation | Inductive | ≈40–43 |
| Bias-Rectification (Liu et al., 2020) | Transductive bias loss | Transductive | ≈50–54 |
| ZegOT (Kim et al., 2023) | CLIP+prompt+OT, frozen encoder | Transductive/GZSS | 91.4 |
| CLIP-ZSS (Chen et al., 2023) | Distill CLIP global/local info | Inductive | 86.5 (VOC) |
All results are harmonic mean IoU (“h”) on PASCAL VOC, as reported in the referenced works.
ZSS stands as a principal testbed for scalable, flexible, and robust dense recognition in the era where semantic diversity outpaces supervised annotation, and continues to drive methodological innovation at the intersection of vision and language (Liu et al., 2020, Chen et al., 2023, Kim et al., 2023).