Open-Vocabulary Camouflaged Object Segmentation with Cascaded Vision Language Models

Published 24 Jun 2025 in cs.CV | (2506.19300v1)

Abstract: Open-Vocabulary Camouflaged Object Segmentation (OVCOS) seeks to segment and classify camouflaged objects from arbitrary categories, presenting unique challenges due to visual ambiguity and unseen categories. Recent approaches typically adopt a two-stage paradigm: first segmenting objects, then classifying the segmented regions using Vision-Language Models (VLMs). However, these methods (1) suffer from a domain gap caused by the mismatch between VLMs' full-image training and cropped-region inference, and (2) depend on generic segmentation models optimized for well-delineated objects, making them less effective for camouflaged objects. Without explicit guidance, generic segmentation models often overlook subtle boundaries, leading to imprecise segmentation. In this paper, we introduce a novel VLM-guided cascaded framework to address these issues in OVCOS. For segmentation, we leverage the Segment Anything Model (SAM), guided by the VLM. Our framework uses VLM-derived features as explicit prompts to SAM, effectively directing attention to camouflaged regions and significantly improving localization accuracy. For classification, we avoid the domain gap introduced by hard cropping. Instead, we treat the segmentation output as a soft spatial prior via the alpha channel, which retains the full image context while providing precise spatial guidance, leading to more accurate and context-aware classification of camouflaged objects. The same VLM is shared across both segmentation and classification to ensure efficiency and semantic consistency. Extensive experiments on both OVCOS and conventional camouflaged object segmentation benchmarks demonstrate the clear superiority of our method, highlighting the effectiveness of leveraging rich VLM semantics for both segmentation and classification of camouflaged objects.

Summary

The paper presents the COCUS framework that decouples segmentation and classification to better localize camouflaged objects.
It employs a modified SAM integrated with CLIP-derived prompts to enhance edge detection and segmentation precision in low-contrast conditions.
Experimental results demonstrate state-of-the-art performance, validating the cascaded, region-aware strategy across challenging benchmarks.

COCUS: Cascaded Open-Vocabulary Camouflaged Object Understanding Network

This paper introduces a two-stage framework, the Cascaded Open-vocabulary Camouflaged UnderStanding network (COCUS), for Open-Vocabulary Camouflaged Object Segmentation (OVCOS). The approach addresses the challenges of segmenting and classifying camouflaged objects from arbitrary categories, focusing on scenarios with visual ambiguity and unseen classes. The framework is guided by a Vision-Language Model (VLM): it pairs a modified Segment Anything Model (SAM) with CLIP to improve both segmentation and classification accuracy.

Background and Motivation

OVCOS presents unique difficulties due to the low contrast, indistinct boundaries, and high similarity between camouflaged objects and their backgrounds. Existing methods often rely on generic segmentation models that are not optimized for camouflaged objects, leading to imprecise localization. One-stage methods using VLMs for pixel-wise classification suffer from granularity mismatch, as VLMs are pre-trained for image-level understanding. Two-stage methods, which first segment and then classify, still face challenges in the segmentation stage due to the limitations of generic models.

Proposed Methodology

The COCUS framework addresses these issues by explicitly decoupling segmentation and classification into two cascaded stages. Figure 1 illustrates the overall architecture, highlighting the flow of information between the segmentation and classification stages.

Figure 1: Overview of the cascaded segment and classify framework.

Stage 1: VLM-Guided Segmentation

In the first stage, the framework employs an adapted SAM, guided by visual and textual embeddings from CLIP, to generate a class-agnostic segmentation mask. The CLIP-derived features are passed to SAM as prompts, which directs attention to camouflaged regions and improves localization accuracy. The mask decoder is enhanced with conditional multi-way attention and an edge-aware refinement module to improve boundary precision. Figure 2 shows the difference between using a generic segmentation model and visual-language prompted segmentation.

Figure 2: Comparison of mask-guided classification strategies.

Stage 2: Region-Aware Open-Vocabulary Classification

The second stage focuses on open-vocabulary classification using the segmentation output as a soft spatial prior. Instead of hard cropping the segmented regions, the framework treats the segmentation output as an alpha channel, retaining the full image context while providing precise spatial guidance. This approach mitigates the domain gap caused by the mismatch between VLMs' full-image training and cropped-region inference. The same VLM is shared across both stages to ensure efficiency and semantic consistency.
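The contrast between hard cropping and a soft spatial prior can be made concrete with a minimal NumPy sketch. The function names `hard_crop` and `attach_alpha`, the toy 8x8 image, and the 0.5 binarization threshold are all illustrative assumptions, not the authors' code. How the classifier actually consumes the extra channel is not described in this summary and is outside the sketch.

```python
import numpy as np

def hard_crop(image, mask, threshold=0.5):
    """Baseline: keep only the bounding box of the binarized mask (context is lost)."""
    ys, xs = np.nonzero(mask >= threshold)
    if ys.size == 0:
        return image  # no foreground found; fall back to the full image
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

def attach_alpha(image, mask):
    """Keep the full image and append the soft mask as a fourth (alpha) channel."""
    alpha = np.clip(mask, 0.0, 1.0)[..., None].astype(image.dtype)
    return np.concatenate([image, alpha], axis=-1)

# Toy data (hypothetical): an 8x8 RGB image and a soft mask over a 3x3 region.
image = np.random.default_rng(0).random((8, 8, 3)).astype(np.float32)
mask = np.zeros((8, 8), dtype=np.float32)
mask[2:5, 3:6] = 0.9

cropped = hard_crop(image, mask)   # small patch, surrounding scene discarded
rgba = attach_alpha(image, mask)   # full scene plus spatial guidance
print(cropped.shape, rgba.shape)
```

In the cropped variant the classifier sees an input unlike the full images it was trained on, whereas the RGBA variant leaves every RGB pixel untouched and only adds guidance about where to look.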

CLIP Fine-Tuning Pipeline

To enhance the sensitivity of CLIP to camouflaged objects, the framework fine-tunes CLIP using a multi-modal prompting strategy, jointly optimizing visual and textual prompts. Figure 3 illustrates this fine-tuning pipeline, detailing how textual prompts are appended to the language branch and visual prompts are injected into the vision branch.

Figure 3: Overview of the CLIP fine-tuning pipeline with textual and visual prompts.

This process enhances semantic alignment and task-specific adaptability, enabling region-aware classification without disrupting global semantics. The visual and textual embeddings are processed to compute similarity scores, which are then used to calculate a cross-entropy loss against ground-truth class labels.
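The similarity-then-cross-entropy step can be sketched as follows. This is a generic CLIP-style formulation under stated assumptions: the function name `clip_style_loss`, the use of cosine similarity, and the temperature of 0.01 are illustrative choices, and the paper's exact scaling may differ.

```python
import numpy as np

def clip_style_loss(image_emb, text_embs, label, temperature=0.01):
    """Cosine similarities between one image embedding and per-class text
    embeddings, followed by softmax cross-entropy against the true class."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (txt @ img) / temperature      # one similarity score per class
    logits = logits - logits.max()          # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[label]), np.exp(log_probs)

# Toy embeddings (hypothetical): two classes in a 2-D embedding space.
image_emb = np.array([1.0, 0.0])
text_embs = np.array([[1.0, 0.0],   # class 0, aligned with the image
                      [0.0, 1.0]])  # class 1, orthogonal to the image

loss_correct, probs = clip_style_loss(image_emb, text_embs, label=0)
loss_wrong, _ = clip_style_loss(image_emb, text_embs, label=1)
print(loss_correct, loss_wrong, probs)
```

The loss is near zero when the ground-truth class has the highest similarity and grows as the image embedding drifts away from the correct text embedding, which is the signal that drives the prompt updates during fine-tuning.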

Experimental Results

The authors conducted extensive experiments on both OVCOS and conventional camouflaged object segmentation benchmarks. On the OVCamo dataset, the COCUS framework achieved state-of-the-art performance, surpassing existing methods across multiple evaluation metrics, including $cS_m$, $cF_{\beta}^w$, $cMAE$, $cF_{\beta}$, $cE_m$, and $cIoU$. The adapted SAM also demonstrated strong performance on the conventional COS task, validating the effectiveness of the framework across both open- and closed-set camouflaged segmentation scenarios. Qualitative results (Figure 4) further illustrate the framework's ability to accurately delineate camouflaged objects with well-preserved shapes and precise boundaries, even in low-contrast and cluttered backgrounds.

Figure 4: Qualitative comparison between our method and CLIP-based baselines on OVCamo.
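The "c" prefix on the OVCamo metrics indicates that they score classification and segmentation jointly. The sketch below shows one way such a class-aware score might be computed for $cIoU$ and $cMAE$. It assumes the convention that a misclassified sample has its predicted mask zeroed before scoring. That convention, the function name `class_aware_iou_mae`, and the 0.5 threshold are assumptions for illustration and are not confirmed by this summary.

```python
import numpy as np

def class_aware_iou_mae(pred_mask, gt_mask, pred_class, gt_class, threshold=0.5):
    """Score one sample; a wrong class label zeroes the mask (assumed convention)."""
    pred = np.asarray(pred_mask, dtype=np.float64)
    gt = np.asarray(gt_mask, dtype=np.float64)
    if pred_class != gt_class:
        pred = np.zeros_like(pred)          # assumption: misclassification voids the mask
    mae = float(np.abs(pred - gt).mean())   # mean absolute error on soft values
    pred_bin = pred >= threshold
    gt_bin = gt >= 0.5
    union = np.logical_or(pred_bin, gt_bin).sum()
    inter = np.logical_and(pred_bin, gt_bin).sum()
    iou = float(inter / union) if union > 0 else 1.0
    return iou, mae

# Toy sample (hypothetical): a perfect mask, scored with a right and a wrong label.
gt = np.zeros((4, 4))
gt[1:3, 1:3] = 1.0
pred = gt.copy()

iou_ok, mae_ok = class_aware_iou_mae(pred, gt, "crab", "crab")
iou_bad, mae_bad = class_aware_iou_mae(pred, gt, "frog", "crab")
print(iou_ok, mae_ok, iou_bad, mae_bad)
```

Under this assumption a pixel-perfect mask still scores poorly when the class label is wrong, so the metrics reward methods that get both stages right.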

Ablation studies validated the effectiveness of the fine-tuned CLIP model and the contributions of the Conditional Multi-Way Attention (CMA) and Edge Enhancement (EDE) modules in the adapted mask decoder.

Discussion

The paper highlights the significance of leveraging rich VLM semantics for both segmentation and classification of camouflaged objects. By explicitly decoupling segmentation and classification and using prompt-based guidance, the COCUS framework achieves more accurate semantic interpretation of camouflaged objects. The region-aware classification strategy, which avoids hard cropping, is crucial for mitigating the domain gap and improving classification accuracy.

Conclusion

The COCUS framework presents a significant advancement in OVCOS by addressing the limitations of existing methods and achieving state-of-the-art performance on challenging benchmarks. The framework's cascaded design, VLM-guided segmentation, and region-aware classification strategy offer a robust solution for segmenting and classifying camouflaged objects from arbitrary categories. The results confirm the benefits of the two-stage framework and edge-aware enhancements in complex camouflage scenarios. Future research could explore the application of this framework to other open-vocabulary tasks and investigate the use of more advanced VLMs for further performance improvements.
