ConverSeg-Net: Conversational Segmentation

Updated 18 February 2026

The paper introduces ConverSeg-Net, a novel single-pass architecture that fuses strong class-agnostic segmentation priors with grounded language understanding to generate precise binary masks.
It employs a frozen vision encoder alongside a Qwen-based language encoder with LoRA fine-tuning, effectively integrating visual and textual features via sparse and dense adapters.
The design demonstrates robust performance on both literal and abstract segmentation tasks, with mixed training strategies and ablation studies highlighting the critical role of cross-modal fusion.

ConverSeg-Net is a single-pass, conversational image-to-mask architecture designed for Conversational Image Segmentation (CIS), a task that grounds abstract, intent-driven concepts expressed in natural language into spatially accurate binary masks. Developed in the context of the ConverSeg benchmark—which spans entities, spatial relations, affordances, functions, safety, and physical reasoning—ConverSeg-Net fuses strong, class-agnostic segmentation priors with grounded language understanding to directly address both concrete and high-level, functional queries (Sahoo et al., 13 Feb 2026).

1. Architectural Overview and Motivation

ConverSeg-Net operates on two principal inputs: an RGB image $I$ (typically resized such that its long side is 1,024 pixels) and a natural-language prompt $p$ (e.g., "surfaces suitable for hot cookware"). Its output is a binary mask $M^* \in [0,1]^{H \times W}$ highlighting those pixels in $I$ that satisfy the prompt. The architecture is designed to ground both explicit referring expressions and abstract, intent-driven queries by leveraging segmentation priors and vision-language integration in a single forward pass. This approach distinguishes itself from prior work, which primarily focuses on categorical and spatial queries while omitting higher-level reasoning about function and safety.

2. Core Components and Data Flow

ConverSeg-Net is composed of four main modules: a frozen vision backbone, a language encoder with cross-modal capacity, lightweight fusion adapters, and a segmentation decoder. The data flow is summarized in the following pseudocode, which schematically details the transformations and data shapes at each stage:

z_img = SAM2_image_encoder(I)                # → [H'×W'×D_img]
h_all = Qwen_prompt_encoder(I, p)            # → [T+V]×D_t, keep only text tokens
{h1,…,hT}, h_EOS = select_text_states(h_all) # each ∈ ℝ^{D_t}

e_sparse = Linear(h1…hT)                     # → [T×D_dec]
e_dense  = MLP(h_EOS)                        # → [D_dec]
e_dense_map = broadcast(e_dense, H', W')     # → [D_dec×H'×W']

img_tokens = flatten(z_img)                  # → [H'W'×D_img]

z_out = SAM2_mask_decoder(img_tokens, e_dense_map, e_sparse) # → [H'×W'×D_dec] then upsample
M* = σ(Upsample(z_out) @ Conv1×1)             # → [H×W]

return M*

Vision Backbone (Image Encoder)

The vision backbone is based on the SAM2 MAE-pretrained Vision Transformer (ViT) adapted to high-resolution inputs. For the Hiera-L SAM2 configuration, this comprises a frozen image encoder $E_{img}$ that outputs a feature map $z_{img} = E_{img}(I) \in \mathbb{R}^{H' \times W' \times D_{img}}$ with $D_{img}=1,024$ and $H', W' \approx 1/16$ of the input spatial dimensions. All weights in this module remain fixed during training.

Language Encoder (Prompt Encoder)

ConverSeg-Net employs Qwen2.5-VL-3B, a 3 billion parameter transformer, that jointly attends to both the image and text prompt via cross-modal transformer layers. Only the text-token hidden states $\{h_1,\ldots,h_T\}$ and the EOS hidden state $h_{EOS} \in \mathbb{R}^{D_t}$ , with $D_t=2,048$ , are extracted for further processing. To facilitate adaptation to the segmentation task, Qwen is fine-tuned via LoRA applied to the Q/K/V projections (rank $r=16$ , scaling $\alpha=32$ ), while the original pre-trained weights remain unchanged.

Two lightweight adapters bridge Qwen’s text space to the SAM2 mask decoder’s prompt space ( $D_{dec}=1,024$ ):

Sparse Adapter: Projects sequence of text hidden states to segmentation prompt space via a linear layer with LayerNorm:

$e_{sparse} = \text{Linear}_{D_t \rightarrow D_{dec}}(\text{LayerNorm}([h_1; \ldots; h_T]))$

Dense Adapter: Maps the EOS hidden state through a 2-layer MLP (hidden size 2,048, SiLU activation, LayerNorm), broadcasting it spatially as a bias map:

$e_{dense} = \text{MLP}_{D_t \rightarrow D_{dec}}(\text{LayerNorm}(h_{EOS}))$

The resulting $e_{dense}$ is broadcast to $[D_{dec} \times H' \times W']$ .

Segmentation Decoder

The SAM2 Hiera-L mask decoder, fully fine-tuned, comprises two stacked transformer blocks with bidirectional cross-attention between (i) flattened image feature tokens and (ii) prompt tokens ( $e_{sparse}$ ) plus the dense bias map ( $e_{dense\_map}$ ). The cross-attention follows standard multi-head attention schemes, with subsequent learned upsampling to $H \times W$ spatial resolution. A final $1 \times 1$ convolution maps the upsampled feature map to a probability mask $M^* = \sigma(\text{Conv1x1}(\text{Upsample}(z_{final})))$ . No auxiliary classification or bounding-box heads are employed; the decoder is trained to directly predict the mask.

3. Key Hyperparameters and Module Configurations

ConverSeg-Net’s principal architectural and training hyperparameters are summarized below.

Module	Architecture	Parameters
Vision Encoder	ViT-Base (12 layers, patch 16), MAE	$D_{img}=768/1,024$
Prompt Encoder	Qwen2.5-VL-3B, 32 layers, 32 heads	$D_t=2,048$
Sparse Adapter	Linear + LayerNorm ( $2,048 \rightarrow 1,024$ )
Dense Adapter	2-layer MLP (SiLU, $2,048 \rightarrow 1,024$ )
Mask Decoder	2 transformer blocks, 8 heads each, $d_k=128$	$D_{dec}=1,024$
Activations	SiLU (adapters), GELU (transformers)
LoRA	on Qwen Q/K/V: $r=16, \alpha=32$
Normalization	LayerNorm before adapters, pre-norm in transformers

This table presents only factual details as given in (Sahoo et al., 13 Feb 2026).

4. Training Objectives and Loss Functions

The training of ConverSeg-Net leverages supervision from binary ground-truth masks, combining two losses:

$L = L_{\mathrm{BCE}}(M_{gt}, M^*) + \lambda\,L_{\mathrm{Dice}}(M_{gt}, M^*),\quad \lambda=0.25$

Binary Cross-Entropy (BCE):

$L_{\mathrm{BCE}} = -\frac{1}{HW} \sum_{i,j}\Big[ M_{ij}\log M^*_{ij} + (1-M_{ij})\log(1-M^*_{ij})\Big]$

Dice Loss:

$L_{\mathrm{Dice}} = 1 - \frac{2 \sum M \odot M^* + \epsilon}{\sum M + \sum M^* + \epsilon}$

The hybrid loss functions guide the model to optimize both per-pixel accuracy and overall spatial overlap, which is critical for producing pixel-accurate masks in both literal and abstract segmentation queries.

5. Architectural Variants and Ablation Findings

Empirical analysis of model design choices is provided via multiple architectural variants:

Prompt Encoder Size: Upgrading from Qwen2.5-3B to Qwen2.5-7B provides a 1.6 percentage point (pp) gain in ConverSeg generalized IoU.
LoRA Ablation: Freezing the prompt encoder (removing LoRA adaptation) incurs a 19 pp drop in gIoU, underscoring the necessity of adapting Qwen parameters to segmentation.
Text-Only Input: Providing only the prompt (omitting the image) to Qwen results in a –17.9 pp drop, demonstrating the essential role of visual context within the prompt encoder.
Sparse vs. Dense Adapters: Excluding the dense (EOS) adapter costs 0.1 pp; removing sparse token adapters yields greater performance degradation.
Training Curriculum:
- Training solely on literal concepts (COCO/RefCOCO) yields strong literal referring accuracy but subpar performance on conversational concepts (56% gIoU on ConverSeg).
- Training only on conversational prompts overfits and harms literal referring ability.
- A two-phase mixed curriculum, where phase 1 uses literal grounding and phase 2 interleaves both literal (groups 1–3) and abstract (group 4) prompts, achieves the best joint result (74.5% RefCOCO, 67.4% ConverSeg).

These findings reveal the importance of both model components (esp. LoRA and vision-language fusion) and training regimens in achieving state-of-the-art CIS performance.

6. Design Significance and Broader Implications

ConverSeg-Net’s design enables efficient and direct grounding of both concrete referring expressions and abstract conversational queries. By leveraging segmentation mask priors from SAM2, grounded language representations from Qwen2.5-VL, and minimal, well-placed adapters, the architecture avoids extensive re-engineering, remains scalable, and supports robust, single-pass conversational image segmentation. The ablation studies confirm the necessity of integrating cross-modal cues at the earliest stage (prompt encoder) and the benefit of mixing literal and abstract training exemplars.

A plausible implication is that this modular, adapter-based fusion could inspire further research in multi-modal segmentation and grounding tasks where adaptability to new forms of abstract language is required. Furthermore, the model’s successful fusion strategy underscores the advantage of building on strong segmentation and language priors—each pre-trained independently but minimally bridged—to maximize both data efficiency and downstream performance (Sahoo et al., 13 Feb 2026).

Markdown Report Issue Upgrade to Chat

References (1)

Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to ConverSeg-Net Architecture.