
SimpleSeg: Decoder-Free Segmentation

Updated 3 February 2026
  • The paper introduces SimpleSeg, a decoder-free method that reformulates pixel-level segmentation as a sequence generation problem using normalized 2D point coordinates.
  • SimpleSeg leverages standard autoregressive transformers along with a two-stage training pipeline, combining supervised fine-tuning and reinforcement learning with an IoU-based reward for geometric precision.
  • Experimental results on RefCOCO benchmarks demonstrate that SimpleSeg achieves competitive cIoU and [email protected] scores, highlighting the latent spatial capabilities of multimodal large language models.

SimpleSeg is a decoder-free method that formulates pixel-level segmentation within Multimodal LLMs (MLLMs) as a direct sequence generation problem. By emitting sequences of normalized 2D point coordinates to delineate object boundaries, SimpleSeg achieves accurate semantic segmentation entirely in the language modeling regime, with no architectural modifications or additional decoder modules. The approach leverages standard autoregressive transformers and exploits the intrinsic spatial capabilities of MLLMs through a two-stage training pipeline involving supervised fine-tuning and reinforcement learning with an intersection-over-union (IoU)-based geometric reward (Song et al., 27 Jan 2026).

1. Sequence-Based Segmentation Formulation

SimpleSeg reframes segmentation by asking the MLLM to generate, in text form, a variable-length sequence of 2D points that describe object boundaries. Given an input image $x$ and a grounding query (such as a referring expression or a seed point), the model predicts

$\texttt{[ [x_1, y_1], [x_2, y_2], \dots, [x_V, y_V] ]}$

where each $(x_i, y_i)$ is normalized to lie in $[0, 1]$ by the image width $W$ and height $H$. This structure is enforced via a minimal JSON-like grammar:

$[\,[x_1, y_1], [x_2, y_2], \ldots, [x_V, y_V]\,]$

Tokens are standard UTF-8 characters, so the output remains within the model's text space. Polygonal contours are extracted using algorithms such as Suzuki–Abe, enforced to traverse clockwise, and may be sparsified by a tolerance parameter $\epsilon$ to balance boundary fidelity against sequence length. The decoded polygon is then filled to yield a segmentation mask.
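The sparsification step can be illustrated with the standard Ramer–Douglas–Peucker simplifier, the usual $\epsilon$-tolerance polygon simplification; the paper does not name its exact algorithm, so this is an illustrative sketch:

```python
import math

def rdp(points, eps):
    """Ramer-Douglas-Peucker: drop vertices whose deviation from the
    chord between the endpoints is below eps, trading boundary
    fidelity against sequence length."""
    if len(points) < 3:
        return list(points)
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy) or 1.0
    # Find the interior point farthest (perpendicularly) from the chord.
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        px, py = points[i]
        d = abs(dy * (px - x0) - dx * (py - y0)) / norm
        if d > dmax:
            dmax, idx = d, i
    if dmax <= eps:
        # All interior points are within tolerance: keep only endpoints.
        return [points[0], points[-1]]
    # Otherwise split at the farthest point and recurse on both halves.
    left = rdp(points[:idx + 1], eps)
    right = rdp(points[idx:], eps)
    return left[:-1] + right
```

Raising `eps` removes near-collinear vertices (a nearly flat bump collapses to its endpoints) while genuinely sharp corners survive, which is the fidelity-versus-length trade-off the text describes.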

2. Model Architecture and Input/Output Interface

SimpleSeg operates on unmodified MLLMs, as demonstrated on exemplars such as Qwen2.5-VL-7B and Kimi-VL (a 2.8B parameter Mixture-of-Experts model). No new parameters or task-specific decoder heads are introduced; segmentation is handled entirely by adjusting the autoregressive text output policy of the backbone model. Input queries are handled through a family of "conversion tasks" over the 4-tuple $[\texttt{text}, \texttt{point}, \texttt{bbox}, \texttt{mask}]$, with prompts such as:

  • “Q: What is the bounding box of <text>? A: <bbox>.”
  • “Q: Give the polygon of the object at <point>. A: <mask>.”

During inference, image patch embeddings and the grounding query are prepended, and the model emits the coordinate sequence as output. Tokens are parsed into polygons, with sequence termination enforced by the final closing bracket. The number of points $V$ adapts to object complexity and is controlled by the sparsification tolerance $\epsilon$. Invalid token outputs (e.g., broken formatting) yield a zero "format reward" during RL-based training.
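The parsing and validity check can be sketched as follows, assuming the emitted sequence is literally JSON; the function names are illustrative, not from the paper:

```python
import json

def parse_polygon(text):
    """Parse a model output like '[[0.1, 0.2], [0.3, 0.4], ...]' into a
    list of (x, y) vertices; return None on any format violation."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list) or len(data) < 3:
        return None  # a polygon needs at least three vertices
    poly = []
    for pt in data:
        if (not isinstance(pt, list) or len(pt) != 2
                or not all(isinstance(v, (int, float)) for v in pt)):
            return None
        x, y = pt
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            return None  # coordinates must be normalized to [0, 1]
        poly.append((float(x), float(y)))
    return poly

def format_reward(text):
    """1 if the decoded polygon is structurally valid, 0 otherwise,
    mirroring the binary format reward used in the RL stage."""
    return 1 if parse_polygon(text) is not None else 0
```

A truncated sequence (missing the final closing bracket) fails `json.loads` and scores zero, which is why the text can treat the closing bracket as the termination signal.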

3. Supervised Fine-tuning and Reinforcement Learning Pipeline

The core training pipeline consists of two sequential stages:

Stage I: Supervised Fine-Tuning (SFT)

  • Training pairs consist of (image + query) inputs and ground-truth polygons extracted from datasets such as RefCOCO, RefCOCO+, RefCOCOg, and RefCLEF (≈800K samples).
  • Optimization target is standard token-level cross-entropy:

$\mathcal{L}_{\mathrm{SFT}} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)$

  • SFT teaches the grammar, object grounding, polygon closure, and traversal order.
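The SFT objective can be checked numerically with a toy example; the per-token probabilities below are made up for illustration:

```python
import math

def sft_loss(token_probs):
    """Token-level cross-entropy: L = -sum_t log p_theta(y_t | y_<t, x),
    given the model's probability assigned to each ground-truth token."""
    return -sum(math.log(p) for p in token_probs)

# Toy sequence of four target tokens and the probabilities the model
# assigns to them; confident predictions drive the loss toward zero.
loss = sft_loss([0.9, 0.8, 0.95, 0.7])
```

Each coordinate digit, comma, and bracket is an ordinary token here, which is what lets an unmodified language-model head learn the polygon grammar.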

Stage II: Reinforcement Learning (RL) with GSPO

  • Because many distinct polygon sequences decode to the same mask, polygon outputs are many-to-one with respect to masks; RL is therefore used to optimize geometry-aware metrics directly.
  • The policy $\pi_\theta(y \mid x)$ is fine-tuned from SFT parameters.
  • The total reward $R(y)$ includes:

    1. Mask IoU reward:

      $R_{\mathrm{IoU}} = \dfrac{|M(y) \cap M_{\mathrm{gt}}|}{|M(y) \cup M_{\mathrm{gt}}|}$

      truncated to $[0, 0.1]$, and zero if below a threshold $\tau$;

    2. Centroid distance reward (normalized MSE);
    3. Format reward (1 if the decoded polygon is valid JSON, 0 otherwise).
  • The RL objective with KL penalty (to prevent policy drift) is:

$J_{\mathrm{RL}}(\theta) = \mathbb{E}_{y \sim \pi_\theta}[R(y)] - \lambda_{\mathrm{KL}}\, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$

$\nabla_\theta J_{\mathrm{RL}} = \mathbb{E}_{y \sim \pi_\theta}\big[(R(y) - b)\, \nabla_\theta \log \pi_\theta(y \mid x)\big] - \lambda_{\mathrm{KL}}\, \nabla_\theta\, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$

where $b$ is a baseline that reduces the variance of the gradient estimate.

  • The Group Sequence Policy Optimization (GSPO) algorithm, a sequence-level variant of PPO-style policy optimization, is employed.
  • Empirically, the RL stage yields an increase of 9–10 gIoU points over SFT alone.
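The mask-IoU component of the reward can be sketched on rasterized masks as below; the threshold value and the exact interaction between thresholding and truncation are assumptions, since the text does not fully specify them:

```python
def mask_iou(pred, gt):
    """IoU of two binary masks, each given as a set of (row, col) pixels."""
    inter = len(pred & gt)
    union = len(pred | gt)
    return inter / union if union else 0.0

def iou_reward(pred, gt, tau=0.5, cap=0.1):
    """IoU-based reward as described in the text: zero if the IoU falls
    below a threshold tau, otherwise truncated to [0, cap].
    tau=0.5 is an assumed value, not taken from the paper."""
    iou = mask_iou(pred, gt)
    return 0.0 if iou < tau else min(iou, cap)
```

In the full reward this term is combined with the centroid-distance and format terms; their relative weights are not given in this summary, so they are omitted here.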

RL Training Pseudocode

initialize θ ← θ_SFT
for epoch in 1…N_rl:
  for batch in sample_batches(data):
    # Sample K trajectories from the current policy
    y_i ← π_θ(·|x) for i in 1…K
    # Compute the reward for each trajectory
    compute R_i for each y_i
    # Compute the policy gradient and update
    ∇θ J_RL ← policy_gradient(R_i, π_θ)
    θ ← θ + optimizer_step(∇θ J_RL)
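The loop above can be made concrete with a toy REINFORCE update using a group-mean baseline, in the spirit of GSPO's group-relative advantage; a three-action bandit stands in for sequence generation, and all numbers here are illustrative:

```python
import math
import random

random.seed(0)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Toy "policy": a categorical distribution over three candidate outputs,
# parameterized by logits; reward_of stands in for R(y).
logits = [0.0, 0.0, 0.0]
reward_of = [0.2, 0.9, 0.5]
lr, K = 0.2, 8  # learning rate, trajectories sampled per update

for _ in range(500):
    probs = softmax(logits)
    # Sample K trajectories y_i ~ pi_theta
    actions = random.choices(range(3), weights=probs, k=K)
    rewards = [reward_of[a] for a in actions]
    b = sum(rewards) / K  # group-mean baseline (group-relative advantage)
    # REINFORCE: for a softmax policy, grad_j log pi(a) = 1{j==a} - probs[j]
    grad = [0.0, 0.0, 0.0]
    for a, r in zip(actions, rewards):
        for j in range(3):
            grad[j] += (r - b) * ((1.0 if j == a else 0.0) - probs[j]) / K
    logits = [l + lr * g for l, g in zip(logits, grad)]

best = max(range(3), key=lambda j: logits[j])
```

The policy concentrates on the highest-reward action; subtracting the group mean `b` plays the role of the baseline in the gradient expression above. The real method additionally uses sequence-level clipping and a KL penalty, omitted here for brevity.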

4. Experimental Design and Quantitative Results

SimpleSeg is validated on the referring expression segmentation benchmarks RefCOCO, RefCOCO+, and RefCOCOg, with cIoU (cumulative IoU: the ratio of total intersection to total union over all referring instances) and [email protected] (proportion of instances with IoU ≥ 0.5) as metrics. Experimental parameters include:

  • SFT: 1 epoch over 800K samples, learning rate $5 \times 10^{-5} \to 2 \times 10^{-6}$, batch size 256.
  • RL: GSPO for 2 epochs over 400K samples, learning rate $2 \times 10^{-6}$, clip range $\in [3 \times 10^{-4}, 4 \times 10^{-4}]$, KL regularization $\lambda_{\mathrm{KL}} = 0.01$.
  • Sparsification tolerance $\epsilon$ tuned for $\approx 200$ tokens per polygon.

Key comparative results:

| Method | cIoU (avg) | REC [email protected] |
|---|---|---|
| Decoder-free baseline | 71.4 | — |
| SimpleSeg (SFT+RL, no pre-train) | 73.6 | — |
| SimpleSeg (SFT+RL, pre-trained) | 74.8 | 87.2 |

SimpleSeg achieves cIoU and [email protected] comparable to, and frequently exceeding, decoder-based approaches, despite using zero extra parameters beyond the MLLM backbone.

5. Qualitative Assessment and Failure Analysis

SimpleSeg exhibits high mask fidelity, producing crisp, structurally faithful boundaries on a wide range of visual domains, including photographs, infographics, cartoons, and scientific diagrams. However, limitations include:

  • Highly convoluted shapes with holes: As object complexity increases, the required number of tokens for adequate contour sampling can exceed the model's token budget under a fixed $\epsilon$.
  • Sparsification effects: Thin structures and sharp corners may be inadequately captured when polygonal sampling under-represents curvature detail.
  • Format errors: The model occasionally emits an invalid polygon sequence (e.g., malformed JSON), which receives zero reward under the RL scheme.

Suggested further diagnostics include boundary F₁, vertex Chamfer distance, and token-per-mask rates for deeper evaluation of segmentation quality and efficiency.
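Vertex Chamfer distance, one of the suggested diagnostics, can be sketched in a few lines as the symmetric mean nearest-neighbor distance between the predicted and ground-truth vertex sets:

```python
import math

def chamfer(a, b):
    """Symmetric vertex Chamfer distance between two point sets:
    mean nearest-neighbor distance, averaged over both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(a, b) + one_way(b, a))
```

Unlike mask IoU, this penalizes vertex placement directly, so it surfaces the sparsification failures on thin structures and sharp corners noted above even when the filled masks still overlap well.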

6. Implications and Extensions

SimpleSeg demonstrates that standard MLLMs possess latent low-level spatial understanding, previously assumed to require dedicated vision modules. Its success with zero architectural modification challenges the necessity of task-specific decoders for pixel-accurate segmentation tasks.

Extensions facilitated by the unified point-sequence interface include:

  • Panoptic segmentation by outputting multiple polygons.
  • Point-to-mask (“SAM-style”) and box-to-mask conversions with a unified model prompt and output schema.
  • Applications in part segmentation, interactive segmentation, graphical user interface grounding, and image annotation.

Potential future work includes integrating point sequence generation as a pre-training objective, exploring alternative sequence-level metrics (e.g., boundary F₁, token-cost penalties), and scaling up model size and dataset diversity to further enhance spatial grounding and mask fidelity within the language modeling paradigm.

7. Summary

By representing segmentation masks as normalized point sequences, employing SFT to teach the model the polygon grammar and global structure, and refining with an IoU-driven RL stage, SimpleSeg enables accurate, decoder-free, pixel-level segmentation entirely within modern vision-LLMs (Song et al., 27 Jan 2026). This approach establishes a new paradigm for unifying spatial and linguistic understanding in MLLMs, revealing strong emergent spatial perception without additional architectural complexity.
