
SimpleSeg: Decoder-Free Segmentation

Updated 3 February 2026
  • The paper introduces SimpleSeg, a decoder-free method that reformulates pixel-level segmentation as a sequence generation problem using normalized 2D point coordinates.
  • SimpleSeg leverages standard autoregressive transformers along with a two-stage training pipeline, combining supervised fine-tuning and reinforcement learning with an IoU-based reward for geometric precision.
  • Experimental results on RefCOCO benchmarks demonstrate that SimpleSeg achieves competitive cIoU and [email protected] scores, highlighting the latent spatial capabilities of multimodal large language models.

SimpleSeg is a decoder-free method that formulates pixel-level segmentation within Multimodal LLMs (MLLMs) as a direct sequence generation problem. By emitting sequences of normalized 2D point coordinates to delineate object boundaries, SimpleSeg achieves accurate semantic segmentation entirely in the language modeling regime, with no architectural modifications or additional decoder modules. The approach leverages standard autoregressive transformers and exploits the intrinsic spatial capabilities of MLLMs through a two-stage training pipeline involving supervised fine-tuning and reinforcement learning with an intersection-over-union (IoU)-based geometric reward (Song et al., 27 Jan 2026).

1. Sequence-Based Segmentation Formulation

SimpleSeg reframes segmentation by asking the MLLM to generate, in text form, a variable-length sequence of 2D points that describe object boundaries. Given an input image $x$ and a grounding query (such as a referring expression or a seed point), the model predicts

$\texttt{[ [x_1, y_1], [x_2, y_2], \dots, [x_V, y_V] ]}$

where each $(x_i, y_i)$ is normalized to lie in $[0, 1]$ by the image width $W$ and height $H$. This structure is enforced via a minimal JSON-like grammar:

$[\,[x_1, y_1], [x_2, y_2], \ldots, [x_V, y_V]\,]$

Tokens are standard UTF-8 characters, so the output remains within the model's text space. Polygonal contours are extracted using algorithms such as Suzuki–Abe, enforced to traverse clockwise, and may be sparsified by a tolerance parameter $\epsilon$ to balance boundary fidelity against sequence length. The decoded polygon is then filled to yield a segmentation mask.
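The sparsification step can be illustrated with the standard Ramer–Douglas–Peucker simplifier, the usual $\epsilon$-tolerance polygon simplification; the paper does not name its exact algorithm, so this is an illustrative sketch:

```python
import math

def rdp(points, eps):
    """Ramer-Douglas-Peucker: drop vertices whose deviation from the
    chord between the endpoints is below eps, trading boundary
    fidelity against sequence length."""
    if len(points) < 3:
        return list(points)
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy) or 1.0
    # Find the interior point farthest (perpendicularly) from the chord.
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        px, py = points[i]
        d = abs(dy * (px - x0) - dx * (py - y0)) / norm
        if d > dmax:
            dmax, idx = d, i
    if dmax <= eps:
        # All interior points are within tolerance: keep only endpoints.
        return [points[0], points[-1]]
    # Otherwise split at the farthest point and recurse on both halves.
    left = rdp(points[:idx + 1], eps)
    right = rdp(points[idx:], eps)
    return left[:-1] + right
```

Raising `eps` removes near-collinear vertices (a nearly flat bump collapses to its endpoints) while genuinely sharp corners survive, which is the fidelity-versus-length trade-off the text describes.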

2. Model Architecture and Input/Output Interface

SimpleSeg operates on unmodified MLLMs, as demonstrated on exemplars such as Qwen2.5-VL-7B and Kimi-VL (a 2.8B parameter Mixture-of-Experts model). No new parameters or task-specific decoder heads are introduced; segmentation is handled entirely by adjusting the autoregressive text output policy of the backbone model. Input queries are handled through a family of "conversion tasks" over the 4-tuple $[\texttt{text}, \texttt{point}, \texttt{bbox}, \texttt{mask}]$, with prompts such as:

  • “Q: What is the bounding box of <text>? A: <bbox>.”
  • “Q: Give the polygon of the object at <point>. A: <mask>.”

During inference, image patch embeddings and the grounding query are prepended, and the model emits the coordinate sequence as output. Tokens are parsed into polygons, with sequence termination enforced by the final closing bracket. The number of points $V$ adapts to object complexity and is controlled by the sparsification tolerance $\epsilon$. Invalid token outputs (e.g., broken formatting) yield a zero "format reward" during RL-based training.
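The parsing and validity check can be sketched as follows, assuming the emitted sequence is literally JSON; the function names are illustrative, not from the paper:

```python
import json

def parse_polygon(text):
    """Parse a model output like '[[0.1, 0.2], [0.3, 0.4], ...]' into a
    list of (x, y) vertices; return None on any format violation."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list) or len(data) < 3:
        return None  # a polygon needs at least three vertices
    poly = []
    for pt in data:
        if (not isinstance(pt, list) or len(pt) != 2
                or not all(isinstance(v, (int, float)) for v in pt)):
            return None
        x, y = pt
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            return None  # coordinates must be normalized to [0, 1]
        poly.append((float(x), float(y)))
    return poly

def format_reward(text):
    """1 if the decoded polygon is structurally valid, 0 otherwise,
    mirroring the binary format reward used in the RL stage."""
    return 1 if parse_polygon(text) is not None else 0
```

A truncated sequence (missing the final closing bracket) fails `json.loads` and scores zero, which is why the text can treat the closing bracket as the termination signal.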

3. Supervised Fine-tuning and Reinforcement Learning Pipeline

The core training pipeline consists of two sequential stages:

Stage I: Supervised Fine-Tuning (SFT)

  • Training pairs consist of (image + query) inputs and ground-truth polygons extracted from datasets such as RefCOCO, RefCOCO+, RefCOCOg, and RefCLEF (≈800K samples).
  • Optimization target is standard token-level cross-entropy:

$\mathcal{L}_{\mathrm{SFT}} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)$

  • SFT teaches the grammar, object grounding, polygon closure, and traversal order.
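The SFT objective can be checked numerically with a toy example; the per-token probabilities below are made up for illustration:

```python
import math

def sft_loss(token_probs):
    """Token-level cross-entropy: L = -sum_t log p_theta(y_t | y_<t, x),
    given the model's probability assigned to each ground-truth token."""
    return -sum(math.log(p) for p in token_probs)

# Toy sequence of four target tokens and the probabilities the model
# assigns to them; confident predictions drive the loss toward zero.
loss = sft_loss([0.9, 0.8, 0.95, 0.7])
```

Each coordinate digit, comma, and bracket is an ordinary token here, which is what lets an unmodified language-model head learn the polygon grammar.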

Stage II: Reinforcement Learning (RL) with GSPO

  • Because many distinct polygon sequences decode to the same mask, polygon outputs are many-to-one with respect to masks; RL is therefore used to optimize geometry-aware metrics directly.
  • The policy $\pi_\theta(y \mid x)$ is fine-tuned from SFT parameters.
  • The total reward $R(y)$ includes:

    1. Mask IoU reward:

      $R_{\mathrm{IoU}} = \dfrac{|M(y) \cap M_{\mathrm{gt}}|}{|M(y) \cup M_{\mathrm{gt}}|}$

      truncated to $[0, 0.1]$, and zero if below a threshold $\tau$;

    2. Centroid distance reward (normalized MSE);
    3. Format reward (1 if the decoded polygon is valid JSON, 0 otherwise).
  • The RL objective with KL penalty (to prevent policy drift) is:

$J_{\mathrm{RL}}(\theta) = \mathbb{E}_{y \sim \pi_\theta}[R(y)] - \lambda_{\mathrm{KL}}\, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$

$\nabla_\theta J_{\mathrm{RL}} = \mathbb{E}_{y \sim \pi_\theta}\big[(R(y) - b)\, \nabla_\theta \log \pi_\theta(y \mid x)\big] - \lambda_{\mathrm{KL}}\, \nabla_\theta\, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$

where $b$ is a baseline that reduces the variance of the gradient estimate.

  • The Group Sequence Policy Optimization (GSPO) algorithm, a sequence-level variant of PPO-style policy optimization, is employed.
  • Empirically, the RL stage yields an increase of 9–10 gIoU points over SFT alone.
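The mask-IoU component of the reward can be sketched on rasterized masks as below; the threshold value and the exact interaction between thresholding and truncation are assumptions, since the text does not fully specify them:

```python
def mask_iou(pred, gt):
    """IoU of two binary masks, each given as a set of (row, col) pixels."""
    inter = len(pred & gt)
    union = len(pred | gt)
    return inter / union if union else 0.0

def iou_reward(pred, gt, tau=0.5, cap=0.1):
    """IoU-based reward as described in the text: zero if the IoU falls
    below a threshold tau, otherwise truncated to [0, cap].
    tau=0.5 is an assumed value, not taken from the paper."""
    iou = mask_iou(pred, gt)
    return 0.0 if iou < tau else min(iou, cap)
```

In the full reward this term is combined with the centroid-distance and format terms; their relative weights are not given in this summary, so they are omitted here.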

RL Training Pseudocode

initialize θ ← θ_SFT
for epoch in 1…N_rl:
  for batch in sample_batches(data):
    # Sample K trajectories from the current policy
    y_i ← π_θ(·|x) for i in 1…K
    # Compute the reward for each trajectory
    compute R_i for each y_i
    # Compute the policy gradient and update
    ∇θ J_RL ← policy_gradient(R_i, π_θ)
    θ ← θ + optimizer_step(∇θ J_RL)
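The loop above can be made concrete with a toy REINFORCE update using a group-mean baseline, in the spirit of GSPO's group-relative advantage; a three-action bandit stands in for sequence generation, and all numbers here are illustrative:

```python
import math
import random

random.seed(0)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Toy "policy": a categorical distribution over three candidate outputs,
# parameterized by logits; reward_of stands in for R(y).
logits = [0.0, 0.0, 0.0]
reward_of = [0.2, 0.9, 0.5]
lr, K = 0.2, 8  # learning rate, trajectories sampled per update

for _ in range(500):
    probs = softmax(logits)
    # Sample K trajectories y_i ~ pi_theta
    actions = random.choices(range(3), weights=probs, k=K)
    rewards = [reward_of[a] for a in actions]
    b = sum(rewards) / K  # group-mean baseline (group-relative advantage)
    # REINFORCE: for a softmax policy, grad_j log pi(a) = 1{j==a} - probs[j]
    grad = [0.0, 0.0, 0.0]
    for a, r in zip(actions, rewards):
        for j in range(3):
            grad[j] += (r - b) * ((1.0 if j == a else 0.0) - probs[j]) / K
    logits = [l + lr * g for l, g in zip(logits, grad)]

best = max(range(3), key=lambda j: logits[j])
```

The policy concentrates on the highest-reward action; subtracting the group mean `b` plays the role of the baseline in the gradient expression above. The real method additionally uses sequence-level clipping and a KL penalty, omitted here for brevity.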

4. Experimental Design and Quantitative Results

SimpleSeg is validated on the referring expression segmentation benchmarks RefCOCO, RefCOCO+, and RefCOCOg, with cIoU (cumulative IoU: the ratio of total intersection to total union over all referring instances) and [email protected] (proportion of instances with IoU ≥ 0.5) as metrics. Experimental parameters include:

  • SFT: 1 epoch over 800K samples, learning rate $5 \times 10^{-5} \to 2 \times 10^{-6}$, batch size 256.
  • RL: GSPO for 2 epochs over 400K samples, learning rate $2 \times 10^{-6}$, clip range $\in [3 \times 10^{-4}, 4 \times 10^{-4}]$, KL regularization $\lambda_{\mathrm{KL}} = 0.01$.
  • Sparsification tolerance $\epsilon$ tuned for $\approx 200$ tokens per polygon.

Key comparative results:

| Method | cIoU (avg) | REC [email protected] |
|---|---|---|
| Decoder-free baseline | 71.4 | — |
| SimpleSeg (SFT+RL, no pre-train) | 73.6 | — |
| SimpleSeg (SFT+RL, pre-trained) | 74.8 | 87.2 |

SimpleSeg achieves cIoU and [email protected] comparable to, and frequently exceeding, decoder-based approaches, despite using zero extra parameters beyond the MLLM backbone.

5. Qualitative Assessment and Failure Analysis

SimpleSeg exhibits high mask fidelity, producing crisp, structurally faithful boundaries on a wide range of visual domains, including photographs, infographics, cartoons, and scientific diagrams. However, limitations include:

  • Highly convoluted shapes with holes: As object complexity increases, the required number of tokens for adequate contour sampling can exceed the model's token budget under a fixed $\epsilon$.
  • Sparsification effects: Thin structures and sharp corners may be inadequately captured when polygonal sampling under-represents curvature detail.
  • Format errors: The model occasionally emits an invalid polygon sequence (e.g., malformed JSON), which receives zero reward under the RL scheme.

Suggested further diagnostics include boundary F₁, vertex Chamfer distance, and token-per-mask rates for deeper evaluation of segmentation quality and efficiency.
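Vertex Chamfer distance, one of the suggested diagnostics, can be sketched in a few lines as the symmetric mean nearest-neighbor distance between the predicted and ground-truth vertex sets:

```python
import math

def chamfer(a, b):
    """Symmetric vertex Chamfer distance between two point sets:
    mean nearest-neighbor distance, averaged over both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(a, b) + one_way(b, a))
```

Unlike mask IoU, this penalizes vertex placement directly, so it surfaces the sparsification failures on thin structures and sharp corners noted above even when the filled masks still overlap well.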

6. Implications and Extensions

SimpleSeg demonstrates that standard MLLMs possess latent low-level spatial understanding, previously assumed to require dedicated vision modules. Its success with zero architectural modification challenges the necessity of task-specific decoders for pixel-accurate segmentation tasks.

Extensions facilitated by the unified point-sequence interface include:

  • Panoptic segmentation by outputting multiple polygons.
  • Point-to-mask (“SAM-style”) and box-to-mask conversions with a unified model prompt and output schema.
  • Applications in part segmentation, interactive segmentation, graphical user interface grounding, and image annotation.

Potential future work includes integrating point sequence generation as a pre-training objective, exploring alternative sequence-level metrics (e.g., boundary F₁, token-cost penalties), and scaling up model size and dataset diversity to further enhance spatial grounding and mask fidelity within the language modeling paradigm.

7. Summary

By representing segmentation masks as normalized point sequences, employing SFT to teach the model the polygon grammar and global structure, and refining with an IoU-driven RL stage, SimpleSeg enables accurate, decoder-free, pixel-level segmentation entirely within modern vision-LLMs (Song et al., 27 Jan 2026). This approach establishes a new paradigm for unifying spatial and linguistic understanding in MLLMs, revealing strong emergent spatial perception without additional architectural complexity.
