SimpleSeg: Decoder-Free Segmentation
- The paper introduces SimpleSeg, a decoder-free method that reformulates pixel-level segmentation as a sequence generation problem using normalized 2D point coordinates.
- SimpleSeg leverages standard autoregressive transformers along with a two-stage training pipeline, combining supervised fine-tuning and reinforcement learning with an IoU-based reward for geometric precision.
- Experimental results on RefCOCO benchmarks demonstrate that SimpleSeg achieves competitive cIoU and [email protected] scores, highlighting the latent spatial capabilities of multimodal large language models.
SimpleSeg is a decoder-free method that formulates pixel-level segmentation within Multimodal LLMs (MLLMs) as a direct sequence generation problem. By emitting sequences of normalized 2D point coordinates to delineate object boundaries, SimpleSeg achieves accurate semantic segmentation entirely in the language modeling regime, with no architectural modifications or additional decoder modules. The approach leverages standard autoregressive transformers and exploits the intrinsic spatial capabilities of MLLMs through a two-stage training pipeline involving supervised fine-tuning and reinforcement learning with an intersection-over-union (IoU)-based geometric reward (Song et al., 27 Jan 2026).
1. Sequence-Based Segmentation Formulation
SimpleSeg reframes segmentation by asking the MLLM to generate, in text form, a variable-length sequence of 2D points that describe object boundaries. Given an input image and a grounding query (such as a referring expression or a seed point), the model predicts
$\texttt{[ [x_1, y_1], [x_2, y_2], \dots, [x_V, y_V] ]}$
where each coordinate $(x_i, y_i)$ is normalized to lie in $[0, 1]$ by the image width $W$ and height $H$. This structure is enforced via a minimal JSON-like grammar.
Tokens are standard UTF-8 characters, so the output remains within the model's text space. Polygonal contours are extracted using algorithms such as Suzuki–Abe, enforced to traverse clockwise, and may be sparsified by a tolerance parameter to balance boundary fidelity against sequence length. The decoded polygon is then filled to yield a segmentation mask.
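The round trip between pixel-space polygon vertices and the normalized textual point sequence can be sketched as follows (a minimal illustration; `polygon_to_sequence` and `sequence_to_polygon` are hypothetical helper names, not from the paper):

```python
import json

def polygon_to_sequence(points, width, height, ndigits=3):
    """Normalize pixel-space vertices to [0, 1] by image width/height and
    serialize them as the JSON-like point sequence the model emits."""
    norm = [[round(x / width, ndigits), round(y / height, ndigits)]
            for x, y in points]
    return json.dumps(norm)

def sequence_to_polygon(text, width, height):
    """Parse a generated point sequence back into pixel-space vertices."""
    return [(x * width, y * height) for x, y in json.loads(text)]

seq = polygon_to_sequence([(64, 32), (192, 32), (192, 160)], 256, 256)
# seq == "[[0.25, 0.125], [0.75, 0.125], [0.75, 0.625]]"
```

The decoded vertices would then be rasterized (e.g., with a scanline polygon fill) to produce the final binary mask.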
2. Model Architecture and Input/Output Interface
SimpleSeg operates on unmodified MLLMs, as demonstrated on exemplars such as Qwen2.5-VL-7B and Kimi-VL (a 2.8B parameter Mixture-of-Experts model). No new parameters or task-specific decoder heads are introduced; segmentation is handled entirely by adjusting the autoregressive text output policy of the backbone model. Input queries are handled through a family of "conversion tasks" over a 4-tuple of grounding representations, with prompts such as:
- “Q: What is the bounding box of <text>? A: <bbox>.”
- “Q: Give the polygon of the object at <point>? A: <mask>.”
During inference, image patch embeddings and the grounding query are prepended, and the model emits the coordinate sequence as output. Tokens are parsed into polygons, with sequence termination enforced by the final closing bracket. The number of points adapts to object complexity and is controlled by the sparsification tolerance. Invalid token outputs (e.g., broken formatting) yield a zero "format reward" during RL-based training.
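The tolerance-controlled sparsification can be illustrated with a plain Ramer–Douglas–Peucker pass, the standard simplifier that drops vertices whose removal perturbs the contour by less than a tolerance (a self-contained sketch of the open-curve variant; the paper's exact simplification routine may differ):

```python
import math

def _point_line_dist(p, a, b):
    # Perpendicular distance from point p to the infinite line through a and b.
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / norm

def sparsify(points, tol):
    """Ramer-Douglas-Peucker: recursively keep the vertex farthest from the
    chord between the endpoints whenever it deviates by more than `tol`."""
    if len(points) < 3:
        return list(points)
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_line_dist(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax <= tol:
        return [points[0], points[-1]]   # everything in between is droppable
    left = sparsify(points[: idx + 1], tol)
    right = sparsify(points[idx:], tol)
    return left[:-1] + right             # avoid duplicating the split vertex
```

A larger tolerance yields fewer vertices (shorter sequences) at the cost of boundary fidelity, which is exactly the trade-off described above.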
3. Supervised Fine-tuning and Reinforcement Learning Pipeline
The core training pipeline consists of two sequential stages:
Stage I: Supervised Fine-Tuning (SFT)
- Training pairs of (image + query) and ground-truth polygons extracted from datasets such as RefCOCO, RefCOCO+, RefCOCOg, and RefCLEF (≈800K samples).
- The optimization target is standard token-level cross-entropy over the ground-truth coordinate sequence: $\mathcal{L}_{\mathrm{SFT}} = -\sum_{t=1}^{T} \log \pi_\theta(y_t \mid y_{<t}, x)$, where $x$ denotes the image and query.
- SFT teaches the grammar, object grounding, polygon closure, and traversal order.
Stage II: Reinforcement Learning (RL) with GSPO
- Because polygon outputs are many-to-one with respect to masks, token-level likelihood is a poor proxy for mask quality; RL is therefore used to optimize geometry-aware metrics directly.
- The policy is fine-tuned from SFT parameters.
- The total reward includes:
- Mask IoU reward: $r_{\mathrm{IoU}} = \mathrm{IoU}(\hat{M}, M)$ between the predicted and ground-truth masks, truncated to $[0, 1]$ and set to zero if below a threshold $\tau$;
- Centroid distance reward (normalized MSE);
- Format reward (1 if the decoded polygon is valid JSON, 0 otherwise).
- The RL objective with a KL penalty (to prevent policy drift) is $J_{\mathrm{RL}}(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(y)\big] - \beta\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{SFT}}\big)$.
- Policy gradient update: $\nabla_\theta J_{\mathrm{RL}} \approx \frac{1}{K}\sum_{i=1}^{K} (R_i - b)\,\nabla_\theta \log \pi_\theta(y_i \mid x)$, estimated over $K$ sampled trajectories with a baseline $b$ for variance reduction.
- The Group Sequence Policy Optimization (GSPO) algorithm, a sequence-level variant of PPO, is employed for optimization.
- Empirically, the RL stage yields an increase of 9–10 gIoU points over SFT alone.
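The three reward components described above can be sketched as follows (a minimal numpy illustration operating on binary masks; the threshold value and helper names are assumptions, not the paper's exact choices):

```python
import json
import numpy as np

def format_reward(text):
    """1 if the output parses as a JSON list of at least 3 [x, y] pairs."""
    try:
        pts = json.loads(text)
        return 1.0 if len(pts) >= 3 and all(len(p) == 2 for p in pts) else 0.0
    except (TypeError, ValueError):
        return 0.0

def iou_reward(pred_mask, gt_mask, floor=0.2):
    """Mask IoU, truncated to [0, 1] and zeroed below the threshold `floor`
    (the paper's exact threshold is not reproduced here)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    iou = float(inter) / max(float(union), 1.0)
    return iou if iou >= floor else 0.0

def centroid_reward(pred_mask, gt_mask):
    """1 minus the normalized squared distance between mask centroids."""
    def centroid(m):
        ys, xs = np.nonzero(m)
        h, w = m.shape
        return np.array([ys.mean() / h, xs.mean() / w])
    if not pred_mask.any() or not gt_mask.any():
        return 0.0
    d2 = float(np.sum((centroid(pred_mask) - centroid(gt_mask)) ** 2))
    return max(0.0, 1.0 - d2)
```

A weighted sum of these terms would form the total reward $R$; a parse failure short-circuits everything to zero, matching the format-reward rule above.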
RL Training Pseudocode
```
initialize θ ← θ_SFT
for epoch in 1…N_rl:
    for batch in sample_batches(data):
        # Sample K trajectories
        y_i ∼ π_θ(·|x) for i in 1…K
        # Compute the batch of rewards
        compute R_i for each y_i
        # Compute gradient and update
        ∇θ J_RL ← policy_gradient(R_i, π_θ)
        θ ← θ + optimizer_step(∇θ J_RL)
```
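The policy-gradient step inside this loop can be made concrete with a toy REINFORCE update on a categorical softmax policy (a generic illustration, not GSPO itself; the bandit-style reward and hyperparameters are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_step(theta, reward_fn, K=8, lr=0.1):
    """One REINFORCE update on a categorical softmax policy pi_theta."""
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    actions = rng.choice(len(theta), size=K, p=probs)
    rewards = np.array([reward_fn(a) for a in actions], dtype=float)
    baseline = rewards.mean()          # variance-reduction baseline b
    grad = np.zeros_like(theta)
    for a, r in zip(actions, rewards):
        g = -probs.copy()              # grad log pi(a) = one_hot(a) - probs
        g[a] += 1.0
        grad += (r - baseline) * g
    return theta + lr * grad / K

# Toy task: reward 1 only for action 2; the policy should learn to pick it.
theta = np.zeros(3)
for _ in range(200):
    theta = reinforce_step(theta, lambda a: 1.0 if a == 2 else 0.0)
```

In SimpleSeg the "actions" are whole coordinate sequences and the reward is the composite IoU/centroid/format signal, but the sample-score-update structure is the same.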
4. Experimental Design and Quantitative Results
SimpleSeg is validated on the referring expression segmentation benchmarks RefCOCO, RefCOCO+, and RefCOCOg, with cIoU (cumulative IoU: total intersection over total union across all referring instances) and [email protected] (proportion of instances with IoU ≥ 0.5) as metrics. Experimental parameters include:
- SFT: 1 epoch over 800K samples, learning rate , batch size 256.
- RL: GSPO 2 epochs over 400K samples, learning rate , clip , KL regularization .
- Sparsification tolerance tuned for tokens per polygon.
Key comparative results:
| Method | cIoU (avg) | REC [email protected] |
|---|---|---|
| Decoder-free baseline | 71.4 | — |
| SimpleSeg (SFT+RL, no pre-train) | 73.6 | — |
| SimpleSeg (SFT+RL, pre-trained) | 74.8 | 87.2 |
SimpleSeg achieves cIoU and [email protected] comparable to, and frequently exceeding, decoder-based approaches, despite using zero extra parameters beyond the MLLM backbone.
5. Qualitative Assessment and Failure Analysis
SimpleSeg exhibits high mask fidelity, producing crisp, structurally faithful boundaries on a wide range of visual domains, including photographs, infographics, cartoons, and scientific diagrams. However, limitations include:
- Highly convoluted shapes with holes: as object complexity increases, the number of tokens required for adequate contour sampling can exceed the model's token budget under a fixed sparsification tolerance.
- Sparsification effects: Thin structures and sharp corners may be inadequately captured when polygonal sampling under-represents curvature detail.
- Format errors: an invalid polygon sequence (e.g., malformed JSON) cannot be decoded into a mask and receives zero reward under the RL scheme.
Suggested further diagnostics include boundary F₁, vertex Chamfer distance, and token-per-mask rates for deeper evaluation of segmentation quality and efficiency.
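Of the suggested diagnostics, the vertex Chamfer distance is straightforward to compute; a brute-force sketch (an illustrative helper, not from the paper):

```python
import math

def chamfer_distance(P, Q):
    """Symmetric vertex Chamfer distance between two polygons' vertex sets.

    Brute-force O(|P|*|Q|): average nearest-neighbor distance from each set
    to the other, then the mean of the two directions."""
    def one_way(A, B):
        return sum(min(math.dist(a, b) for b in B) for a in A) / len(A)
    return 0.5 * (one_way(P, Q) + one_way(Q, P))
```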
6. Implications and Extensions
SimpleSeg demonstrates that standard MLLMs possess latent low-level spatial understanding, previously assumed to require dedicated vision modules. Its success with zero architectural modification challenges the necessity of task-specific decoders for pixel-accurate segmentation tasks.
Extensions facilitated by the unified point-sequence interface include:
- Panoptic segmentation by outputting multiple polygons.
- Point-to-mask (“SAM-style”) and box-to-mask conversions with a unified model prompt and output schema.
- Applications in part segmentation, interactive segmentation, graphical user interface grounding, and image annotation.
Potential future work includes integrating point sequence generation as a pre-training objective, exploring alternative sequence-level metrics (e.g., boundary F₁, token-cost penalties), and scaling up model size and dataset diversity to further enhance spatial grounding and mask fidelity within the language modeling paradigm.
7. Summary
By representing segmentation masks as normalized point sequences, employing SFT to teach the model the polygon grammar and global structure, and refining with an IoU-driven RL stage, SimpleSeg enables accurate, decoder-free, pixel-level segmentation entirely within modern vision-LLMs (Song et al., 27 Jan 2026). This approach establishes a new paradigm for unifying spatial and linguistic understanding in MLLMs, revealing strong emergent spatial perception without additional architectural complexity.