SAMTok: Scalable Discrete Masking for MLLMs
- SAMTok is a discrete mask representation paradigm that converts arbitrary segmentation masks into fixed two-token sequences for multi-modal language models.
- Its architecture uses a compact VAE with a two-stage residual vector quantizer that achieves ~70% IoU reconstruction fidelity without heavy region-level encoders.
- The method enables end-to-end training using cross-entropy and Dice losses, with reinforcement learning enhancements boosting performance on segmentation and VQA benchmarks.
SAMTok is a discrete mask representation paradigm designed to provide scalable, language-native pixel-wise capabilities in multi-modal LLMs (MLLMs). It encodes any arbitrary segmentation mask into a fixed two-token sequence, thereby transforming region-based visual tasks into standard autoregressive language modeling tasks. This approach removes the reliance on region-level encoders, specialized segmentation decoders, and task-specific loss designs, enabling efficient multi-task training and reinforcement learning on vision-language benchmarks without structural modifications to the underlying model (Zhou et al., 22 Jan 2026).
1. Motivation and Precedents
Scaling pixel-wise MLLMs for tasks such as segmentation, region-level question answering, and interactive annotation has been obstructed by the architectural and learning complexities inherent to prior systems. Historically, such models have relied on heavy region-level encoders (e.g., region-of-interest pooling, masked attention), which increase inference costs and mandate joint training with additional parameters. Outputting masks typically requires a model-specific segmentation head and custom loss functions like Dice, IoU, or cross-entropy, thus fragmenting the learning process and complicating deployment across diverse tasks.
Continuous mask decoders inhibit standard discrete-action RL algorithms (such as Proximal Policy Optimization), and text-based mask representations—such as polygonal or run-length encoding—carry significant token overhead, often requiring hundreds of tokens per mask, substantially degrading autoregressive decoding performance. The SAMTok formulation circumvents these obstacles by conceptualizing binary masks as atomic language tokens, each mask distilled to a fixed-length, content-adaptive symbolic code.
2. Architecture and Tokenization Scheme
SAMTok is architecturally organized as a compact variational autoencoder (VAE), comprising a mask encoder, a two-stage residual vector quantizer, and a mask decoder. Both encoder and decoder are initialized from SAM2.
Mask Encoder:
- Accepts an input image and a binary region mask.
- Produces a latent mask embedding by combining a frozen image encoder, a dense SAM prompt encoder, and a mask decoder run without final binarization.
Residual Vector Quantization:
- Two-step process on a shared codebook of size 256.
- The latent is first quantized to a level-1 code; the residual (the latent minus the selected codebook entry) is then quantized to a level-2 code against the same codebook.
- The two codebook indices are mapped to newly introduced language tokens: 256 tokens for level-1 and 256 for level-2, 512 in total.
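The two-step lookup described above can be sketched as follows. The codebook size of 256 comes from the paper; the latent dimension and the random codebook are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 256, 32                      # K = 256 per the paper; latent dim D is assumed
codebook = rng.normal(size=(K, D))  # one codebook shared by both quantization levels

def nearest_code(vec, codebook):
    """Index of the closest codebook entry under L2 distance."""
    return int(np.argmin(np.linalg.norm(codebook - vec, axis=1)))

def encode(z, codebook):
    """Two-step residual quantization: a coarse code, then a code for the residual."""
    c1 = nearest_code(z, codebook)   # level-1: coarse shape
    r = z - codebook[c1]             # what the first code failed to capture
    c2 = nearest_code(r, codebook)   # level-2: fine detail, same codebook
    return c1, c2

def decode(c1, c2, codebook):
    """Approximate reconstruction of the latent from the two codes."""
    return codebook[c1] + codebook[c2]

z = rng.normal(size=D)
c1, c2 = encode(z, codebook)
z_hat = decode(c1, c2, codebook)
```

Because both levels index the same 256-entry codebook, a mask always costs exactly two tokens while the pair of indices can address 256² distinct codes.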
Mask Decoder:
- Given the level-1 and level-2 code embeddings, reconstructs the mask.
- Both codes are fused into SAM’s prompt encoder, then decoded to dense logits.
Justification for Two Tokens:
A single pass over the 256-entry codebook encodes coarse shape, while the second, residual pass over the same codebook encodes fine detail. Empirically, this two-stage quantization achieves approximately 70% IoU reconstruction on held-out data, balancing reconstruction fidelity against vocabulary expansion.
3. Training Regimen
3.1 Mask Reconstruction Pretraining
SAMTok is pretrained to reconstruct input masks using a combination of cross-entropy and Dice losses on 209 million masks from SAM2 and additional datasets (e.g., ADE20K, Cityscapes, EntitySeg). Training freezes the image encoder and fine-tunes only the SAM decoder and quantizer with AdamW at a batch size of 1024.
The total loss sums the cross-entropy and Dice reconstruction terms with a codebook commitment loss that pulls encoder outputs toward their assigned codebook entries.
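The two reconstruction terms can be written as a minimal sketch on soft mask probabilities (the Dice weighting and epsilon are illustrative defaults, not the paper's values):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss over mask probabilities in [0, 1]."""
    inter = (pred * target).sum()
    return float(1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps))

def bce_loss(pred, target, eps=1e-6):
    """Pixel-wise binary cross-entropy with clipping for numerical stability."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def mask_recon_loss(pred, target, lam_dice=1.0):
    """Combined reconstruction objective; lam_dice is an assumed weighting."""
    return bce_loss(pred, target) + lam_dice * dice_loss(pred, target)
```

A perfect prediction drives both terms to (near) zero, while Dice keeps the gradient informative even when foreground pixels are rare.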
3.2 MLLM Fine-Tuning
The QwenVL model family (3B, 4B, 7B) incorporates the expanded vocabulary for mask tokens. Supervised fine-tuning is performed on 5 million multimodal samples spanning region captioning, region-level VQA, panoptic scene graph construction, referring segmentation, grounded conversation, and multi-round interactive segmentation. Task prompts splice the two-token mask codes directly into the text stream and are supervised via standard next-token prediction.
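Splicing a mask reference into a prompt amounts to string interpolation over the two token codes. The token names `<mask_l1_*>` / `<mask_l2_*>` below are hypothetical stand-ins, not the paper's actual vocabulary entries:

```python
def region_ref(c1, c2):
    """Render a two-token mask reference; token spelling is illustrative only."""
    return f"<mask_l1_{c1}><mask_l2_{c2}>"

# The code pair 11-347 echoes the region-reference example mentioned later in the text.
prompt = f"Describe the region {region_ref(11, 347)} in the image."
```

Because the reference is ordinary text, the same prompt format serves both mask understanding (codes in the input) and mask generation (codes in the output).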
3.3 Reinforcement Learning
Discrete tokenization allows direct application of text-based RL. SAMTok introduces an “answer-matching” reward that scores a response by the number of true-positive mask-token pairs it reproduces. Group Relative Policy Optimization (GRPO) is used for policy learning. RL improves GRES benchmark gIoU and N-acc, as well as GCG AP50 and Recall.
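One plausible reading of the answer-matching reward is the fraction of ground-truth mask-token pairs exactly reproduced by the model; the source does not give the exact formula, so the normalization below is an assumption:

```python
def answer_matching_reward(pred_pairs, gt_pairs):
    """Fraction of ground-truth two-token mask codes that appear in the prediction.
    Exact-match counting over (level-1, level-2) index pairs; normalization assumed."""
    if not gt_pairs:
        return 0.0
    matched = len(set(pred_pairs) & set(gt_pairs))  # true-positive mask-token pairs
    return matched / len(gt_pairs)
```

Because the reward depends only on emitted token pairs, it plugs directly into discrete-action policy optimization such as GRPO with no differentiable mask decoder in the loop.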
4. Downstream Integration and Empirical Performance
4.1 Model Integration
The only architectural change to QwenVL models is expansion of the token embedding matrix (512 mask tokens plus delimiters). No downstream adapters are introduced; mask tokens are spliced into prompts as described above. This permits direct application to mask-understanding and mask-generation tasks in a uniform, text-centric format.
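Expanding the token embedding matrix is the whole architectural change, and can be sketched as appending rows for the new tokens (the embedding dimension, base vocabulary size, delimiter count, and mean-centered initialization are all illustrative assumptions):

```python
import numpy as np

def expand_embeddings(emb, n_new, rng):
    """Append rows for new tokens, initialized near the mean of existing embeddings
    (a common heuristic; the paper's actual initialization is not specified here)."""
    mean, std = emb.mean(axis=0), emb.std()
    new_rows = rng.normal(loc=mean, scale=0.02 * std, size=(n_new, emb.shape[1]))
    return np.vstack([emb, new_rows])

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 64))          # toy base vocabulary and embedding dim
vocab = expand_embeddings(base, 512 + 2, rng)  # 512 mask tokens + 2 delimiters (delimiter count assumed)
```

No other weights change, which is why the same checkpoint recipe applies uniformly across the 3B, 4B, and 7B QwenVL variants.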
4.2 Quantitative Results
SAMTok-equipped QwenVL models demonstrate state-of-the-art or near-equivalent results on a range of pixel-wise multimodal benchmarks:
| Task | QwenVL-SAMTok Result | Previous Best |
|---|---|---|
| GCG METEOR | 17.2 | 16.4 |
| GCG CIDEr | 54.8 | 49.5 |
| GCG AP50 | 38.2 | 33.2 |
| GCG mIoU | 72.6 | 67.7 |
| MR-RefCOCOg cIoU (Rounds 2–6) | ~84% | ~76% |
| GRES Mean gIoU | 73.6% | 73.9% |
| RefCOCOg cIoU | ~79% | (matches or exceeds) |
| REC box-acc | 92.7% | 89.1% |
| PSG R@20 | 19.8% | 20.6% |
| DLC-Bench Avg | 65.6 | 67.3 |
These results hold across multi-round segmentation, region-level understanding, text-mask interleaved dialogue, and panoptic scene graph tasks.
4.3 Qualitative Observations
- Region references (e.g., 11-347) provide unambiguous mask pointers in dialogue and scene graph contexts.
- Multi-round segmentation preserves token consistency, enabling persistent region identification across dialogue rounds.
- Scene graph construction and region-level tasks employ distinct two-token codes as anchors for linking text, spatial, and segmentation cues.
5. Technical Analysis and Future Considerations
5.1 Tokenization Trade-Offs
The RQ (Residual Vector Quantizer) design, with a 256-entry codebook and two quantization steps, yields a favorable trade-off, matching FSQ(65536) IoU with only 512 added tokens. Increasing the codebook size or the number of quantization steps slightly improves IoU but exponentially increases the token search space, impeding model learning and decoding.
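The arithmetic behind this trade-off is simple: the vocabulary grows linearly with the number of quantization steps, while the number of addressable codes grows exponentially.

```python
K, steps = 256, 2            # codebook size and residual quantization depth

added_vocab = K * steps      # new LM tokens required: 512
distinct_masks = K ** steps  # addressable mask codes: 65536, matching FSQ(65536)
```

A third step would only raise the vocabulary to 768 tokens yet address 256³ ≈ 16.8M codes, which is why deeper quantization trades a modest IoU gain for a much harder decoding problem.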
5.2 Limitations
- SAMTok is currently restricted to still-image masks; temporal mask quantization for video remains an open area.
- Prompts for non-mask visual primitives (points, boxes, lines) are not yet tokenized under the same schema.
- Fixed two-token capacity may limit fidelity for highly complex or detailed masks; an adaptive-length coding scheme could alleviate this.
5.3 Potential Extensions
- Hybrid quantizers with entropy-based pruning for more dynamic code utilization.
- Subword-like compositional tokens to represent complex or novel shapes.
- Domain adaptation via joint end-to-end fine-tuning of SAMTok and MLLM for specialized fields such as medical imaging or remote sensing.
6. Concluding Appraisal
SAMTok represents a unified, text-centric approach for integrating pixel-wise reasoning into multi-modal LLMs. Its fixed two-token representation of arbitrary masks enables efficient, architecture-agnostic extension to a spectrum of region-based vision-language tasks. Discrete tokenization not only simplifies training and inference but also makes such tasks amenable to standard Language Modeling and RL techniques, resulting in benchmark-matching or surpassing performance. This paradigm foregrounds a scalable pathway toward language-native, pixel-level multimodal intelligence (Zhou et al., 22 Jan 2026).