SAMTok: Scalable Discrete Masking for MLLMs
- SAMTok is a discrete mask representation paradigm that converts arbitrary segmentation masks into fixed two-token sequences for multi-modal language models.
- Its architecture uses a compact VAE with a two-stage residual vector quantizer that achieves ~70% IoU reconstruction fidelity without heavy region-level encoders.
- The method enables end-to-end training using cross-entropy and Dice losses, with reinforcement learning enhancements boosting performance on segmentation and VQA benchmarks.
SAMTok is a discrete mask representation paradigm designed to provide scalable, language-native pixel-wise capabilities in multi-modal LLMs (MLLMs). It encodes any arbitrary segmentation mask into a fixed two-token sequence, thereby transforming region-based visual tasks into standard autoregressive language modeling tasks. This approach removes the reliance on region-level encoders, specialized segmentation decoders, and task-specific loss designs, enabling efficient multi-task training and reinforcement learning on vision-language benchmarks without structural modifications to the underlying model (Zhou et al., 22 Jan 2026).
1. Motivation and Precedents
Scaling pixel-wise MLLMs for tasks such as segmentation, region-level question answering, and interactive annotation has been obstructed by the architectural and learning complexities inherent to prior systems. Historically, such models have relied on heavy region-level encoders (e.g., region-of-interest pooling, masked attention), which increase inference costs and mandate joint training with additional parameters. Outputting masks typically requires a model-specific segmentation head and custom loss functions like Dice, IoU, or cross-entropy, thus fragmenting the learning process and complicating deployment across diverse tasks.
Continuous mask decoders inhibit standard discrete-action RL algorithms (such as Proximal Policy Optimization), and text-based mask representations—such as polygonal or run-length encoding—carry significant token overhead, often requiring hundreds of tokens per mask, substantially degrading autoregressive decoding performance. The SAMTok formulation circumvents these obstacles by conceptualizing binary masks as atomic language tokens, each mask distilled to a fixed-length, content-adaptive symbolic code.
2. Architecture and Tokenization Scheme
SAMTok is architecturally organized as a compact variational autoencoder (VAE), comprising a mask encoder, a two-stage residual vector quantizer, and a mask decoder. Both encoder and decoder are initialized from SAM2.
Mask Encoder:
- Accepts an input image and a binary region mask.
- Produces a latent mask embedding by combining a frozen image encoder, a dense SAM prompt encoder, and a mask decoder run without final binarization.
Residual Vector Quantization:
- Two-step process on a shared codebook of size 256.
- The latent is first quantized to a level-1 code; the residual (the latent minus the selected codebook entry) is then quantized to a level-2 code against the same codebook.
- The two codebook indices are mapped to newly introduced language tokens: 256 tokens for level-1 and 256 for level-2, 512 in total.
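The two-step lookup described above can be sketched as follows. The codebook size of 256 comes from the paper; the latent dimension and the random codebook are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 256, 32                      # K = 256 per the paper; latent dim D is assumed
codebook = rng.normal(size=(K, D))  # one codebook shared by both quantization levels

def nearest_code(vec, codebook):
    """Index of the closest codebook entry under L2 distance."""
    return int(np.argmin(np.linalg.norm(codebook - vec, axis=1)))

def encode(z, codebook):
    """Two-step residual quantization: a coarse code, then a code for the residual."""
    c1 = nearest_code(z, codebook)   # level-1: coarse shape
    r = z - codebook[c1]             # what the first code failed to capture
    c2 = nearest_code(r, codebook)   # level-2: fine detail, same codebook
    return c1, c2

def decode(c1, c2, codebook):
    """Approximate reconstruction of the latent from the two codes."""
    return codebook[c1] + codebook[c2]

z = rng.normal(size=D)
c1, c2 = encode(z, codebook)
z_hat = decode(c1, c2, codebook)
```

Because both levels index the same 256-entry codebook, a mask always costs exactly two tokens while the pair of indices can address 256² distinct codes.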
Mask Decoder:
- Given the level-1 and level-2 code embeddings, reconstructs the mask.
- Both codes are fused into SAM’s prompt encoder, then decoded to dense logits.
Justification for Two Tokens:
A single pass over the 256-entry codebook encodes coarse shape, while the second, residual pass over the same codebook encodes fine detail. Empirically, this two-stage quantization achieves approximately 70% IoU reconstruction on held-out data, balancing reconstruction fidelity against vocabulary expansion.
3. Training Regimen
3.1 Mask Reconstruction Pretraining
SAMTok is pretrained to reconstruct input masks using a combination of cross-entropy and Dice losses on 209 million masks from SAM2 and additional datasets (e.g., ADE20K, Cityscapes, EntitySeg). Training freezes the image encoder and fine-tunes only the SAM decoder and quantizer with AdamW at a batch size of 1024.
The total loss sums the cross-entropy and Dice reconstruction terms with a codebook commitment loss that pulls encoder outputs toward their assigned codebook entries.
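The two reconstruction terms can be written as a minimal sketch on soft mask probabilities (the Dice weighting and epsilon are illustrative defaults, not the paper's values):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss over mask probabilities in [0, 1]."""
    inter = (pred * target).sum()
    return float(1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps))

def bce_loss(pred, target, eps=1e-6):
    """Pixel-wise binary cross-entropy with clipping for numerical stability."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def mask_recon_loss(pred, target, lam_dice=1.0):
    """Combined reconstruction objective; lam_dice is an assumed weighting."""
    return bce_loss(pred, target) + lam_dice * dice_loss(pred, target)
```

A perfect prediction drives both terms to (near) zero, while Dice keeps the gradient informative even when foreground pixels are rare.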
3.2 MLLM Fine-Tuning
The QwenVL model family (3B, 4B, 7B) incorporates the expanded vocabulary for mask tokens. Supervised fine-tuning is performed on 5 million multimodal samples spanning region captioning, region-level VQA, panoptic scene graph construction, referring segmentation, grounded conversation, and multi-round interactive segmentation. Task prompts splice the two-token mask codes directly into the text stream and are supervised via standard next-token prediction.
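Splicing a mask reference into a prompt amounts to string interpolation over the two token codes. The token names `<mask_l1_*>` / `<mask_l2_*>` below are hypothetical stand-ins, not the paper's actual vocabulary entries:

```python
def region_ref(c1, c2):
    """Render a two-token mask reference; token spelling is illustrative only."""
    return f"<mask_l1_{c1}><mask_l2_{c2}>"

# The code pair 11-347 echoes the region-reference example mentioned later in the text.
prompt = f"Describe the region {region_ref(11, 347)} in the image."
```

Because the reference is ordinary text, the same prompt format serves both mask understanding (codes in the input) and mask generation (codes in the output).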
3.3 Reinforcement Learning
Discrete tokenization allows direct application of text-based RL. SAMTok introduces an “answer-matching” reward that scores a response by the number of true-positive mask-token pairs it reproduces. Group Relative Policy Optimization (GRPO) is used for policy learning. RL improves GRES benchmark gIoU and N-acc, as well as GCG AP50 and Recall.
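One plausible reading of the answer-matching reward is the fraction of ground-truth mask-token pairs exactly reproduced by the model; the source does not give the exact formula, so the normalization below is an assumption:

```python
def answer_matching_reward(pred_pairs, gt_pairs):
    """Fraction of ground-truth two-token mask codes that appear in the prediction.
    Exact-match counting over (level-1, level-2) index pairs; normalization assumed."""
    if not gt_pairs:
        return 0.0
    matched = len(set(pred_pairs) & set(gt_pairs))  # true-positive mask-token pairs
    return matched / len(gt_pairs)
```

Because the reward depends only on emitted token pairs, it plugs directly into discrete-action policy optimization such as GRPO with no differentiable mask decoder in the loop.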
4. Downstream Integration and Empirical Performance
4.1 Model Integration
The only architectural change to QwenVL models is expansion of the token embedding matrix (512 mask tokens plus delimiters). No downstream adapters are introduced; mask tokens are spliced into prompts as described above. This permits direct application to mask-understanding and mask-generation tasks in a uniform, text-centric format.
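Expanding the token embedding matrix is the whole architectural change, and can be sketched as appending rows for the new tokens (the embedding dimension, base vocabulary size, delimiter count, and mean-centered initialization are all illustrative assumptions):

```python
import numpy as np

def expand_embeddings(emb, n_new, rng):
    """Append rows for new tokens, initialized near the mean of existing embeddings
    (a common heuristic; the paper's actual initialization is not specified here)."""
    mean, std = emb.mean(axis=0), emb.std()
    new_rows = rng.normal(loc=mean, scale=0.02 * std, size=(n_new, emb.shape[1]))
    return np.vstack([emb, new_rows])

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 64))          # toy base vocabulary and embedding dim
vocab = expand_embeddings(base, 512 + 2, rng)  # 512 mask tokens + 2 delimiters (delimiter count assumed)
```

No other weights change, which is why the same checkpoint recipe applies uniformly across the 3B, 4B, and 7B QwenVL variants.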
4.2 Quantitative Results
SAMTok-equipped QwenVL models demonstrate state-of-the-art or near-equivalent results on a range of pixel-wise multimodal benchmarks:
| Task | QwenVL-SAMTok Result | Previous Best |
|---|---|---|
| GCG METEOR | 17.2 | 16.4 |
| GCG CIDEr | 54.8 | 49.5 |
| GCG AP50 | 38.2 | 33.2 |
| GCG mIoU | 72.6 | 67.7 |
| MR-RefCOCOg cIoU (Rounds 2–6) | ~84% | ~76% |
| GRES Mean gIoU | 73.6% | 73.9% |
| RefCOCOg cIoU | ~79% | (matches or exceeds) |
| REC box-acc | 92.7% | 89.1% |
| PSG R@20 | 19.8% | 20.6% |
| DLC-Bench Avg | 65.6 | 67.3 |
These results hold across multi-round segmentation, region-level understanding, text-mask interleaved dialogue, and panoptic scene graph tasks.
4.3 Qualitative Observations
- Region references (e.g., 11-347) provide unambiguous mask pointers in dialogue and scene graph contexts.
- Multi-round segmentation preserves token consistency, enabling persistent region identification across dialogue rounds.
- Scene graph construction and region-level tasks employ distinct two-token codes as anchors for linking text, spatial, and segmentation cues.
5. Technical Analysis and Future Considerations
5.1 Tokenization Trade-Offs
The RQ (Residual Vector Quantizer) design, with a 256-entry codebook and two quantization steps, yields a favorable trade-off, matching FSQ(65536) IoU with only 512 added tokens. Increasing the codebook size or the number of quantization steps slightly improves IoU but exponentially increases the token search space, impeding model learning and decoding.
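The arithmetic behind this trade-off is simple: the vocabulary grows linearly with the number of quantization steps, while the number of addressable codes grows exponentially.

```python
K, steps = 256, 2            # codebook size and residual quantization depth

added_vocab = K * steps      # new LM tokens required: 512
distinct_masks = K ** steps  # addressable mask codes: 65536, matching FSQ(65536)
```

A third step would only raise the vocabulary to 768 tokens yet address 256³ ≈ 16.8M codes, which is why deeper quantization trades a modest IoU gain for a much harder decoding problem.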
5.2 Limitations
- SAMTok is currently restricted to still-image masks; temporal mask quantization for video remains an open area.
- Prompts for non-mask visual primitives (points, boxes, lines) are not yet tokenized under the same schema.
- Fixed two-token capacity may limit fidelity for highly complex or detailed masks; an adaptive-length coding scheme could alleviate this.
5.3 Potential Extensions
- Hybrid quantizers with entropy-based pruning for more dynamic code utilization.
- Subword-like compositional tokens to represent complex or novel shapes.
- Domain adaptation via joint end-to-end fine-tuning of SAMTok and MLLM for specialized fields such as medical imaging or remote sensing.
6. Concluding Appraisal
SAMTok represents a unified, text-centric approach for integrating pixel-wise reasoning into multi-modal LLMs. Its fixed two-token representation of arbitrary masks enables efficient, architecture-agnostic extension to a spectrum of region-based vision-language tasks. Discrete tokenization not only simplifies training and inference but also makes such tasks amenable to standard Language Modeling and RL techniques, resulting in benchmark-matching or surpassing performance. This paradigm foregrounds a scalable pathway toward language-native, pixel-level multimodal intelligence (Zhou et al., 22 Jan 2026).