
Native Visual Tokenization

Updated 6 February 2026
  • Native visual tokenization is a technique that transforms visual inputs into discrete, semantically interpretable tokens that align with linguistic structures.
  • Methodologies such as region-centric extraction, adaptive token length control, and codebook quantization enhance processing efficiency and task performance.
  • Applications span vision–language understanding, autoregressive generation, and multimodal transfer learning, promising more robust unified models.

Native visual tokenization refers to methodologies that convert visual (or multimodal) input into discrete representation units—tokens—that are directly consumed by transformer-based models or multimodal LLMs (MLLMs), analogous to subwords in natural language processing. The aim is not only to interface images (or "visual-text") with generative and discriminative architectures, but also to capture meaningful, structured, and semantically interpretable units, improving efficiency, expressivity, and alignment with linguistic or symbolic tokens. Techniques under this paradigm span vision-centric text reading, object-centric image slotting, region-level visual grounding, information-theoretic compression, adaptive content-based length control, factorized codebook quantization, and lattice-based vector quantization. Native visual tokenization seeks to bridge the gap between vision and language, offering discrete, contextually relevant, and computationally efficient units that enhance both understanding and generation tasks across modalities.

1. Conceptual Foundations of Native Visual Tokenization

Conventional visual tokenization has relied on uniformly sized, fixed-grid patch extraction (e.g., ViT's 16×16 patches), leading to tokens that may mix disparate semantics, ignore object boundaries, or frustrate hierarchical linguistic mapping (Lew et al., 2024). In language, tokenization is intrinsically tied to semantic boundaries (words, subwords) and avoids splitting across logical units. Native visual tokenization seeks to replicate this semantic integrity by creating tokens with minimal concept entanglement and maximal interpretability.
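For contrast, the fixed-grid baseline is easy to state precisely. Below is a minimal sketch of ViT-style patch extraction (the function name `patchify` is illustrative; numpy only):

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Fixed-grid ViT-style tokenization: split an H x W x C image into
    non-overlapping patch x patch tiles and flatten each tile into one
    token vector. H and W must be divisible by `patch`."""
    h, w, c = image.shape
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)       # (rows, cols, p, p, c)
    return tiles.reshape(-1, patch * patch * c)  # (num_tokens, token_dim)

# A 224 x 224 RGB image yields (224/16)^2 = 196 tokens of dimension 16*16*3 = 768.
tokens = patchify(np.zeros((224, 224, 3)))
```

Each tile is cut purely by grid position, so a single token can straddle an object boundary — exactly the entanglement that native tokenizers aim to avoid.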

Motivations for this approach include:

  • Semantic integrity: tokens that respect object and concept boundaries rather than arbitrary grid cells.
  • Efficiency: fewer, more informative tokens reduce sequence length and compute.
  • Cross-modal alignment: discrete visual units that interface naturally with linguistic or symbolic tokens.

2. Methodological Taxonomy and Architectures

Native visual tokenization encompasses a broad spectrum of methods, including the following design paradigms:

a) Vision-Centric Text Tokenization

SeeTok (Xing et al., 21 Oct 2025) replaces subword segmentation for text by rendering the input string as an image and then passing it through a vision encoder. After patch extraction and MLP-based aggregation, the resulting visual tokens replace text tokens in the transformer input, allowing models to "read" text as visual objects. This approach leverages OCR priors in the vision backbone and demonstrates that natural language understanding can be driven by visual representation alone.
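The pipeline above can be sketched end to end. Everything here is a toy stand-in — the rasterizer is stubbed with deterministic per-character noise glyphs and the "MLP" is a random, untrained projection — not SeeTok's actual implementation:

```python
import numpy as np

def render_text(text: str, glyph: int = 16) -> np.ndarray:
    """Stand-in for a real rasterizer: each character becomes a
    deterministic glyph-sized bitmap (a real system would render the
    string with an actual font)."""
    glyphs = [np.random.default_rng(ord(ch)).random((glyph, glyph))
              for ch in text]
    return np.concatenate(glyphs, axis=1)        # one text-line image

def visual_text_tokens(text: str, glyph: int = 16, dim: int = 64,
                       chars_per_token: int = 4) -> np.ndarray:
    """SeeTok-style pipeline sketch: render text as an image, cut it into
    patches, then aggregate neighbouring patches with an (untrained) MLP
    stand-in so several characters share one visual token."""
    img = render_text(text, glyph)
    n = img.shape[1] // glyph
    patches = img.reshape(glyph, n, glyph).transpose(1, 0, 2)
    patches = patches.reshape(n, glyph * glyph)
    assert n % chars_per_token == 0, "a real system would pad here"
    groups = patches.reshape(n // chars_per_token,
                             chars_per_token * glyph * glyph)
    w = np.random.default_rng(0).standard_normal((groups.shape[1], dim))
    return np.tanh(groups @ w)                   # (n_tokens, dim)
```

With `chars_per_token=4`, an 8-character string becomes 2 visual tokens — illustrating how grouping characters into shared visual units yields the multi-fold token reduction the paper reports.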

b) Object / Region-Centric Tokenization

Object-centric methods use specialized attention mechanisms to carve images into discrete tokens aligned with salient regions:

  • Slot-MLLM (Chi et al., 23 May 2025) employs Slot Attention, where learnable slot vectors compete to attend to different regions, producing a set of object-level embeddings that are discretized via residual VQ and autoregressively integrated into the LLM token stream.
  • Groma (Ma et al., 2024) uses a DETR-style region proposal network to generate explicit ROI tokens that encode both regional appearance and spatial context, enabling precise region-level grounding.
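The competitive-attention idea behind slot-based tokenizers can be sketched in a few lines. This is a stripped-down version without the learned projections, GRU update, or residual VQ of the actual Slot-MLLM tokenizer:

```python
import numpy as np

def slot_attention(inputs: np.ndarray, n_slots: int = 4,
                   n_iters: int = 3, seed: int = 0) -> np.ndarray:
    """Minimal Slot Attention sketch: slots compete for input features
    via a softmax over slots, then each slot is updated to the
    attention-weighted mean of the inputs it claimed."""
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    slots = rng.standard_normal((n_slots, d))
    for _ in range(n_iters):
        logits = inputs @ slots.T / np.sqrt(d)   # (n, n_slots)
        # softmax over slots: each input feature is claimed competitively
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        # normalise per slot, then take the weighted mean of the inputs
        w = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = w.T @ inputs
    return slots                                 # (n_slots, d)
```

The softmax over the *slot* axis (rather than the input axis) is what makes slots compete for regions instead of all attending to the same salient area.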

c) Adaptive and Content-Aware Tokenization

Approaches such as Soft Tail-dropping Adaptive Tokenizer (STAT) (Chen et al., 20 Jan 2026) and Differentiable Hierarchical Visual Tokenization (Aasan et al., 4 Nov 2025) introduce mechanisms to adapt token sequence length and granularity to image complexity:

  • STAT predicts a per-token "keep probability," dropping less informative tokens to yield variable-length sequences aligned with structural detail.
  • Hierarchical tokenization creates a superpixel hierarchy, using differentiable model selection (e.g., Akaike or Bayesian Information Criterion) to adaptively select partitions.
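The tail-dropping idea can be illustrated with a hypothetical score-based sketch (STAT's real keep-probabilities are learned end to end, not computed like this):

```python
import numpy as np

def tail_drop(tokens: np.ndarray, scores: np.ndarray,
              budget: float = 0.9) -> np.ndarray:
    """STAT-style sketch: keep the highest-scoring tokens whose
    cumulative (softmax-normalised) importance reaches `budget`,
    dropping the uninformative tail. The resulting sequence length
    varies with how concentrated the importance mass is."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    order = np.argsort(-p)                # most important first
    keep = np.cumsum(p[order]) <= budget
    keep[0] = True                        # always keep at least one token
    return tokens[order[keep]]
```

A flat score distribution keeps most tokens (a complex image); a peaked one collapses to a handful (a simple image) — the variable-length behaviour that improves autoregressive scaling.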

d) Codebook Factorization and Lattice-Based Quantization

Scalability and semantic diversity in tokens are addressed via:

  • Factorized Quantization (FQ) (Bai et al., 2024): Decomposes the codebook into multiple independent sub-codebooks, combined with disentanglement and semantic richness regularization, thereby achieving a much larger effective vocabulary with reduced redundancy.
  • Spherical Leech Quantization (Zhao et al., 16 Dec 2025): Employs dense, high-dimensional lattice codes (e.g., the Leech lattice Λ24) for lookup-free, maximally dispersive discrete tokenization, providing improved rate–distortion tradeoffs.
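A minimal sketch of the factorization idea (illustrative, not the FQ paper's architecture): quantizing chunks of the latent against independent sub-codebooks multiplies the effective vocabulary while storing only the sub-codebooks:

```python
import numpy as np

def factorized_quantize(z: np.ndarray, books: list) -> tuple:
    """Factorized VQ sketch: split the latent vector into one chunk per
    sub-codebook, nearest-neighbour quantize each chunk independently,
    and return the tuple of sub-codes plus the reconstruction. With m
    books of K entries each, the effective vocabulary is K**m at the
    storage cost of only m*K code vectors."""
    chunks = np.split(z, len(books))
    codes, recon = [], []
    for chunk, book in zip(chunks, books):
        idx = int(np.argmin(np.linalg.norm(book - chunk, axis=1)))
        codes.append(idx)
        recon.append(book[idx])
    return tuple(codes), np.concatenate(recon)
```

Two sub-codebooks of 256 entries already give an effective vocabulary of 256² = 65,536 composite tokens; the disentanglement and semantic regularizers of the actual FQ method are omitted here.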

e) Dualistic and Hybrid Encodings

CDD-VT (Chen et al., 3 Nov 2025) proposes a continuous–discrete dualistic tokenizer, adapting the number of codebook atoms per patch according to local complexity—interpolating between "particle"-like single-token assignment and "wave"-like soft combinations, dynamically fusing continuous and discrete representations.
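One way to mimic this particle/wave interpolation is greedy residual atom selection with an early stop. The sketch below (thresholds and names are illustrative, not CDD-VT's actual mechanism) assigns a single hard code to simple patches and a soft combination of several atoms to complex ones:

```python
import numpy as np

def dualistic_encode(patch_vecs: np.ndarray, codebook: np.ndarray,
                     max_atoms: int = 4, tol: float = 0.1) -> list:
    """Complexity-adaptive quantization sketch: greedily add codebook
    atoms until the residual norm drops below tol * ||x||. Simple
    patches stop after one atom ('particle'); complex patches use up to
    `max_atoms` ('wave')."""
    out = []
    for x in patch_vecs:
        residual, atoms = x.copy(), []
        for _ in range(max_atoms):
            idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
            atoms.append(idx)
            residual = residual - codebook[idx]
            if np.linalg.norm(residual) < tol * np.linalg.norm(x):
                break
        out.append(atoms)
    return out            # variable number of atoms per patch
```

The per-patch atom count is itself a crude complexity signal: patches well explained by one atom terminate early, while high-entropy patches consume the full budget.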

f) Native Resolution Encoding

NativeRes-LLaVA (Niu et al., 15 Jun 2025) dispenses with early resizing and processes images at their native resolution and aspect ratio, with variable-length patch tokenization and rotary 2D positional encoding, maintaining high-fidelity spatial content throughout transformer-based vision–language pipelines.
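The variable-length, aspect-preserving tokenization can be sketched as follows (a simplification that crops any ragged border rather than padding, and returns raw (row, col) positions in place of rotary 2D encoding):

```python
import numpy as np

def native_res_tokens(image: np.ndarray, patch: int = 16):
    """NativeRes-style sketch: tokenize at the image's own resolution
    and aspect ratio (no resize to a fixed square), emitting a
    variable-length token sequence plus an explicit 2-D (row, col)
    position for each token."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    img = image[:rows * patch, :cols * patch]    # crop ragged border
    tiles = img.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    tokens = tiles.reshape(rows * cols, -1)
    pos = [(r, col) for r in range(rows) for col in range(cols)]
    return tokens, pos
```

A 160×96 portrait image and a 96×160 landscape image both yield 60 tokens, but with different position grids — the aspect ratio survives in the positional structure rather than being destroyed by resizing.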

3. Theoretical Underpinnings and Information-Theoretic Perspectives

InfoTok (Tang et al., 2 Feb 2026) reframes visual tokenization in unified MLLMs under the information bottleneck (IB) principle, formalizing the tokenization process as explicitly controlling the mutual information between images, tokens, and downstream multimodal outputs. This framework yields trade-offs between compression (measured by I(Z; I)) and task relevance (I(Z; Y^GT)), with loss terms that regulate compactness, sufficiency, and cross-modal alignment.
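For discrete toy distributions, the two mutual-information terms and the resulting IB-style objective can be computed exactly (function names are illustrative; real tokenizers require variational estimators rather than explicit joint tables):

```python
import numpy as np

def mutual_info(joint: np.ndarray) -> float:
    """I(A; B) in nats from a joint probability table p(a, b)."""
    joint = joint / joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])).sum())

def ib_tokenizer_loss(p_zi: np.ndarray, p_zy: np.ndarray,
                      beta: float = 0.5) -> float:
    """IB-style objective sketch: penalise how much the tokens Z
    memorise about the image I, reward what they retain about the
    target Y. Minimising I(Z;I) - beta * I(Z;Y) trades compression
    against task relevance."""
    return mutual_info(p_zi) - beta * mutual_info(p_zy)
```

A perfectly informative binary code about I has I(Z; I) = log 2 nats; an independent one has I(Z; I) = 0 — the tokenizer is pushed toward the latter for inessential detail and the former for task-relevant structure.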

This perspective motivates:

  • Prioritization of reusable structure (object layout, compositional patterns) over inessential high-entropy details.
  • Explicit control of capacity and adaptivity in the tokenization process under constrained computational budgets.

4. Comparative Performance and Quantitative Analysis

Native visual tokenizers have demonstrated significant empirical advances across multiple metrics and tasks:

a) Token Efficiency and Computational Savings

  • SeeTok achieves ~4.4× token reduction and a 70.5% FLOP decrease compared to traditional text subword tokenizers on TriviaQA, without sacrificing language modeling accuracy (Xing et al., 21 Oct 2025).
  • STAT reduces autoregressive sequence length and enables efficient AR visual generation, yielding better scaling laws than non-adaptive tokenizers (Chen et al., 20 Jan 2026).

b) Task Performance

  • For object-centric models (Slot-MLLM), multimodal understanding and generation metrics improve markedly: GQA (VQA) accuracy +11%, COCO caption CIDEr +4.3 points, segmentation and localized editing precision surpassing patch-token or VQGAN baselines (Chi et al., 23 May 2025).
  • Groma attains +2.5% accuracy on RefCOCO and +12 AR on LVIS-Ground compared with previous localization schemes (Ma et al., 2024).
  • Superpixel (SuiT) tokenization improves ImageNet accuracy by 2–3pp at equal or lower compute relative to fixed-grid DeiT baselines (Lew et al., 2024).

c) Semantic Structure and Compositionality

Object- and region-centric tokenizers yield tokens with reduced concept entanglement and greater interpretability, and their discrete object-level codes support localized editing and grounded, compositional reasoning over scenes (Chi et al., 23 May 2025, Ma et al., 2024).

5. Advantages, Limitations, and Distinctive Strengths

Advantages

  • Semantic interpretability: tokens align with objects, regions, or linguistic units rather than arbitrary grid patches.
  • Efficiency: adaptive, content-aware schemes shorten sequences and cut FLOPs without sacrificing accuracy.
  • Expressivity: factorized and lattice-based codebooks enlarge the effective vocabulary without proportional storage cost.

Limitations

  • Style dependency in text rendering: For vision-centric text reading, font, size, or DPI variability can impact encoder sensitivity (Xing et al., 21 Oct 2025).
  • Knowledge transfer: Vision encoders pretrained on images may not possess adequate world-knowledge for pure-text, knowledge-intensive reasoning (Xing et al., 21 Oct 2025).
  • Scalability and memory: Variable-length or adaptive tokenization may require careful engineering for batching, positional encoding, and attention masking (Niu et al., 15 Jun 2025, Chen et al., 20 Jan 2026).

6. Application Domains and Broader Implications

Native visual tokenization directly impacts:

  • Vision–language understanding, including VQA, captioning, and region-level grounding.
  • Autoregressive visual generation and localized editing.
  • Multimodal transfer learning toward unified, robust models.

The RC-Bench (Niu et al., 15 Jun 2025) systematically evaluates resolution- and aspect-ratio robustness, demonstrating the necessity of native-resolution tokenization for high-stakes, open-world deployment.

7. Future Directions

Ongoing research emphasizes:

  • Joint vision-language pretraining: Co-training vision encoders and LLMs from scratch on visual tokens and natural images/texts (Xing et al., 21 Oct 2025).
  • Adaptive granularity and learnable region proposals: Developing architectures for on-the-fly, learnable token boundaries and hierarchical region selection (Lew et al., 2024, Aasan et al., 4 Nov 2025).
  • End-to-end differentiable optimization: Fully integrating tokenization, compression, and generation in a single learning loop, including vectorization and raster-to-vector conversion (Aasan et al., 4 Nov 2025).
  • Extension to new modalities: Applying native tokenization concepts to speech (visualized spectra), video (supervoxels), or 3D data (Xing et al., 21 Oct 2025, Niu et al., 15 Jun 2025).
  • Information-theoretic advances: Stricter realization of mutual information and rate–distortion regularization within adaptive visual tokenizers (Tang et al., 2 Feb 2026).

Native visual tokenization represents a shift toward more semantically aligned, efficient, and human-like representations within deep multimodal architectures, enabling robust, unified reasoning and generation across the language–vision spectrum.
