Unified Multimodal Token Approaches
- Unified multimodal token approaches are frameworks that convert diverse modalities into a common token-sequence format, enabling shared processing and next-token prediction.
- They employ discrete, continuous, and hybrid tokenization strategies with advanced codebooks to optimize tasks like image captioning, editing, and visual question answering.
- These approaches enhance model efficiency and performance by unifying various data sources and reducing modality-specific complexity through a single, sequence-based paradigm.
Unified multimodal token approaches denote a class of architectures, tokenization strategies, and training regimes that cast different modalities—such as text, images, audio, video, 3D, or motion—into a shared or harmonized token-based space for both multimodal understanding and generation. These frameworks leverage a unified token vocabulary, token codebook, or token interface, enabling large models (e.g., Transformers or MLLMs) to reason over, synthesize, and manipulate heterogeneous modalities within a single, sequence-based computational paradigm. The unification is realized at the modeling level (one shared backbone), at the tokenization level (a joint token vocabulary or compatible codebooks), or both, and it has emerged as a dominant paradigm for scalable, data-driven multimodal artificial intelligence.
1. Core Concepts and Taxonomy
Unified multimodal token approaches are characterized by the transformation of various input signals, whether naturally discrete (text, DNA) or continuous (images, audio, video, motion trajectories), into token sequences that can be jointly processed in a single backbone—typically a transformer—under a next-token prediction or generative diffusion objective (Chen et al., 2024, Jiao et al., 6 Apr 2025). The taxonomy includes:
- Discrete tokenization: Vector quantization (VQ) or codebook-based quantization of images, video, and audio to produce discrete token IDs compatible with textual tokens (Swerdlow et al., 26 Mar 2025, Zheng et al., 2024).
- Continuous tokenization: Retention of continuous embeddings (ViT patch features, acoustic frames) that are fed directly into the transformer, sometimes through adapters for alignment (Jiao et al., 6 Apr 2025, He et al., 14 Oct 2025).
- Hybrid codes: Joint encoding of both discrete (for synthesis/editing) and continuous (for high-level semantics or conditioning) tokens (Jiao et al., 6 Apr 2025, Chen et al., 9 Mar 2025).
- Unified or harmonized codebook: A codebook or token vocabulary shared among all modalities, or structurally aligned (e.g., with hierarchical guidance) to allow one token space (Zheng et al., 2024, Chen et al., 9 Mar 2025).
Task paradigms include:
- Unified next-token prediction: All multimodal tasks (understanding, generation, image captioning, VQA, segmentation, text-to-image, editing) are cast as sequence prediction, enabling shared modeling for understanding and synthesis (Chen et al., 2024, Chen et al., 7 Nov 2025).
- Unified inpainting and editing: Models can inpaint or edit tokens in any modality by masking and sequence restoration (Swerdlow et al., 26 Mar 2025, Xu et al., 7 Jan 2026).
- Joint multimodal communication: In transmission or collaboration settings, tokens serve as the primary communicative entity across devices (see "token communication") (Wei et al., 2 Jul 2025, Zhang et al., 6 May 2025).
2. Tokenization Strategies and Architectures
2.1 Discrete Codebook and Hybrid Tokenizers
Discrete Tokenizers: Standard approaches use VQ-VAE, VQ-GAN, or derived codebooks (e.g., lookup-free quantization, semantic-guided hierarchical codebooks) to convert images, video, or motion into discrete codes. Each token is an index into a learned codebook, and the transformer operates on a vocabulary that unifies these codes with text (Chen et al., 9 Mar 2025, Zheng et al., 2024). Notably, approaches such as SemHiTok use a two-level codebook: a global semantic codebook and per-semantics pixel sub-codebooks, decoupling high-level understanding from low-level reconstruction and enabling better trade-offs across tasks (Chen et al., 9 Mar 2025).
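As a concrete illustration, the following is a minimal sketch of VQ-style discrete tokenization with nearest-codebook lookup, the common core of the tokenizers above. The codebook size (8192), feature dimension (64), and text-vocabulary offset are illustrative assumptions, not values from the cited works.

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous features (N, D) to discrete token IDs via the nearest codebook entry."""
    dists = torch.cdist(features, codebook)  # pairwise L2 distances, shape (N, K)
    return dists.argmin(dim=-1)              # integer token IDs, shape (N,)

codebook = torch.randn(8192, 64)        # K codebook entries of dimension D (learned in practice)
patch_features = torch.randn(256, 64)   # encoder output for one image's patches
visual_ids = quantize(patch_features, codebook)

# Offsetting visual IDs past the text vocabulary places both in one shared token space.
TEXT_VOCAB_SIZE = 32_000                # hypothetical text vocabulary size
unified_ids = visual_ids + TEXT_VOCAB_SIZE
```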
Hybrid and Stacked Tokenization: UniToken and similar frameworks encode both discrete (for pixel/patch fidelity) and continuous (for semantics) representations, concatenating both into the unified sequence. This separation allows selective attention: continuous tokens drive understanding, discrete tokens govern generation or reconstruction (Jiao et al., 6 Apr 2025).
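A rough sketch of the hybrid layout follows, assuming UniToken-style concatenation of a continuous (semantic) stream and a discrete (code) stream into one input sequence; the adapter, dimensions, and vocabulary size are assumptions for illustration.

```python
import torch
import torch.nn as nn

d_model = 512
embed = nn.Embedding(40_000, d_model)   # unified discrete vocab (text + visual codes)
proj = nn.Linear(768, d_model)          # adapter aligning continuous ViT patch features

discrete_ids = torch.randint(0, 40_000, (1, 256))   # VQ codes driving generation
continuous_feats = torch.randn(1, 196, 768)          # ViT features driving understanding

# Both streams are mapped to the same model width and concatenated into one
# sequence that a single transformer backbone consumes.
sequence = torch.cat([proj(continuous_feats), embed(discrete_ids)], dim=1)
print(sequence.shape)  # torch.Size([1, 452, 512])
```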
Unified/Harmonized Codebooks: UniCode and related designs harmonize text and visual modalities by building a single embedding table jointly trained or iteratively synchronized between text and vision tasks, removing the need for modality-specific heads or adapters (Zheng et al., 2024). Stacked quantization schemes (e.g., hierarchical, residual quantization) further compress and unify the visual token stream (Zheng et al., 2024).
Byte-Pair Visual Encoding and Sub-Tokenization: Applying byte-pair encoding (BPE) to visual tokens (as in "Unified Multimodal Understanding via Byte-Pair Visual Encoding") constructs a visually structured vocabulary, supporting scalability, token efficiency, and transformer-compatibility akin to LLMs (Zhang et al., 30 Jun 2025).
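The sketch below shows one standard BPE merge step applied to visual token IDs. It follows the generic most-frequent-pair heuristic; the exact merge criterion of the cited work may differ, so this is only an assumed approximation.

```python
from collections import Counter

def bpe_merge_step(seqs, next_id):
    """Merge the most frequent adjacent pair of visual token IDs into a new composite token."""
    pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
    if not pairs:
        return seqs, None
    best, _ = pairs.most_common(1)[0]
    merged = []
    for s in seqs:
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) == best:
                out.append(next_id)   # new composite visual token
                i += 2
            else:
                out.append(s[i])
                i += 1
        merged.append(out)
    return merged, best

# Toy corpus of visual token sequences; ID 1000 is the first free composite slot.
corpus = [[3, 7, 7, 3, 7], [3, 7, 2]]
corpus, pair = bpe_merge_step(corpus, 1000)
print(pair, corpus)   # (3, 7) [[1000, 7, 1000], [1000, 2]]
```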
2.2 Proxy Tokens and Robustness
Cross-Modal Proxy Tokens (CMPTs): Proxy tokens are learned vectors that stand in for a missing modality's class token; they are synthesized on the fly by attending to the available tokens and are jointly trained to approximate the missing class token. This makes the architecture robust to missing modalities at inference time without requiring imputation networks (Reza et al., 29 Jan 2025).
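A minimal sketch of the idea, assuming a single learned query and one attention layer; the published design may differ in depth, heads, and training losses.

```python
import torch
import torch.nn as nn

class ProxyToken(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # learned proxy query
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, available_tokens: torch.Tensor) -> torch.Tensor:
        """Synthesize a stand-in class token by attending over available-modality tokens."""
        q = self.query.expand(available_tokens.size(0), -1, -1)
        proxy, _ = self.attn(q, available_tokens, available_tokens)
        return proxy.squeeze(1)  # (B, d_model), used in place of the missing CLS token

# At inference, e.g., with audio missing: attend over the visual token sequence.
proxy = ProxyToken()
visual_tokens = torch.randn(4, 196, 512)
missing_audio_cls = proxy(visual_tokens)  # trained elsewhere to mimic the audio class token
```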
2.3 Universal Token Backbones
Full-Backbone Unification and 4D Tokenization: Architectures such as AToken (Lu et al., 17 Sep 2025) and Meta-Transformer (Zhang et al., 2023) tokenize images, video, and 3D into a unified spatial-temporal (or higher-dimensional) latent token space, enabling a pure transformer to process any visual or multimodal input without task-specific branches.
Token Communication Paradigms: Approaches like UniToCom and UniMIC define the entire inter-device/model communication protocol via tokens—using generative information bottleneck objectives for efficient, modality-agnostic token learning and causal transformer decoding for all tasks (Wei et al., 2 Jul 2025, Mao et al., 26 Sep 2025, Zhang et al., 6 May 2025).
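To make the token-communication idea concrete, here is a minimal sketch assuming a codebook shared by transmitter and receiver; the channel is idealized (error-free), and the GenIB objective of the cited systems is not modeled.

```python
import math
import torch

codebook = torch.randn(8192, 64)   # shared by transmitter and receiver

def transmit(features: torch.Tensor) -> torch.Tensor:
    """Sender: quantize features to token IDs; only the integer IDs cross the channel."""
    return torch.cdist(features, codebook).argmin(dim=-1)

def receive(ids: torch.Tensor) -> torch.Tensor:
    """Receiver: look the IDs back up in the shared codebook."""
    return codebook[ids]

feats = torch.randn(256, 64)
ids = transmit(feats)
recovered = receive(ids)

# Payload is ceil(log2(K)) bits per token instead of D floats per feature.
bits = ids.numel() * math.ceil(math.log2(codebook.size(0)))
print(f"payload: {bits} bits in place of {feats.numel()} floats")
```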
3. Unified Modeling and Next-Token Paradigms
The core modeling advancement is the adoption of unified next-token prediction (NTP) or autoregressive sequence generation for all modalities and tasks (Chen et al., 2024, Chen et al., 7 Nov 2025). Key characteristics:
- Shared decoding: A single transformer stack predicts the next token across modalities, with outputs parameterized over a unified vocabulary or a token-type-aware linear head. This enables seamless interleaving (e.g., text, image, audio) (Jiao et al., 6 Apr 2025, Su et al., 2 Oct 2025); a minimal sketch of this objective appears after the list.
- Cross-modal sequence construction: All tokens—textual, visual, auditory—are concatenated (potentially with delimiters or explicit modality tags) in the input stream; downstream loss functions (classification, segmentation, captioning) map to next-token loss frameworks (Chen et al., 7 Nov 2025).
- Discretized diffusion and absorbing models: In place of strictly causal AR generation, some frameworks apply discrete diffusion (masking) over the entire multimodal token sequence, unifying bidirectional context and global generative capabilities (Swerdlow et al., 26 Mar 2025, Xu et al., 7 Jan 2026).
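As referenced above, the following is a minimal sketch of unified next-token prediction over an interleaved multimodal sequence; the vocabulary split, dimensions, and the trivial stand-in for a transformer decoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, VISUAL_VOCAB = 32_000, 8_192
vocab_size = TEXT_VOCAB + VISUAL_VOCAB       # one unified token space

model = nn.Sequential(                       # stand-in for a causal transformer decoder
    nn.Embedding(vocab_size, 256),
    nn.Linear(256, vocab_size),
)

# Interleaved stream: text tokens, then visual codes offset into the shared space.
text = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(0, VISUAL_VOCAB, (1, 64)) + TEXT_VOCAB
seq = torch.cat([text, image], dim=1)

# Standard next-token objective: predict position t+1 from positions up to t,
# with a single cross-entropy over the unified vocabulary for every modality.
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), seq[:, 1:].reshape(-1)
)
loss.backward()
```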
Some systems augment the next-token objective with specialized training innovations, such as:
- Next-k token prediction (NkTP) and focal-weighted cross-entropy (for sequence error correction) (Chen et al., 7 Nov 2025); the focal weighting is sketched after this list.
- Token-level contrastive learning and hard error token tracking (Chen et al., 7 Nov 2025).
- Robust per-token cross-entropy for dynamic masking and generalization (Su et al., 2 Oct 2025).
- Variable-rate masking and stochastic mixed-modal transport for modality alignment (Xu et al., 7 Jan 2026).
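The focal weighting referenced in the first item can be sketched as follows, assuming the standard focal-loss form rather than the exact variant of the cited work; gamma and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def focal_token_ce(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0):
    """Down-weight easy tokens so training focuses on hard (error-prone) ones."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-token cross-entropy
    p_correct = torch.exp(-ce)                               # probability of the true token
    return ((1.0 - p_correct) ** gamma * ce).mean()

logits = torch.randn(128, 40_192)             # (tokens, unified vocabulary)
targets = torch.randint(0, 40_192, (128,))
print(focal_token_ce(logits, targets))
```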
4. Practical Impact and Empirical Results
Unified multimodal token approaches have demonstrated state-of-the-art or competitive performance in:
- Vision-language understanding and VQA: Models such as FLUID (Cuong et al., 10 Aug 2025), PaDT (Su et al., 2 Oct 2025), and EMMA (He et al., 4 Dec 2025) achieve >90% accuracy on large-scale datasets (GLAMI-1M, RefCOCO, MMMU), and outperform prompt- or attention-fusion methods in missing/modality-noise scenarios (Reza et al., 29 Jan 2025).
- Generation and editing tasks: Approaches supporting joint text/image generation and editing (e.g., PaDT, UniDisc, CoM-DAD) report strong FID, CLIP, and CIDEr scores, and enable non-autoregressive or parallel generation with speed and controllability benefits (Swerdlow et al., 26 Mar 2025, Su et al., 2 Oct 2025, Xu et al., 7 Jan 2026).
- Robust communication and compression: Token-based interactive frameworks achieve ultra-low bitrate transmission with no loss in downstream VQA or T2I quality (e.g., UniMIC: 0.0296 bpp, FID=80.61, POPE Acc=0.7710) (Mao et al., 26 Sep 2025), and token communication-based approaches deliver up to 13.7% accuracy gain under SNR constraints (Zhang et al., 6 May 2025).
- Cross-modal retrieval, segmentation, and motion synthesis: Unified token models set records on retrieval (joint retrieval=0.64 (Swerdlow et al., 26 Mar 2025)), image segmentation (Dice=91.10% (Chen et al., 7 Nov 2025)), and multi-part motion tasks (R-Precision, FID, ID-consistency (Zhou et al., 2023)).
- Model and compute efficiency: High compression factors (EMMA: 32×), channelwise fusion, and task-aware pruning yield 15–40% reductions in FLOPs or memory at fixed performance, with strong scaling properties as the token vocabulary grows (He et al., 4 Dec 2025, Mao et al., 10 Feb 2025, Zhang et al., 30 Jun 2025).
5. Technical Innovations and Design Patterns
Major design themes across successful unified token approaches include:
- Hierarchical or hybrid codebook design: Semantic-prioritized codebooks with per-semantic pixel sub-codebooks (SemHiTok, AToken, EMMA) enable simultaneous semantic fidelity and reconstruction (Chen et al., 9 Mar 2025, Lu et al., 17 Sep 2025, He et al., 4 Dec 2025).
- Task-aware dynamic proxies & routers: Cross-modal proxy tokens (CMPT), mixture-of-depths pruning (UniMoD), and dynamic token interleaving adaptively select or generate token representations depending on task requirements (Reza et al., 29 Jan 2025, Mao et al., 10 Feb 2025).
- Unified, expandable token vocabulary: Dynamic embedding tables (PaDT) and token expansion via BPE or scenario-driven merges support scalable, efficient addition of modalities or tasks (Su et al., 2 Oct 2025, Zhang et al., 30 Jun 2025).
- Shared-vs-decoupled model stacks: Shared shallow transformer layers across understanding/generation, followed by task-specific deep heads (EMMA), promote multi-task transfer while reducing harmful interference (He et al., 4 Dec 2025).
- Diffusion-based and absorbing unified modeling: Joint continuous (semantic) and discrete (token) absorbing diffusion (CoM-DAD) or discrete diffusion over a single code-based vocabulary (UniDisc) offer new trade-offs in controllability, global context integration, and inference efficiency (Xu et al., 7 Jan 2026, Swerdlow et al., 26 Mar 2025); a masking-step sketch follows this list.
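As noted in the last item, here is a minimal sketch of one absorbing (mask-based) discrete diffusion training step over a multimodal token sequence; it follows the generic masked-token-prediction recipe rather than any one cited formulation, and the denoiser is a trivial stand-in.

```python
import torch
import torch.nn as nn

vocab_size, MASK = 40_193, 40_192            # last ID reserved as the [MASK] state
denoiser = nn.Sequential(nn.Embedding(vocab_size, 256), nn.Linear(256, vocab_size))

seq = torch.randint(0, 40_192, (1, 80))      # interleaved text + visual tokens

# Forward process: absorb a random fraction of tokens into the MASK state.
t = torch.empty(1).uniform_(0.1, 0.9)        # noise level, drawn per training step
mask = torch.rand(seq.shape) < t
corrupted = torch.where(mask, torch.full_like(seq, MASK), seq)

# Reverse-process training: predict the originals only at masked positions,
# using bidirectional context (no causal restriction).
logits = denoiser(corrupted)
loss = nn.functional.cross_entropy(logits[mask], seq[mask])
loss.backward()
```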
6. Open Challenges and Future Directions
The literature identifies several unresolved or emerging issues:
- Scaling laws: Although data and model size scaling appears beneficial, exact scaling exponents for different modalities and tasks within unified token frameworks remain open (Chen et al., 2024).
- Modality interference and negative transfer: Simultaneous optimization for substantially different modalities or tasks leads to cross-modal interference; more robust fusion, dynamic token weighting, or masking may be required (Chen et al., 2024, He et al., 4 Dec 2025).
- Token efficiency and sequence pruning: Redundant tokens—especially from visual or audio modalities—result in excessive compute; adaptive fusion and pruning mechanisms (e.g., token merging, MoD) are promising directions (Mao et al., 10 Feb 2025, He et al., 4 Dec 2025).
- Extending unification to additional modalities: Application to 3D, robotics actions, protein sequences, and haptics requires new tokenization strategies and hierarchical codebook designs (Chen et al., 2024, Lu et al., 17 Sep 2025).
- Joint communication/AI optimization: Token-based interactive protocols for distributed/federated or wireless collaborative models remain a growing area, especially as compression and error resilience requirements intensify (Wei et al., 2 Jul 2025, Mao et al., 26 Sep 2025, Zhang et al., 6 May 2025).
- Non-autoregressive and generative interfaces: Integrating NTP with bidirectional or diffusion-based generative processes (e.g., MaskGIT, coupled SDEs) may further accelerate inference and improve sample coherence (Xu et al., 7 Jan 2026, Swerdlow et al., 26 Mar 2025).
7. Comparative Summary of Key Approaches
| Approach | Tokenization | Codebook/Interface | Modeling | Unique Features |
|---|---|---|---|---|
| SemHiTok (Chen et al., 9 Mar 2025) | Discrete | Semantic-guided hierarchical codebook | AR Transformer | Decoupled semantic/pixel |
| UniToken (Jiao et al., 6 Apr 2025) | Discrete + Cont | VQ + continuous ViT | AR Transformer | Hybrid understanding/generation |
| PaDT (Su et al., 2 Oct 2025) | Discrete | Dynamic per-image patch tokens | AR LLM + head | Per-image VRTs, interleaved output |
| CMPT (Reza et al., 29 Jan 2025) | Proxy tokens | Task-aligned, learnable proxies | Fusion net | Robustness to missing modalities |
| CoM-DAD (Xu et al., 7 Jan 2026) | Discrete+Cont | Absorbing diffusion, coupled SDE | Diffusion | Joint continuous planning, discrete denoise |
| UniDisc (Swerdlow et al., 26 Mar 2025) | Discrete | Lookup-free quantization, joint vocab | Discrete diffusion | Inpainting, NAR, AR, editing |
| EMMA (He et al., 4 Dec 2025) | Discrete | 32× token compression, channel fusion | Shared + task decoupled | MoE encoding, efficient token count |
| Meta-Transformer (Zhang et al., 2023) | Discrete/Cont | Linear projection/tokenizers | Frozen ViT | Supports 12 modalities, no paired data |
| UniToCom (Wei et al., 2 Jul 2025) | Discrete/Cont | GenIB bottleneck | Causal Transformer | Unified comm/proc/GenIB σ-stabilized |
This table captures essential contrasts: from hierarchical and hybrid tokenization, to robustness strategies (proxies), to continuous-discrete blending, to explicit communication-driven frameworks. Empirical results routinely demonstrate high accuracy, generation quality, and efficiency across these designs.
These advances in unified multimodal token approaches indicate the emergence of a robust, generalizable, and efficient paradigm for next-generation multimodal AI systems, supporting scalable understanding, generation, and inter-device/model communication across arbitrary modalities.