- The paper’s main contribution is a novel end-to-end method that jointly tunes vision tokenizers and autoregressive LLMs via continuous codebook embeddings.
- It employs a lightweight projector and a three-stage training pipeline, balancing caption and VQ reconstruction losses to boost performance by 2-6% on benchmarks.
- ETT preserves image reconstruction quality while enhancing text-to-image generation and semantic understanding, achieving competitive results against larger multimodal models.
The paper "End-to-End Vision Tokenizer Tuning" (2505.10562) addresses a key limitation in current autoregressive multimodal models: the decoupled training of vision tokenizers (optimized for low-level image reconstruction) and the downstream LLMs which use the tokenizer's discrete outputs. This decoupling creates a bottleneck, as the tokenizer's reconstruction-focused loss may not yield representations optimal for tasks requiring higher-level semantic understanding or generation.
The core contribution is End-to-End Vision Tokenizer Tuning (ETT), a practical approach that allows the vision tokenizer and the downstream autoregressive LLM to be jointly optimized. Unlike previous methods that only use the discrete indices from a frozen vision tokenizer, ETT leverages the continuous codebook embeddings of the tokenizer.
Here's a breakdown of the ETT approach and its implementation details:
- Leveraging Codebook Embeddings: Instead of passing discrete token indices from the vision tokenizer to the LLM, ETT uses the continuous latent embeddings from the tokenizer's codebook lookup. This makes the entire pipeline, from the vision encoder through the quantizer's codebook lookup to the LLM, differentiable.
- Projector Module: A simple, lightweight Multilayer Perceptron (MLP) with a GeLU activation acts as a projector. It maps the dimension of the vision tokenizer's codebook embeddings (D) to the hidden dimension size of the pre-trained LLM (C). The projected visual embeddings (xI) and the text token embeddings (xT) from the LLM's text embedding layer are then fed into the LLM.
- Joint Optimization Objective: ETT optimizes the vision tokenizer (encoder, decoder, and codebook) and the LLM jointly. The loss function combines two objectives:
- Caption Loss (L_cap): A standard cross-entropy loss applied to the LLM's output for generating text tokens given the interleaved visual and text inputs. This encourages the visual tokens to carry semantic information relevant to the text.
- VQ Reconstruction Loss (L_vq): The original loss used for training the vision tokenizer, typically including pixel reconstruction, quantization, perceptual, adversarial, and entropy losses. This ensures the tokenizer maintains its ability to reconstruct images accurately.
The combined objective is L = L_cap + α · L_vq, where α is a tunable weight balancing the two tasks. The paper finds α = 0.25 to be a good balance.
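The differentiable path described above can be sketched in a few lines of PyTorch. All module and variable names here are illustrative (not from the paper's code), the toy sizes are far smaller than the paper's 131,072 × 256 codebook, and the straight-through trick is an assumption — one common way to pass gradients through the non-differentiable nearest-code argmin, which the summary does not spell out:

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Lightweight MLP with GeLU: codebook dimension D -> LLM hidden size C."""
    def __init__(self, d_code: int, d_llm: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_code, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def lookup_continuous(z_e: torch.Tensor, codebook: torch.Tensor):
    """Nearest-code lookup that returns the continuous codebook embeddings.
    The straight-through estimator below is an assumption, not the paper's
    stated mechanism."""
    # Squared distances (B, N, K) without materializing a (B, N, K, D) tensor.
    dists = (z_e.pow(2).sum(-1, keepdim=True)
             - 2.0 * z_e @ codebook.t()
             + codebook.pow(2).sum(-1))
    idx = dists.argmin(dim=-1)            # (B, N) discrete indices
    z_q = codebook[idx]                   # (B, N, D) continuous embeddings
    return z_e + (z_q - z_e).detach(), idx

# Toy sizes; the paper's setup uses 131,072 codes of dimension 256.
B, N, D, K, C = 2, 16, 32, 64, 48
codebook = nn.Parameter(torch.randn(K, D))
proj = Projector(D, C)

z_e = torch.randn(B, N, D, requires_grad=True)   # vision encoder output
z_q, idx = lookup_continuous(z_e, codebook)
x_vis = proj(z_q)                                # projected tokens fed to the LLM

# Combined objective L = L_cap + alpha * L_vq with alpha = 0.25.
# Both losses below are stand-ins for the real caption / VQ losses.
l_cap = x_vis.pow(2).mean()
l_vq = ((z_e.detach() - codebook[idx]).pow(2).mean()
        + (z_e - codebook[idx].detach()).pow(2).mean())
loss = l_cap + 0.25 * l_vq
loss.backward()   # gradients reach the codebook, the projector, and z_e
```

The key point the sketch illustrates: because `z_q` is the continuous codebook embedding (not a discrete index), the caption loss computed downstream can back-propagate into the tokenizer side of the pipeline.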
- Three-Stage Training Pipeline: ETT is integrated into a sequential training process for multimodal understanding and generation:
- Stage 1: Alignment Learning: Only the visual projector is trained to align the frozen vision tokenizer embeddings with the frozen LLM's input space using image-to-text captioning data.
- Stage 2: Semantic Learning (ETT): The key stage, where the LLM, projector, and vision tokenizer are unfrozen and jointly trained using the combined L = L_cap + α · L_vq objective. This process adapts the tokenizer's representations based on downstream task feedback while preserving reconstruction.
- Stage 3: Post-Training: The vision tokenizer is frozen again. The projector and LLM are further fine-tuned on diverse instruction-following (for chat) or text-to-image generation tasks to specialize the model.
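The freeze/unfreeze schedule across the three stages amounts to toggling which parameter groups receive gradients. A minimal sketch, with hypothetical module names standing in for the real components:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad_(flag)

def configure_stage(stage: int, tokenizer: nn.Module,
                    projector: nn.Module, llm: nn.Module) -> None:
    """Stage 1: projector only (alignment).
    Stage 2: tokenizer + projector + LLM (joint ETT tuning).
    Stage 3: projector + LLM, tokenizer frozen again (post-training)."""
    set_trainable(projector, True)        # trained in every stage
    set_trainable(tokenizer, stage == 2)  # unfrozen only for ETT
    set_trainable(llm, stage in (2, 3))   # frozen during alignment

# Tiny stand-in modules just to demonstrate the toggling.
tokenizer, projector, llm = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)
configure_stage(1, tokenizer, projector, llm)
# At this point only the projector's parameters are trainable.
```

In practice one would also rebuild the optimizer (or filter its parameter groups) at each stage boundary so that frozen parameters are excluded from weight decay and state tracking.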
Practical Implementation Details:
- Base Vision Tokenizer: The authors use a VQ-based tokenizer, specifically referencing IBQ (Shi et al., 2024), which supports large codebook sizes and code dimensions (131,072 codes, 256 dimensions in their setup).
- LLM: Qwen2.5-1.5B is used as the base LLM for the experiments, demonstrating effectiveness even with a relatively small LLM size.
- Data: Training utilizes a large multimodal dataset (SOL-recap, 32M image-text pairs) for Stages 1 and 2, supplemented by diverse instruction-following and generation datasets for Stage 3 (LLaVA-OneVision (Li et al., 2024), Infinity-MM (Gu et al., 2024), AI-generated, and filtered web data).
- Optimization: Standard Adam optimizer is used across stages with cosine decay learning rate scheduling and a warm-up phase.
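The warm-up plus cosine-decay schedule mentioned above can be written as a small step-dependent function; the step counts and peak learning rate below are illustrative assumptions, not values reported by the paper:

```python
import math

def lr_at(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warm-up to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

# Example schedule: 100 warm-up steps, 1,000 total steps, peak 1e-4.
schedule = [lr_at(s, 1e-4, 100, 1000) for s in range(1000)]
```

The same shape can be obtained with `torch.optim.lr_scheduler.LambdaLR` by passing this function (divided by `peak_lr`) as the multiplier.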
Performance and Impact:
- Multimodal Understanding: ETT achieves significant performance gains (2-6%) on various understanding benchmarks (like SEED-Bench, TextVQA, MME) compared to baselines using frozen discrete tokens. It demonstrates competitive performance against larger, state-of-the-art continuous encoder-based VLMs, highlighting the efficiency gained by optimizing the tokenizer itself.
- Visual Generation: ETT shows competitive text-to-image generation capabilities on benchmarks like GenEval and T2I-CompBench, outperforming several existing discrete token models and achieving results comparable to some diffusion models, despite using a smaller LLM and less generation-specific data. The end-to-end tuning improves the generative quality of the visual tokens.
- Reconstruction Preservation: Crucially, ETT maintains the original image reconstruction quality of the vision tokenizer while improving downstream performance. Ablation studies show that including the VQ reconstruction loss during joint tuning is essential for this, preventing performance degradation compared to tuning solely on the captioning loss. The visualization in the paper demonstrates that ETT can even improve details like text rendering in reconstructions, suggesting that semantic feedback from the LLM can refine low-level reconstruction.
Implementation Considerations:
- Integrating ETT requires access to the internal components (encoder, codebook, decoder) and gradients of the vision tokenizer, necessitating flexibility in the tokenizer's architecture.
- Joint training of the LLM and tokenizer increases the number of trainable parameters and computational cost compared to training only a projector or a smaller adapter module, although the paper uses a relatively small LLM to mitigate this.
- Balancing the L_cap and L_vq objectives via the α weight is critical to achieving good performance on both understanding/generation and reconstruction. This weight may require tuning depending on the specific tokenizer and LLM used.
In essence, ETT provides a practical and effective method to bridge the gap between low-level vision tokenization and high-level multimodal tasks by enabling differentiable, end-to-end training, thereby unlocking performance improvements for both understanding and generation within autoregressive multimodal models.