Query-Guided Multi-View Tokenizer
- The Query-Guided Multi-View Tokenizer is a neural module that conditions token selection on both input signals and queries to generate multiple semantic views.
- It employs specialized architectures such as GRDR and QTSplus, using cross-modal contrastive losses and cross-attention for adaptive multi-view tokenization.
- Empirical results demonstrate notable improvements in retrieval recall and efficiency for text-to-video search and long-video understanding tasks.
A Query-Guided Multi-View Tokenizer is a class of neural module that produces multiple task- or query-adaptive discrete or continuous representations (termed views or token sequences) for each high-dimensional input such as a video. These modules condition token selection or quantization on both the input signal and the downstream text or query, enabling diverse semantic access paths. This alleviates semantic ambiguity, supports efficient retrieval over large corpora, and allows the model to focus computational resources on query-relevant signal. Recent work spans two lines: generative retrieval for text-to-video search and multimodal LLM (MLLM) scaling for long video understanding (Zhao et al., 29 Jan 2026, Li et al., 14 Nov 2025).
1. Core Architecture and Functional Principles
A query-guided multi-view tokenizer ingests an input—typically the output of a vision backbone (such as a Vision Transformer or Vision-LLM)—and produces multiple token sequences or semantic identifiers (IDs) per input, each corresponding to a different possible semantic “view” aligned to distinct query intents. The module’s central innovation is leveraging query supervision or cross-modal information for representation diversification, as opposed to static or data-agnostic tokenization.
GRDR Tokenizer (Text-to-Video Retrieval)
In the GRDR framework for text-to-video retrieval (Zhao et al., 29 Jan 2026), this tokenizer:
- Extracts a global high-level feature from each video.
- Passes it to K independent sub-encoders E_1, …, E_K, producing one latent vector per view: z_k = E_k(g), where g is the global video feature.
- Specializes each sub-encoder for a learned query intent: training captions are grouped into K clusters via k-means, and a cross-modal contrastive loss L_align aligns each z_k with the decoder states associated with its cluster.
- Quantizes each z_k via hierarchical residual quantization into a sequence of discrete semantic-ID tokens.
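The pipeline above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation: the sub-encoders are stand-in linear maps, the codebooks are random rather than learned, and all dimensions (K=3 views, L=2 quantization levels, codebook size 16) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, L, CB = 3, 8, 2, 16   # views, feature dim, RQ levels, codebook size

# Hypothetical stand-ins: sub-encoders as linear maps, random codebooks.
sub_encoders = [rng.normal(size=(D, D)) for _ in range(K)]
codebooks = [rng.normal(size=(CB, D)) for _ in range(L)]  # one codebook per level

def residual_quantize(z, codebooks):
    """Greedy residual quantization: at each level pick the nearest code,
    subtract it, and quantize the remainder at the next level."""
    ids, residual = [], z
    for cb in codebooks:
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))
        ids.append(idx)
        residual = residual - cb[idx]
    return ids

def tokenize(video_feature):
    """Map one global video feature to K semantic-ID sequences (one per view)."""
    return [residual_quantize(video_feature @ W, codebooks) for W in sub_encoders]

g = rng.normal(size=D)   # global video feature from a vision backbone
views = tokenize(g)      # K sequences of L discrete IDs each
```

Each of the K sequences can then serve as an independent retrieval path for the same video.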
QTSplus (Long-Video MLLMs)
In QTSplus (Li et al., 14 Nov 2025), the tokenizer:
- Receives the vision tokens and an embedded text query.
- Uses a cross-attention scorer to compute a per-token relevance score s_i, then predicts a dynamic top-n token budget from the query semantics and video complexity.
- Selects the n most relevant tokens (a differentiable relaxation during training, hard top-n gating at inference) and applies a lightweight re-encoder that preserves temporal and positional structure among the kept tokens.
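A toy version of the scoring-and-budgeting path might look as follows. The softmax scorer, the entropy-driven budget rule, and all dimensions are illustrative stand-ins for the learned cross-attention scorer and MLP budget head, not the QTSplus implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
Nv, Dq = 120, 16                      # vision tokens, embedding dim

V = rng.normal(size=(Nv, Dq))         # vision tokens
q = rng.normal(size=Dq)               # pooled text-query embedding

# 1) Cross-attention relevance: softmax over query-token similarities.
logits = V @ q / np.sqrt(Dq)
scores = np.exp(logits - logits.max())
scores /= scores.sum()

# 2) Dynamic budget: a stand-in for the MLP budget head, driven by the
#    score entropy (peaked scores -> small budget, flat scores -> large).
entropy = -(scores * np.log(scores + 1e-12)).sum()
rho = 0.05 + 0.5 * entropy / np.log(Nv)   # hypothetical retain-ratio rule
n = max(1, int(np.ceil(rho * Nv)))

# 3) Hard top-n gating (inference path), keeping original temporal order.
keep = np.sort(np.argsort(scores)[-n:])
selected = V[keep]
```

The key property is that n varies per (query, video) pair rather than being a fixed hyperparameter.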
2. Mathematical Formulation and Training Objectives
GRDR: Multi-View Quantization and End-to-End Co-Training (Zhao et al., 29 Jan 2026)
- Contrastive Alignment (L_align): encourages each view latent z_k to approach its assigned query-cluster decoder state.
- Residual Quantization: for each view k, z_k is quantized into a sequence of L tokens using per-level codebooks C^(1), …, C^(L). At each level l, the nearest code is selected and subtracted, following the standard recursion r^(0) = z_k, c^(l) = argmin_{c ∈ C^(l)} ||r^(l−1) − c||, r^(l) = r^(l−1) − c^(l).
- RQ Loss (L_RQ): penalizes codebook and quantization error via stop-gradient and reconstruction objectives.
- Hierarchical Consistency (L_HC): a cross-entropy loss ensuring code-assignment stability across quantization layers.
- Unified Objective: a weighted combination of L_align, L_RQ, and L_HC, optimized jointly.
Training proceeds layer-wise, using L_align for cross-modal alignment and initializing codebooks via k-means before joint co-training.
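The layer-wise k-means codebook initialization can be sketched as below: plain Lloyd's iterations fit the level-1 codebook on the latents, then the level-2 codebook on the residuals left after level-1 quantization. Sizes are arbitrary and the routine is an illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means, used only for codebook initialization."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return centers

# Layer-wise init: fit level 1 on the latents z_k, then fit level 2 on
# the residuals remaining after level-1 quantization.
Z = rng.normal(size=(500, 8))       # sub-encoder latents
cb1 = kmeans(Z, 16)
ids1 = np.argmin(((Z[:, None] - cb1[None]) ** 2).sum(-1), axis=1)
residuals = Z - cb1[ids1]
cb2 = kmeans(residuals, 16)
```

Initializing each level on the previous level's residuals gives the joint co-training phase a sensible starting point for every codebook.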
QTSplus: Query-Conditioned Token Pruning and Compute Regularization (Li et al., 14 Nov 2025)
- Cross-Attention Scoring: attention between the query and each vision token yields a per-token relevance score s_i.
- Budget Prediction: a small MLP combines the mean query embedding, the number of vision tokens N_v, the maximum relevance score, and the token-score entropy to predict an expected retain ratio ρ; the token budget is then n ≈ ρ · N_v.
- Differentiable Token Selection (Training): a Gumbel-Softmax relaxation, with its selection threshold solved via Newton's method, enables straight-through gradient flow.
- Loss: supervised fine-tuning loss (distillation from a teacher), plus a compute-aware regularization term that penalizes the predicted retain ratio to discourage unnecessarily large token budgets.
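The threshold-solving step can be made concrete: given scores and a budget n, find the threshold tau such that the soft mask sigmoid((s_i − tau)/temp) sums to n. The sketch below uses Newton iterations with a bisection safeguard (the safeguard is an addition for robustness, not taken from the paper); in training, Gumbel noise and the straight-through estimator would be layered on top of this mask.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_topn_mask(scores, n, temp=0.5, iters=50):
    """Find tau with sum_i sigmoid((s_i - tau)/temp) = n.
    Newton steps on the monotone objective, falling back to bisection
    whenever a step would leave the current bracket."""
    lo, hi = scores.min() - 10 * temp, scores.max() + 10 * temp
    tau = float(np.median(scores))
    for _ in range(iters):
        m = sigmoid((scores - tau) / temp)
        f = m.sum() - n                  # > 0: tau too low (too many kept)
        if abs(f) < 1e-8:
            break
        if f > 0:
            lo = tau
        else:
            hi = tau
        df = -(m * (1 - m)).sum() / temp
        step = tau - f / df
        tau = step if lo < step < hi else 0.5 * (lo + hi)
    return sigmoid((scores - tau) / temp), tau

scores = rng.normal(size=64)
mask, tau = soft_topn_mask(scores, n=10)
```

Because the soft mask is a smooth function of the scores, gradients from the downstream loss can flow back into the cross-attention scorer.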
3. Multi-View Representation: Semantic Paths and Coverage
Both approaches operationalize “multi-view” as the explicit modeling of multiple, specialized access paths for one input instance:
- GRDR: Each video is mapped to one code sequence per view, each capturing a distinct semantic “intent.” This lets a generative retriever match each video under multiple potential query perspectives, mitigating the ambiguity inherent in single-sequence assignments. Views are tied to k-means-derived query clusters, which encode recurring search or caption archetypes in the dataset. At inference, trie-constrained decoding ensures only indexed sequences are considered as candidates.
- QTSplus: Introduces a dual concept of “forest” (global, coarse selection of important regions) and “tree” (local, fine temporal ordering within kept tokens). The cross-attention scorer and budget head yield a query-sensitive global coverage, while the re-encoder preserves local ordering and context needed for second-level localization.
A plausible implication is that multi-view tokenization is broadly applicable beyond video, wherever semantic ambiguity from one-to-many mappings (e.g., multi-intent, multi-context) exists.
4. Inference Algorithms and Integration with Downstream Models
GRDR Inference (Zhao et al., 29 Jan 2026)
- Each video’s code sequences are inserted as paths in a trie.
- For a given text query, the decoder performs trie-constrained, auto-regressive decoding: each partial code sequence must correspond to an existing prefix.
- Beam search yields a ranked set of candidate paths, which are mapped back to videos, deduplicated, and then reranked with a dense retriever for final selection.
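The steps above hinge on a trie over all indexed semantic-ID sequences, plus the prefix-constraint lookup a beam-search decoder would call at each step. A minimal sketch (the corpus and token IDs are invented for illustration):

```python
def build_trie(corpus):
    """corpus: {video_id: [id_sequence, ...]} -> nested-dict trie whose
    '$' entries record which videos own each completed path."""
    trie = {}
    for vid, seqs in corpus.items():
        for seq in seqs:
            node = trie
            for tok in seq:
                node = node.setdefault(tok, {})
            node.setdefault("$", set()).add(vid)
    return trie

def allowed_next(trie, prefix):
    """Tokens the decoder may emit after `prefix` (trie-constrained step);
    the decoder masks all other logits to -inf before sampling."""
    node = trie
    for tok in prefix:
        node = node.get(tok, {})
    return [t for t in node if t != "$"]

corpus = {"vidA": [[3, 1, 4], [2, 7, 1]], "vidB": [[3, 1, 8]]}
trie = build_trie(corpus)
```

At each decoding step the beam expands only along `allowed_next` tokens, so every completed sequence is guaranteed to map back to at least one indexed video.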
QTSplus Inference (Li et al., 14 Nov 2025)
- For each (query, video) pair, the cross-attention and budget modules select the top-n most relevant vision tokens.
- The selected tokens, after re-encoding, are concatenated with the text query and fed to a frozen or fine-tuned LLM, which then performs the downstream multimodal task (e.g., answering VQA or temporal localization).
- No modifications are required to the backbone Vision Transformer or LLM, as token selection is self-contained.
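Order-preserving selection plus a positional “re-encoding” step can be sketched as follows; the sinusoidal re-attachment of original frame indices is a hypothetical stand-in for the learned lightweight re-encoder, shown only to make the idea of preserved temporal structure concrete.

```python
import numpy as np

rng = np.random.default_rng(4)
Nv, D = 120, 16
V = rng.normal(size=(Nv, D))          # vision tokens in temporal order
scores = rng.normal(size=Nv)          # relevance scores from the scorer
n = 12                                # predicted token budget

# Hard top-n gating, then restore temporal order so downstream positional
# reasoning ("tree"-level ordering) still sees tokens in sequence.
keep = np.sort(np.argsort(scores)[-n:])

# Stand-in re-encoder: re-attach each kept token's ORIGINAL index via
# sinusoidal embeddings before handing tokens to the frozen LLM.
pos = keep[:, None] / (10000.0 ** (np.arange(D)[None, :] / D))
selected = V[keep] + np.where(np.arange(D) % 2 == 0, np.sin(pos), np.cos(pos))
```

Because positions are taken from the original indices (not the compacted ones), the LLM can still localize events in absolute video time.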
5. Empirical Results and Ablation Findings
| Setting | Ablation | R@1 Drop (MSR-VTT/1k) | R@1 Drop (MSR-VTT/10k+) |
|---|---|---|---|
| GRDR | w/o Multi-View | –1.2 | –2.4 |
| GRDR | w/o L_align | –7.1 | –10.2 |
- Removing the multi-view design significantly reduces recall, especially as the index grows. Eliminating the contrastive alignment loss L_align is particularly damaging, degrading R@1 by up to 10.2 points on MSR-VTT, confirming that cross-modal and multi-path alignment is necessary to reach dense-retrieval parity (Zhao et al., 29 Jan 2026).
QTSplus achieves:
- ≈89% reduction in vision token count and ≈28% relative latency reduction (83 s → 60 s) on an A100 GPU at 600 frames.
- Improvement of +20.49 and +5.63 points for TempCompass direction and order accuracies, respectively, against baseline Qwen2.5-VL on relevant long-video benchmarks (Li et al., 14 Nov 2025).
6. Broader Implications and Generalizations
Query-guided multi-view tokenizers focus computation and memory on query-relevant information, enabling efficient scaling to large-scale retrieval and long-duration video reasoning tasks. The forest + trees abstraction in QTSplus is extensible to spatial, temporal, or camera-view multi-stream modalities by instantiating separate streams and re-encoders per view. The paradigm may generalize to other high-dimensional sequential domains such as audio or point clouds, provided that domain-appropriate cross-attention and re-encoding modules are defined. A plausible implication is that these architectures open the door for streaming or continual-inference applications, as instance-adaptive selection and view specialization can evolve with temporal progression or changing query context (Li et al., 14 Nov 2025).
7. Related Works and Distinctions
GRDR (Zhao et al., 29 Jan 2026) integrates with generative semantic-ID retrievers and trie-based decoding for scalable text-to-video retrieval with dense reranking. QTSplus (Li et al., 14 Nov 2025) targets multimodal LLMs for long-video understanding, serving as a general information bottleneck and global-local representation compressor. A unifying distinction, relative to prior static or random tokenization, is the query-supervised tailoring of token selection, which bridges the gap between scalable efficiency and strong recall or comprehension, as shown by substantial empirical gains at drastically lower computational cost.