
Query-Guided Multi-View Tokenizer

Updated 5 February 2026
  • The Query-Guided Multi-View Tokenizer is a neural module that conditions token selection on both input signals and queries to generate multiple semantic views.
  • It employs specialized architectures such as GRDR and QTSplus, using cross-modal contrastive losses and cross-attention for adaptive multi-view tokenization.
  • Empirical results demonstrate notable improvements in retrieval recall and efficiency for text-to-video search and long-video understanding tasks.

A Query-Guided Multi-View Tokenizer is a class of neural modules that produce multiple task- or query-adaptive discrete or continuous representations (termed views or token sequences) for each high-dimensional input, such as a video. These modules condition token selection or quantization on both the input signal and the downstream text or query, enabling diverse semantic access paths. This alleviates semantic ambiguity, supports efficient retrieval over large corpora, and allows the model to focus computational resources on query-relevant signal. Recent work spans two lines: generative retrieval for text-to-video search and multimodal LLM (MLLM) scaling for long-video understanding (Zhao et al., 29 Jan 2026, Li et al., 14 Nov 2025).

1. Core Architecture and Functional Principles

A query-guided multi-view tokenizer ingests an input—typically the output of a vision backbone (such as a Vision Transformer or Vision-LLM)—and produces multiple token sequences or semantic identifiers (IDs) per input, each corresponding to a different possible semantic “view” aligned to distinct query intents. The module’s central innovation is leveraging query supervision or cross-modal information for representation diversification, as opposed to static or data-agnostic tokenization.

GRDR Tokenizer (Text-to-Video Retrieval)

In the GRDR framework for text-to-video retrieval (Zhao et al., 29 Jan 2026), this tokenizer:

  • Extracts a global high-level feature f_v \in \mathbb{R}^{d_f} from each video.
  • Passes f_v to N_v independent sub-encoders \phi_i, generating N_v latent vectors z_i = \phi_i(f_v).
  • Specializes each \phi_i for a learned query intent by mining training captions for K = N_v clusters via k-means, and applying a cross-modal contrastive loss L_{CL} to align z_i with the decoder states associated with each cluster.
  • Each z_i is quantized via hierarchical residual quantization into a sequence of M discrete semantic ID tokens.
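The view-generation step can be sketched in a few lines. All dimensions are toy values, and the random linear maps stand in for the learned sub-encoders \phi_i (which the paper does not specify in detail); this is an illustrative sketch, not the actual GRDR parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d_f, N_v, d_z = 64, 4, 32  # feature dim, number of views, latent dim (illustrative)

# One independent sub-encoder phi_i per view, here a random linear map
# as a stand-in for the learned sub-encoders.
W = rng.standard_normal((N_v, d_z, d_f)) / np.sqrt(d_f)

def multi_view_encode(f_v):
    """Map one global video feature f_v to N_v view latents z_i = phi_i(f_v)."""
    return np.stack([W[i] @ f_v for i in range(N_v)])  # shape (N_v, d_z)

f_v = rng.standard_normal(d_f)   # the global video feature
Z = multi_view_encode(f_v)
print(Z.shape)                   # (4, 32): one latent vector per semantic view
```

Each row of Z would then be quantized independently, so a single video ends up with N_v distinct semantic-ID sequences.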

QTSplus (Long-Video MLLMs)

In QTSplus (Li et al., 14 Nov 2025), the tokenizer:

  • Receives vision tokens X \in \mathbb{R}^{M \times d} and a text query Q \in \mathbb{R}^{L \times d}.
  • Uses a cross-attention scorer to compute per-token relevance r, then predicts a dynamic top-n token budget from the query and the video's complexity.
  • Selects the most relevant tokens (via a differentiable relaxation during training and hard Top-n gating at inference), then applies a lightweight re-encoder to preserve the temporal and positional structure of the kept tokens.

2. Mathematical Formulation and Training Objectives

  • Contrastive Alignment (L_{CL}): Encourages each z_i to approach its assigned query-cluster decoder state h^{(m)}.
  • Residual Quantization: For each view i, quantize z_i into M tokens c_i^{(1:M)} with codebooks C^{(m)}. At each level,

r_i^{(1)} = z_i, \quad r_i^{(m)} = z_i - \sum_{l=1}^{m-1} e^{(l)}_{c_i^{(l)}}

c_i^{(m)} = \underset{k}{\arg\max} \, \cos\big(r_i^{(m)}, e^{(m)}_k\big)
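The two recursions above can be executed directly: at each level the residual is matched against that level's codebook by cosine similarity, and the chosen embedding is subtracted before moving to the next level. Codebook size and dimensions are illustrative, and the random codebooks stand in for the k-means-initialized ones used in training.

```python
import numpy as np

rng = np.random.default_rng(1)
d_z, M, K = 32, 3, 16  # latent dim, quantization levels, codebook size (illustrative)

# One codebook C^(m) of K embeddings per level (k-means-initialized in the paper).
codebooks = rng.standard_normal((M, K, d_z))

def residual_quantize(z):
    """Quantize z into M codes c^(1:M) via cosine-similarity argmax on residuals."""
    codes, residual = [], z.copy()
    for m in range(M):
        C = codebooks[m]
        sims = (C @ residual) / (
            np.linalg.norm(C, axis=1) * np.linalg.norm(residual) + 1e-9
        )
        c = int(np.argmax(sims))   # c^(m) = argmax_k cos(r^(m), e_k^(m))
        codes.append(c)
        residual = residual - C[c] # r^(m+1) = z - sum of chosen embeddings so far
    return codes, residual

z = rng.standard_normal(d_z)
codes, final_residual = residual_quantize(z)
print(codes)  # M discrete semantic-ID tokens for this view
```

Because later levels quantize what earlier levels left unexplained, the code sequence is hierarchical: coarse semantics first, finer detail at deeper levels.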

  • RQ Loss (L_{RQ}): Penalizes codebook and quantization error via stop-gradient and reconstruction objectives.
  • Hierarchical Consistency (L_{HC}): Cross-entropy loss ensuring code-assignment stability across quantization layers.
  • Unified Objective:

L_{\text{total}} = \lambda_1 L_{CE} + \lambda_2 L_{HC} + \lambda_3 L_{RQ} + \lambda_4 L_{Rec}

Training of the GRDR tokenizer proceeds layer-wise, using L_{CL} for cross-modal alignment and initializing codebooks via k-means before joint co-training.

  • Cross-Attention Scoring (QTSplus): Compute attention weights \alpha \in \mathbb{R}^{h \times L \times M}, with per-token relevance r_i = \max_{h, \ell} \alpha_{h, \ell, i}.
  • Budget Prediction: A small MLP combines the mean query embedding, the number of vision tokens, the maximum relevance, and the token-score entropy to predict an expected retain ratio \rho; the budget is then n = \min(\lceil \rho M \rceil, n_{\text{max}}).
  • Differentiable Token Selection (Training): Gumbel-Softmax with a threshold t (solved via Newton's method) enables straight-through gradient flow.
  • Loss: Supervised fine-tuning loss (distillation from teacher), plus compute-aware regularization:

L_{\text{total}} = \mathbb{E}_{(V,Q,y)}\big[ L_{\text{SFT}}(Q, X', y) + \lambda_t (\rho M)^2 / n_{\text{max}}^2 + \lambda_m (\rho M) / n_{\text{max}} + \lambda_s (\rho - \bar{\rho})^2 \big]
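The inference-time path of the scorer and budget head can be sketched with toy tensors. The per-head projections, the budget MLP, and the re-encoder are replaced here by simplified stand-ins (shared unprojected attention, a fixed \rho); only the shapes and the hard Top-n gating follow the description above.

```python
import numpy as np

rng = np.random.default_rng(2)
h, L, M, d = 2, 5, 40, 16   # heads, query length, vision tokens, dim (illustrative)
n_max = 8                   # hard cap on the token budget

X = rng.standard_normal((M, d))  # vision tokens
Q = rng.standard_normal((L, d))  # text-query tokens

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention weights alpha in R^{h x L x M}; per-head W_q/W_k projections
# are omitted, so all heads score identically in this sketch.
alpha = softmax(np.stack([Q @ X.T / np.sqrt(d) for _ in range(h)]), axis=-1)
r = alpha.max(axis=(0, 1))       # r_i = max over heads and query positions

rho = 0.15                       # retain ratio; predicted by a small MLP in QTSplus
n = min(int(np.ceil(rho * M)), n_max)  # n = min(ceil(rho*M), n_max)
keep = np.argsort(r)[-n:]        # hard Top-n gating (inference-time path)
X_kept = X[np.sort(keep)]        # re-sort indices to preserve temporal order
print(X_kept.shape)              # (6, 16): n of the M tokens survive
```

During training, the hard argsort-based gate would be replaced by the Gumbel-Softmax relaxation so gradients can flow to the scorer and budget head.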

3. Multi-View Representation: Semantic Paths and Coverage

Both approaches operationalize “multi-view” as the explicit modeling of multiple, specialized access paths for one input instance:

  • GRDR: Each video is mapped to N_v code sequences, each capturing a distinct semantic "intent." This enables a generative retriever to match each video under multiple potential query perspectives, mitigating the ambiguity inherent in single-token assignments. Views are tied to k-means-derived query clusters, which encode recurring search or caption archetypes in the dataset. At inference, trie-constrained decoding ensures only encoded sequences are considered as candidates.
  • QTSplus: Introduces a dual concept of “forest” (global, coarse selection of important regions) and “tree” (local, fine temporal ordering within kept tokens). The cross-attention scorer and budget head yield a query-sensitive global coverage, while the re-encoder preserves local ordering and context needed for second-level localization.

A plausible implication is that multi-view tokenization is broadly applicable beyond video, wherever semantic ambiguity from one-to-many mappings (e.g., multi-intent, multi-context) exists.

4. Inference Algorithms and Integration with Downstream Models

  • GRDR: Each video's N_v code sequences are inserted as paths in a trie.
  • For a given text query, the decoder performs trie-constrained, auto-regressive decoding: each partial code sequence must correspond to an existing prefix.
  • Beam search yields up to B \times N_v candidate paths, which are mapped back to videos, deduplicated, and then reranked with a dense retriever for final selection.
  • QTSplus: For each (query, video) pair, the cross-attention and budget modules select the top-n most relevant vision tokens.
  • The selected tokens, after re-encoding, are concatenated with the text query and fed to a frozen or fine-tuned LLM, which then performs the downstream multimodal task (e.g., answering VQA or temporal localization).
  • No modifications are required to the backbone Vision Transformer or LLM, as token selection is self-contained.
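The trie-constrained decoding step above can be illustrated with a minimal sketch: only code sequences that were actually indexed can be generated, so every completed path maps back to a real video. The toy semantic-ID sequences and video ids are made up for illustration.

```python
# Trie-constrained decoding sketch: the decoder may only emit a next code
# that extends some indexed semantic-ID sequence.
def build_trie(seq_to_video):
    """Insert each code sequence as a path; a '$' leaf stores the video id."""
    trie = {}
    for seq, vid in seq_to_video.items():
        node = trie
        for c in seq:
            node = node.setdefault(c, {})
        node["$"] = vid
    return trie

def allowed_next(trie, prefix):
    """Codes that legally extend `prefix` (empty list if prefix is invalid)."""
    node = trie
    for c in prefix:
        if c not in node:
            return []
        node = node[c]
    return [c for c in node if c != "$"]

# Two views of video "v1", one view of "v2" (toy sequences, N_v = 2, M = 3).
trie = build_trie({(3, 7, 1): "v1", (3, 2, 9): "v1", (5, 7, 1): "v2"})
print(allowed_next(trie, (3,)))  # after code 3, only codes 7 and 2 are legal
```

A beam-search decoder would intersect its per-step candidates with `allowed_next` at every position, guaranteeing that all B \times N_v surviving paths resolve to indexed videos.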

5. Empirical Results and Ablation Findings

| Setting | Ablation        | R@1 Drop (MSR-VTT/1k) | R@1 Drop (MSR-VTT/10k+) |
|---------|-----------------|-----------------------|-------------------------|
| GRDR    | w/o Multi-View  | –1.2                  | –2.4                    |
| GRDR    | w/o L_{CL}      | –7.1                  | –10.2                   |
  • Removal of the multi-view design significantly reduces recall, especially as index size grows. Eliminating the contrastive alignment loss (L_{CL}) is particularly catastrophic, degrading R@1 by up to –10.2 on MSR-VTT, confirming the necessity of cross-modal and multi-path alignment to achieve dense-retrieval parity (Zhao et al., 29 Jan 2026).

QTSplus achieves:

  • ≈89% reduction in vision token count (180k → 20k) and ≈28% relative latency reduction (83 s → 60 s) on an A100 GPU at 600 frames.
  • Improvement of +20.49 and +5.63 points for TempCompass direction and order accuracies, respectively, against baseline Qwen2.5-VL on relevant long-video benchmarks (Li et al., 14 Nov 2025).

6. Broader Implications and Generalizations

Query-guided multi-view tokenizers focus computation and memory on query-relevant information, enabling efficient scaling to large-scale retrieval and long-duration video reasoning tasks. The forest + trees abstraction in QTSplus is extensible to spatial, temporal, or camera-view multi-stream modalities by instantiating separate streams and re-encoders per view. The paradigm may generalize to other high-dimensional sequential domains such as audio or point clouds, provided that domain-appropriate cross-attention and re-encoding modules are defined. A plausible implication is that these architectures open the door for streaming or continual-inference applications, as instance-adaptive selection and view specialization can evolve with temporal progression or changing query context (Li et al., 14 Nov 2025).

GRDR (Zhao et al., 29 Jan 2026) integrates with generative semantic-ID retrievers and trie-based decoding for scalable text-to-video retrieval with dense reranking. QTSplus (Li et al., 14 Nov 2025) targets multimodal LLMs for long-video understanding, serving as a general information bottleneck and global-local representation compressor. A unifying distinction, relative to prior static or random tokenization, is the query-supervised tailoring of token selection, which bridges the gap between scalable efficiency and strong recall or comprehension, as shown by substantial empirical gains at drastically lower computational cost.
