
Dynamic Key Merging in Vision-Language Models

Updated 24 January 2026
  • Dynamic Key Merging is a token reduction paradigm that adaptively compresses visual tokens based on image complexity, preserving task performance.
  • It employs Dynamic Token Merging (DToMe) to selectively compute token similarities and perform weighted merging according to per-layer budgets.
  • Virtual Token Unmerging (VTU) reconstructs full-length sequences in LLMs, enabling efficient cross-modal attention with minimal performance loss.

Dynamic Key Merging is a token reduction paradigm for vision-language models (VLMs) that targets the inefficiencies associated with fixed-length visual token outputs in transformer architectures. Central to this approach is the DyMU framework, which consists of Dynamic Token Merging (DToMe) for content-adaptive token compression and Virtual Token Unmerging (VTU) for attention-consistent expansion within LLMs. DyMU operates entirely training-free, introducing dynamic, image-dependent token budgets and efficient attention mechanisms that preserve downstream task performance across diverse VLM and LLM backbones (Wang et al., 23 Apr 2025).

1. Dynamic Token Merging: Principles and Workflow

Dynamic Token Merging (DToMe) addresses the inefficiency of uniform token budgets per image by selectively merging redundant visual token embeddings in a vision transformer's output sequence. Given a token sequence $x_i \in \mathbb{R}^{N_i \times D}$ at transformer layer $i$, DToMe seeks to minimize $N_i$ dynamically in response to image complexity.

Key Steps

  • Token-Pair Similarity via Bipartite Matching: Tokens are split into two interleaved sets $\mathbb{A}$ and $\mathbb{B}$ (e.g., by even/odd indices). For each $t\in\mathbb{A}$, a similarity score is computed via dot product between key projections $k_i$:

$$t_B = \arg\max_{n\in\mathbb{B}} \langle k_i[t], k_i[n] \rangle, \qquad S_i[t] = \langle k_i[t], k_i[t_B] \rangle$$

  • Thresholding and Dynamic Merge Budget: A per-layer target merge count $r_i$ is selected. Offline, the algorithm determines a threshold $\tau_i$ such that the number of retained edges with $S_i^{(b)}[t] > \tau_i$ matches $B r_i$ over a batch of $B$ images. At inference, $\tau_i$ is fixed; images with higher redundancy yield more merges.
  • Merging Procedure: For each selected edge ($S_i[t] > \tau_i$), tokens are merged via a position-weighted average:

$$x_i[t_B] \leftarrow \frac{|P_i[t]|\,x_i[t] + |P_i[t_B]|\,x_i[t_B]}{|P_i[t]| + |P_i[t_B]|}$$

Position sets $P_i[t]$ and $P_i[t_B]$ are updated, with the merged token $t$ dropped from the active sequence.

  • Size-Weighted Self-Attention: To retain semantic coverage post-merge, scaled-dot-product attention incorporates a logarithmic token-size bias:

$$A = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d}} + \log\bigl[|P_i[1]|,\ldots,|P_i[N_i]|\bigr]\right)$$

  • Image Complexity and Token Survival: JPEG-compression-based complexity $C(I)=\mathrm{size}_{\mathrm{JPEG}}(I)/(H\times W)$ correlates strongly with post-merge token count.
  • Layer-By-Layer Adaptivity: Merge budgets $\{r_i\}$ may be fixed or decay linearly, enabling end-to-end compression schedule optimization.
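The complexity metric above is easy to sketch. The example below substitutes zlib-compressed size for an actual JPEG encoder to stay dependency-light; the encoder choice is an assumption of this sketch, but the ordering of simple versus busy images is the same:

```python
import zlib
import numpy as np

def compression_complexity(img: np.ndarray) -> float:
    """Proxy for C(I) = size_JPEG(I) / (H * W): compressed bytes per pixel.

    zlib stands in for a JPEG encoder here (illustrative substitution);
    flat regions compress well, so low-complexity images score lower.
    """
    h, w = img.shape[:2]
    return len(zlib.compress(img.tobytes())) / (h * w)

rng = np.random.default_rng(0)
flat = np.full((224, 224, 3), 128, dtype=np.uint8)           # low complexity
noise = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)  # high complexity
assert compression_complexity(flat) < compression_complexity(noise)
```

Under DToMe, the flat image would fall further below the calibrated thresholds and thus survive with fewer tokens than the noisy one.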

Inference Pseudocode

Input: image I of size HxW
x_1 <- ViT_patch_projection(I)      # N1 x D
P_1[t] <- {t} for t=1..N1
for i in 1..L:                     
    [x_i, k_i, Q_i, K_i, V_i] <- TransformerBlock(x_i, P_i)
    split tokens into A,B
    for each t in A:
        find t_B = argmax_{n in B} <k_i[t],k_i[n]>
        S[t] = <k_i[t],k_i[t_B]>
    find all edges where S[t] > tau[i]
    for each edge t->t_B:
        merge t into t_B via weighted average
        update P_i[t_B] = P_i[t_B] ∪ P_i[t]
        drop token t
    x_{i+1} <- x_i[ tokens with nonempty P_i ]
    P_{i+1} <- corresponding P_i
endfor
return x_{L+1}, P_{L+1}   # final variable-length token set
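The merge step of the loop above can be made concrete. This NumPy sketch implements one layer's bipartite matching, thresholding, and size-weighted averaging; it tracks only the token sizes $|P_i[t]|$ rather than full position sets, and omits the surrounding attention computation:

```python
import numpy as np

def dtome_step(x, k, sizes, tau):
    """One DToMe merge step (sketch).

    x: (N, D) token embeddings; k: (N, D) key projections;
    sizes: (N,) token sizes |P_i[t]|; tau: calibrated threshold.
    Returns reduced (x, sizes); each merged token's source row is dropped.
    """
    N = x.shape[0]
    A, B = np.arange(0, N, 2), np.arange(1, N, 2)  # interleaved split
    sim = k[A] @ k[B].T                            # <k[t], k[n]> for t in A, n in B
    best = sim.argmax(axis=1)                      # index of t_B per t in A
    score = sim[np.arange(len(A)), best]           # S[t]
    keep = np.ones(N, dtype=bool)
    for j, t in enumerate(A):
        if score[j] > tau:                         # merge t into its best match
            tb = B[best[j]]
            w_t, w_b = sizes[t], sizes[tb]
            x[tb] = (w_t * x[t] + w_b * x[tb]) / (w_t + w_b)  # weighted average
            sizes[tb] += sizes[t]
            keep[t] = False                        # drop the merged token
    return x[keep], sizes[keep]

rng = np.random.default_rng(0)
x, k = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
x2, s2 = dtome_step(x.copy(), k, np.ones(8), tau=0.0)
assert x2.shape[0] <= 8 and s2.sum() == 8.0  # total token "mass" is conserved
```

Note that the size vector always sums to the original token count: merging moves positions into survivors rather than discarding them, which is what the size-weighted attention bias later exploits.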

2. Virtual Token Unmerging in LLMs

Virtual Token Unmerging (VTU) simulates the attention dynamics of a full-length sequence in LLMs when fed compressed visual tokens. This approach enables downstream models—trained on fixed-length sequences—to preserve alignment and performance without explicit architectural retraining.

Mechanism

  • Mapping and Sparse Expansion: Given a compressed set $e_{\mathrm{un}}\in\mathbb{R}^{N_{un}\times D}$, VTU constructs a sparse expansion matrix $M\in\{0,1\}^{N\times N_{un}}$, assigning each of the $N$ original positions to its representative token:

$$e = M\,e_{\mathrm{un}}$$

  • Sequence-Independent Layers: For elementwise operations $f$ (e.g., GeLU, LayerNorm), evaluation reduces to:

$$f(e) = M\,f(e_{\mathrm{un}})$$

  • RoPE-Based Self-Attention: Rotary Position Embedding (RoPE) attention, sensitive to absolute positions, is rederived in the compressed domain via four sparse-matrix terms:

A=CMQunKunTMTC+SMQunKunTMTS+A = C\,M\,Q_{\mathrm{un}}\,K_{\mathrm{un}}^T\,M^T\,C + S\,M\,Q_{\mathrm{un}}\,K_{\mathrm{un}}^T\,M^T\,S + \ldots

This reduces complexity from O(N2D)O(N^2D) to O(NNunD)O(N\,N_{un}\,D).

  • Attention Output: Softmax and value lookup are realized with:

$$f(e) = \mathrm{Softmax}\Bigl(\frac{A}{\sqrt{D}}\Bigr)\,V, \qquad V = M\,V_{\mathrm{un}}$$

  • Re-Merging in Subsequent Layers: To restore $N_{un}$-row representations for subsequent blocks:

$$e_{\mathrm{un}}^{(\text{next})} = (M^T M)^{-1} M^T f(e)$$

This averages rows associated with the same group. Empirically, this incurs minimal degradation (<3%) in end-to-end performance.
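A toy NumPy example of the expansion and re-merge maps (the grouping of five positions onto three tokens is illustrative, not from the paper):

```python
import numpy as np

# Representative compressed-token index for each of N = 5 full positions.
group = np.array([0, 0, 1, 2, 2])
N, N_un = len(group), group.max() + 1

M = np.zeros((N, N_un))
M[np.arange(N), group] = 1.0       # sparse 0/1 expansion matrix

e_un = np.array([[1.0, 2.0],       # compressed tokens, shape (N_un, D)
                 [3.0, 4.0],
                 [5.0, 6.0]])
e = M @ e_un                       # e = M e_un: replicate tokens to positions

# (M^T M)^{-1} M^T averages all rows sharing a representative; M^T M is the
# diagonal matrix of group sizes, so the inverse is cheap in practice.
remerge = np.linalg.inv(M.T @ M) @ M.T
assert np.allclose(remerge @ e, e_un)  # exact here, since grouped rows are equal
```

After a real attention layer the rows within a group differ, so the re-merge is a genuine group average; per the reported ablations this approximation costs under 3% end-to-end.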

3. Computational Efficiency and Performance Profile

DyMU introduces substantial computational savings with negligible loss of accuracy. Reported statistics include:

| Configuration | Token Count (LLaVA-1.5) | FLOPs (Self-Attn) | Performance (VQA avg.) |
|---|---|---|---|
| Full | 576 | 1.36 GFLOPs | 55.8 |
| DyMU-low | 89 ± 27 | 72 MFLOPs | 54.5 (97.7%) |
| DyMU-mid | 195 ± 47 | 358 MFLOPs | 55.3 (99.1%) |
| DyMU-high | 394 ± 57 | 1.27 GFLOPs | 56.0 (100.4%) |

For DyMU-low, token reduction reaches 84.5% versus the full model, with FLOPs savings up to 19×, while performance remains within 97.7% of baseline. Compared to ToMe (a previous fixed-length, training-free technique), DyMU-low obtains higher average scores (54.5 vs. 52.6) and adaptively varies token count per image.

A plausible implication is that token-budgeted models using DyMU can flexibly allocate compute without retraining or manual thresholding, which is substantiated by its compatibility with diverse visual encoders (CLIP, SigLIP) and LLMs (Vicuna-7B, Qwen-2.5).

4. Adaptive Compression and Control Mechanisms

DyMU’s dynamic scheme confers granular control over computational expenses via per-layer and per-image merge budgets. The offline threshold calibration ensures average budgets are met across batches, while individual images—assessed by JPEG-based complexity—yield contextually appropriate token reductions. This adaptation enables finer resource allocation, which may be critical for real-time or resource-constrained inference scenarios.
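The offline threshold calibration can be sketched as an order statistic over a batch's similarity scores (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def calibrate_tau(batch_scores, r_i):
    """Pick tau_i so that exactly B * r_i edges across the batch exceed it.

    batch_scores: list of per-image similarity arrays S_i[t] (one per image);
    r_i: target merge count per image for this layer.
    """
    all_scores = np.concatenate(batch_scores)
    k = len(batch_scores) * r_i            # total edges to merge over the batch
    # tau_i is the (k+1)-th largest score, so k scores lie strictly above it
    # (assuming no ties, which holds for continuous-valued similarities).
    return np.sort(all_scores)[::-1][k]

rng = np.random.default_rng(0)
batch = [rng.normal(size=100) for _ in range(4)]  # B = 4 images, 100 edges each
tau = calibrate_tau(batch, r_i=10)
assert sum((s > tau).sum() for s in batch) == 40  # B * r_i merges on average
```

At inference the calibrated `tau` is frozen, so a redundant image contributes more above-threshold edges (and thus more merges) than a complex one, which is exactly the per-image adaptivity described above.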

Layer-by-layer budget strategies (constant or decaying) allow further fine-tuning of the compression curve, facilitating trade-offs between throughput and task-specific accuracy.
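The two budget strategies can be written down in a few lines; the normalization scheme below is one plausible way to realize a linearly decaying schedule and is an assumption of this sketch:

```python
def budget_schedule(L, R_total, mode="constant"):
    """Per-layer merge budgets {r_i} for L layers totaling ~R_total merges.

    "constant" splits the budget evenly; "decay" weights early layers more
    (weights L, L-1, ..., 1), one illustrative form of linear decay.
    """
    if mode == "constant":
        return [R_total // L] * L
    weights = list(range(L, 0, -1))
    s = sum(weights)
    return [round(R_total * w / s) for w in weights]

const = budget_schedule(12, 480, "constant")
decay = budget_schedule(12, 480, "decay")
assert sum(const) == 480 and const[0] == const[-1]
assert decay[0] > decay[-1]   # decay merges aggressively in early layers
```

A decaying schedule front-loads compression where visual tokens are most redundant, trading a slightly steeper accuracy risk for larger savings in every subsequent layer.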

5. Integration and Applicability Across Model Architectures

DyMU is expressly designed as a training-free, plug-and-play solution. Its token merging and virtual expansion modules are compatible with mainstream transformer vision encoders (including AnyRes variants), and downstream LLMs with standard attention stacks and RoPE position embeddings. The absence of retraining requirements streamlines deployment onto existing model checkpoints, contributing to broad utility in both research prototypes and production VLM systems.

Empirical evaluation on image and video understanding benchmarks confirms near-parity in downstream performance (exceeding 96% of baseline accuracy with ~15% token budgets), signaling robust generalization across tasks. The framework's modularity suggests straightforward extension to other multimodal architectures using fixed-length cross-modal sequences.

6. Context, Comparison, and Implications

DyMU’s innovation lies in dynamic, content-adaptive merging and efficient re-expansion, contrasting with traditional static token selection paradigms. Compared to fixed-length methods (e.g., ToMe), DyMU achieves higher accuracy while providing per-image flexibility. This suggests a paradigm shift in VLM design, emphasizing input-dependent computational scaling.

Experimental results on VQA-style benchmarks and ablation studies demonstrate that DyMU’s approximation in VTU incurs only minor accuracy penalties, even with large reductions in token count and floating-point operations. A plausible implication is that future VLM deployments may increasingly favor dynamic key merging architectures for efficiency-critical applications.

No significant controversies are noted; however, ongoing work may investigate the limits of compression for highly complex images and the interaction of variable-length representations with specialized LLM attention mechanisms. The methodology described in DyMU is fully documented in (Wang et al., 23 Apr 2025) and remains a reference point for dynamic token reduction research in the vision-language domain.
