Dynamic Key Merging in Vision-Language Models
- Dynamic Key Merging is a token reduction paradigm that adaptively compresses visual tokens based on image complexity, preserving task performance.
- It employs Dynamic Token Merging (DToMe) to selectively compute token similarities and perform weighted merging according to per-layer budgets.
- Virtual Token Unmerging (VTU) reconstructs full-length sequences in LLMs, enabling efficient cross-modal attention with minimal performance loss.
Dynamic Key Merging is a token reduction paradigm for vision-language models (VLMs) that targets the inefficiencies associated with fixed-length visual token outputs in transformer architectures. Central to this approach is the DyMU framework, which consists of Dynamic Token Merging (DToMe) for content-adaptive token compression and Virtual Token Unmerging (VTU) for attention-consistent expansion within LLMs. DyMU operates entirely training-free, introducing dynamic, image-dependent token budgets and efficient attention mechanisms that preserve downstream task performance across diverse VLM and LLM backbones (Wang et al., 23 Apr 2025).
1. Dynamic Token Merging: Principles and Workflow
Dynamic Token Merging (DToMe) addresses the inefficiency of uniform token budgets per image by selectively merging redundant visual token embeddings in a vision transformer's output sequence. Given a token sequence x_i of length N_i at transformer layer i, DToMe seeks to reduce N_i dynamically in response to image complexity.
Key Steps
- Token-Pair Similarity via Bipartite Matching: Tokens are split into two interleaved sets A and B (e.g., by even/odd indices). For each t in A, a similarity score is computed via dot product between key projections k: S[t] = max_{n in B} <k[t], k[n]>.
- Thresholding and Dynamic Merge Budget: A per-layer target merge count r_i is selected. Offline, the algorithm determines a threshold tau_i such that the number of retained edges with S[t] > tau_i matches r_i on average over a batch of images. At inference, tau_i is fixed; images with higher redundancy yield more merges.
- Merging Procedure: For each selected edge (t, t_B), tokens are merged via a size-weighted average over the position sets P: x[t_B] <- (|P[t]|·x[t] + |P[t_B]|·x[t_B]) / (|P[t]| + |P[t_B]|). The position sets are updated as P[t_B] <- P[t_B] ∪ P[t], and the merged token t is dropped from the active sequence.
- Size-Weighted Self-Attention: To retain semantic coverage post-merge, scaled-dot attention incorporates a logarithmic token-size bias: Attn(Q, K, V) = softmax(QK^T / sqrt(d) + log s) V, where s[m] = |P[m]| is the number of original patches represented by merged token m.
- Image Complexity and Token Survival: JPEG-compression-based complexity correlates strongly with post-merge token count.
- Layer-By-Layer Adaptivity: Merge budgets may be fixed or decay linearly, enabling end-to-end compression schedule optimization.
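The size-weighted attention step above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code; the function name and array shapes are assumptions:

```python
import numpy as np

def size_weighted_attention(Q, K, V, sizes):
    """Scaled-dot attention with a logarithmic token-size bias.

    Q, K, V : (M, d) arrays for M merged tokens.
    sizes   : (M,) array; sizes[m] = number of original patches merged
              into token m (|P[m]| in the text).
    """
    d = Q.shape[-1]
    # Bias each key column by log |P[m]| so larger merged tokens
    # receive proportionally more attention mass.
    logits = Q @ K.T / np.sqrt(d) + np.log(sizes)[None, :]
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```

With all sizes equal to one the bias vanishes and this reduces to standard scaled-dot attention.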
Inference Pseudocode
```
Input: image I of size H x W
x_1 <- ViT_patch_projection(I)               # N_1 x D patch tokens
P_1[t] <- {t} for t = 1..N_1                 # position sets
for i in 1..L:
    [x_i, k_i, Q_i, K_i, V_i] <- TransformerBlock(x_i, P_i)
    split tokens into A, B                   # e.g., even/odd indices
    for each t in A:
        t_B  <- argmax_{n in B} <k_i[t], k_i[n]>
        S[t] <- <k_i[t], k_i[t_B]>
    find all edges where S[t] > tau[i]
    for each edge t -> t_B:
        merge t into t_B via weighted average
        P_i[t_B] <- P_i[t_B] ∪ P_i[t]
        drop token t
    x_{i+1} <- x_i[tokens with nonempty P_i]
    P_{i+1} <- corresponding P_i
return x_{L+1}, P_{L+1}                      # final variable-length token set
```
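The merge step inside the loop can be made concrete with a minimal NumPy sketch. The threshold `tau`, the precomputed key projections, and the function name are illustrative assumptions; real DToMe runs inside the ViT blocks rather than as a standalone function:

```python
import numpy as np

def dtome_merge_step(x, keys, sizes, tau):
    """One DToMe layer step: bipartite matching on keys, thresholding
    at tau, and size-weighted merging of the selected edges.

    x     : (N, D) token embeddings
    keys  : (N, d) key projections used for similarity
    sizes : (N,)  number of original patches behind each token
    Returns merged (x, sizes) with a variable number of tokens.
    """
    A = np.arange(0, len(x), 2)              # even-indexed tokens (set A)
    B = np.arange(1, len(x), 2)              # odd-indexed tokens  (set B)
    sim = keys[A] @ keys[B].T                # |A| x |B| similarity scores
    best = sim.argmax(axis=1)                # best partner in B per token in A
    score = sim[np.arange(len(A)), best]

    keep = np.ones(len(x), dtype=bool)
    for a_idx in np.argsort(-score):         # highest-similarity edges first
        if score[a_idx] <= tau:              # only edges above the threshold
            break
        t, t_b = A[a_idx], B[best[a_idx]]
        w_t, w_b = sizes[t], sizes[t_b]
        x[t_b] = (w_t * x[t] + w_b * x[t_b]) / (w_t + w_b)  # size-weighted avg
        sizes[t_b] += sizes[t]               # P[t_B] <- P[t_B] ∪ P[t]
        keep[t] = False                      # drop the merged-away token
    return x[keep], sizes[keep]
```

Because `tau` is fixed at inference, the number of surviving tokens varies per image: redundant inputs cross the threshold more often and shrink further.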
2. Virtual Token Unmerging in LLMs
Virtual Token Unmerging (VTU) simulates the attention dynamics of a full-length sequence in LLMs when fed compressed visual tokens. This approach enables downstream models—trained on fixed-length sequences—to preserve alignment and performance without explicit architectural retraining.
Mechanism
- Mapping and Sparse Expansion: Given a compressed token sequence x̂ of length M with position sets P, VTU constructs a sparse expansion matrix E ∈ {0,1}^{N×M}, assigning each position n in the original length-N sequence to its representative merged token: E[n, m] = 1 iff n ∈ P[m], so that the expanded sequence is x = E x̂.
- Sequence-Independent Layers: For elementwise operations (e.g., GeLU, LayerNorm), f(E x̂) = E f(x̂), so evaluation is reduced to the M compressed rows.
- RoPE-Based Self-Attention: Rotary Position Embedding (RoPE) attention, sensitive to absolute positions, is rederived in the compressed domain: because the rotation matrices decompose into cosine and sine components, the full attention logits can be expressed via four sparse-matrix terms over the compressed queries and keys.
This reduces complexity from O(N²) in the full sequence length to O(M²) in the compressed length, with M ≪ N.
- Attention Output: Softmax normalization and the value lookup are likewise carried through the expansion matrix, so the attention output over all N positions is recovered without materializing the full N×N attention map.
- Re-Merging in Subsequent Layers: To restore M-row representations for subsequent blocks, the expanded output x is projected back via x̂ <- (EᵀE)⁻¹ Eᵀ x.
Since EᵀE is diagonal with the group sizes, this averages the rows associated with the same group. Empirically, this incurs minimal degradation (<3%) in end-to-end performance.
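A minimal NumPy sketch of the VTU bookkeeping, assuming the expansion-matrix formulation above (the group layout and variable names are illustrative):

```python
import numpy as np

def expansion_matrix(groups, N):
    """E in {0,1}^{N x M}: E[n, m] = 1 iff original position n was merged
    into compressed token m (groups[m] plays the role of P[m] in the text)."""
    E = np.zeros((N, len(groups)))
    for m, positions in enumerate(groups):
        E[positions, m] = 1.0
    return E

# Example: 6 original positions compressed into 3 tokens.
groups = [[0, 1], [2, 3, 4], [5]]
x_hat = np.random.default_rng(0).normal(size=(3, 4))   # compressed tokens (M x D)
E = expansion_matrix(groups, N=6)

# Elementwise layers commute with expansion: f(E @ x_hat) == E @ f(x_hat),
# so GeLU-style maps need only run on the M compressed rows.
gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
assert np.allclose(gelu(E @ x_hat), E @ gelu(x_hat))

# Re-merging after a block: group means via (E^T E)^{-1} E^T.
# E^T E is diagonal with the group sizes, so this averages each group.
x_full = E @ x_hat                                     # stand-in for a block output
remerged = np.linalg.inv(E.T @ E) @ E.T @ x_full
assert np.allclose(remerged, x_hat)
```

Since E has a single nonzero per row, in practice both the expansion and the group-mean re-merge are gather/scatter operations rather than dense matrix products.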
3. Computational Efficiency and Performance Profile
DyMU introduces substantial computational savings with negligible loss of accuracy. Reported statistics include:
| Configuration | Token Count (LLaVA-1.5) | FLOPs (Self-Attn) | Performance (VQA avg.) |
|---|---|---|---|
| Full | 576 | 1.36 GFLOPs | 55.8 |
| DyMU-low | – | 72 MFLOPs | 54.5 (97.7%) |
| DyMU-mid | – | 358 MFLOPs | 55.3 (99.1%) |
| DyMU-high | – | 1.27 GFLOPs | 56.0 (100.4%) |
For DyMU-low, the token count is reduced severalfold versus the full model, with self-attention FLOPs dropping from 1.36 GFLOPs to roughly 72 MFLOPs, while performance remains within 2.3% of baseline. Compared to ToMe, a previous fixed-length, training-free technique, DyMU-low obtains higher average scores (54.5 vs. 52.6) and adaptively varies token count per image.
A plausible implication is that token-budgeted models using DyMU can flexibly allocate compute without retraining or manual thresholding, which is substantiated by its compatibility with diverse visual encoders (CLIP, SigLIP) and LLMs (Vicuna-7B, Qwen-2.5).
4. Adaptive Compression and Control Mechanisms
DyMU’s dynamic scheme confers granular control over computational expenses via per-layer and per-image merge budgets. The offline threshold calibration ensures average budgets are met across batches, while individual images—assessed by JPEG-based complexity—yield contextually appropriate token reductions. This adaptation enables finer resource allocation, which may be critical for real-time or resource-constrained inference scenarios.
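The compression-based complexity signal can be illustrated with a stdlib stand-in. The paper ties merge counts to JPEG compressibility; this sketch substitutes `zlib` on raw pixel bytes to capture the same intuition, so the exact ratios differ from a JPEG-based measure:

```python
import zlib
import numpy as np

def compression_complexity(image):
    """Proxy for image complexity: compressed size / raw size.

    A zlib stand-in for the JPEG-based measure in the text: redundant
    images compress well (low ratio) and should therefore yield more
    token merges; noisy, detail-rich images compress poorly.
    """
    raw = np.ascontiguousarray(image, dtype=np.uint8).tobytes()
    return len(zlib.compress(raw)) / len(raw)

flat = np.zeros((64, 64), dtype=np.uint8)                        # highly redundant
noisy = np.random.default_rng(0).integers(0, 256, (64, 64), dtype=np.uint8)
# A flat image compresses far better than noise, so its complexity
# score is much lower and DToMe would merge more of its tokens.
```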
Layer-by-layer budget strategies (constant or decaying) allow further fine-tuning of the compression curve, facilitating trade-offs between throughput and task-specific accuracy.
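A constant or linearly decaying per-layer budget can be generated as follows; the function name, the decay direction (merging more in earlier layers), and the rounding policy are assumptions for illustration:

```python
def merge_schedule(total_merges, num_layers, mode="constant"):
    """Split a total merge budget across transformer layers.

    mode="constant": the same target r per layer.
    mode="decay":    linearly decaying weights, i.e. merge more in
                     early layers (an assumed direction of decay).
    Returns a list of per-layer integer budgets summing to total_merges.
    """
    if mode == "constant":
        weights = [1.0] * num_layers
    elif mode == "decay":
        weights = [float(num_layers - i) for i in range(num_layers)]  # L, L-1, ..., 1
    else:
        raise ValueError(mode)
    total_w = sum(weights)
    budgets = [int(total_merges * w / total_w) for w in weights]
    budgets[0] += total_merges - sum(budgets)  # assign rounding remainder
    return budgets
```

For example, with a 24-layer ViT and a total budget of 504 merges, the constant schedule assigns 21 merges per layer, while the decaying schedule front-loads the reduction.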
5. Integration and Applicability Across Model Architectures
DyMU is expressly designed as a training-free, plug-and-play solution. Its token merging and virtual expansion modules are compatible with mainstream transformer vision encoders (including AnyRes variants), and downstream LLMs with standard attention stacks and RoPE position embeddings. The absence of retraining requirements streamlines deployment onto existing model checkpoints, contributing to broad utility in both research prototypes and production VLM systems.
Empirical evaluation on image and video understanding benchmarks confirms near-parity in downstream performance, retaining most of baseline accuracy even at token budgets around 15%, signaling robust generalization across tasks. The framework’s modularity suggests straightforward extension to other multimodal architectures using fixed-length cross-modal sequences.
6. Context, Comparison, and Implications
DyMU’s innovation lies in dynamic, content-adaptive merging and efficient re-expansion, contrasting with traditional static token selection paradigms. Compared to fixed-length methods (e.g., ToMe), DyMU achieves higher accuracy while providing per-image flexibility. This suggests a paradigm shift in VLM design, emphasizing input-dependent computational scaling.
Experimental results on VQA-style benchmarks and ablation studies demonstrate that DyMU’s approximation in VTU incurs only minor accuracy penalties, even with large reductions in token count and floating-point operations. A plausible implication is that future VLM deployments may increasingly favor dynamic key merging architectures for efficiency-critical applications.
No significant controversies are noted; however, ongoing work may investigate the limits of compression for highly complex images and the interaction of variable-length representations with specialized LLM attention mechanisms. The methodology described in DyMU is fully documented in (Wang et al., 23 Apr 2025) and remains a reference point for dynamic token reduction research in the vision-language domain.