Multi-Patch-One-Token Projection
- Multi-Patch-One-Token projection is a method that compresses large sets of input tokens into fewer summary tokens, enabling efficient processing in deep networks.
- It employs techniques like learned projection, cosine similarity fusion, and entropy-based patchification across vision, language, and time series applications.
- Empirical results demonstrate significant reductions in FLOPs and runtime with minimal performance loss, supporting scalable deployment in multimodal models.
A Multi-Patch-One-Token projection is a general class of architectural modules that reduce a set of input tokens or patches—often representing spatial, temporal, or modality-specific partitions—into summary tokens. This projection compresses local or redundant information into a more compact representation, enabling resource-efficient processing in modern deep networks, especially Transformers. Instantiations have emerged across vision transformers, multimodal LLMs, autoregressive generative models, time series models, and multiple instance learning, each adapting the core concept to domain-specific constraints and architectures.
1. Mathematical Formulations of Multi-Patch-One-Token Modules
The essential operation in Multi-Patch-One-Token projection is to map an incoming set of tokens to a reduced set through a mechanism that preserves salient information. Prominent instantiations include:
- PatchMerger (Vision Transformer): Uses an end-to-end learned projection matrix $W \in \mathbb{R}^{M \times D}$ of $M$ query vectors. For each input token $x_i \in \mathbb{R}^D$, compute unnormalized similarity scores against each of the $M$ learned queries, normalize via row-wise softmax, and aggregate:
$$Y = \mathrm{softmax}(W X^\top)\, X, \qquad X \in \mathbb{R}^{N \times D},\ Y \in \mathbb{R}^{M \times D}.$$
Each output token is a global, data-adaptive weighted sum of all input tokens (Renggli et al., 2022).
- ToFu (Token Fusion for LMMs): Sequentially scans input visual tokens, measuring cosine similarity to existing merged tokens and fusing similar ones via a weighted average, using a data- and token-count-adaptive threshold $\tau$. The clustering operation is:
$$c_k \leftarrow \frac{w_k\, c_k + x_i}{w_k + 1}, \qquad w_k \leftarrow w_k + 1,$$
where $x_i$ is the incoming token, $c_k$ is the nearest (in cosine similarity) cluster centroid, and $w_k$ is its cumulative weight (Pippi et al., 6 Mar 2025).
- Next Patch Prediction (Autoregressive Generation): Non-overlapping, contiguous patches of tokens are grouped, and their embeddings averaged. No additional weights are introduced; each patch embedding is:
$$p_i = \frac{1}{K} \sum_{j=1}^{K} x_{(i-1)K + j},$$
where $K$ is the patch size (Pang et al., 2024).
- Patch Selector Summarization (Multi-Label Incremental Learning): For each of $M$ "Patch Selector" vectors $q_m$, perform softmax attention over all input tokens and aggregate via weighted sum:
$$s_m = \sum_{i=1}^{N} \alpha_{m,i}\, x_i, \qquad \text{with } \alpha_{m,i} = \frac{\exp(q_m^\top x_i)}{\sum_{j=1}^{N} \exp(q_m^\top x_j)}$$
(Min et al., 2024).
- CAPRMIL (Context-Aware Patch Representations, MIL): Soft-clusters patch embeddings into $M$ global tokens using a head-specific projection, a per-head learnable temperature $\tau_h$, and weighted averaging:
$$g_h = \sum_{i=1}^{N} a_{h,i}\, x_i,$$
where the $a_{h,i}$ are softmax-normalized (temperature-scaled) cluster assignment weights (Lolos et al., 16 Dec 2025).
- DPAR (Dynamic Patchification, Autoregressive Decoding): Contiguous, dynamically-sized patch boundaries are predicted via local per-token entropy, and each patch's tokens are merged using cross-attention between a "patch query" and the set of constituent token embeddings (Srivastava et al., 26 Dec 2025).
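As a concrete illustration, the PatchMerger mapping above can be sketched in a few lines of NumPy. The notation is generic (the actual module is trained end-to-end inside a ViT, with `W` a learned parameter):

```python
import numpy as np

def patch_merger(X, W):
    """Merge N input tokens into M output tokens (PatchMerger-style).

    X : (N, D) input token embeddings
    W : (M, D) learned query/projection matrix
    Returns (M, D): each output token is a softmax-weighted sum
    of all input tokens.
    """
    scores = W @ X.T                             # (M, N) unnormalized similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # row-wise softmax over tokens
    return A @ X                                 # (M, D) merged tokens

rng = np.random.default_rng(0)
X = rng.normal(size=(196, 64))  # e.g. 14x14 grid of ViT patch embeddings
W = rng.normal(size=(8, 64))    # 8 learned queries
Y = patch_merger(X, W)
print(Y.shape)  # (8, 64)
```

Note that with identical (e.g. zero) query rows, every output token degenerates to the mean of the inputs; the learned queries are what make the summary data-adaptive.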
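ToFu's sequential fusion rule can likewise be sketched. For brevity the paper's data- and token-count-adaptive threshold is replaced here by a fixed `tau`:

```python
import numpy as np

def tofu_merge(tokens, tau=0.9):
    """Sequentially fuse tokens whose cosine similarity to an existing
    merged token exceeds tau (ToFu-style sketch with a fixed threshold).

    tokens : (N, D) array; returns (M, D) with M <= N.
    """
    clusters, weights = [], []
    for x in tokens:
        if clusters:
            C = np.stack(clusters)
            sims = (C @ x) / (np.linalg.norm(C, axis=1) * np.linalg.norm(x) + 1e-8)
            k = int(np.argmax(sims))
            if sims[k] >= tau:
                # running weighted average: c_k <- (w_k * c_k + x) / (w_k + 1)
                clusters[k] = (weights[k] * clusters[k] + x) / (weights[k] + 1)
                weights[k] += 1
                continue
        clusters.append(x.astype(float))  # start a new cluster
        weights.append(1)
    return np.stack(clusters)

rng = np.random.default_rng(1)
base = rng.normal(size=(4, 32))                 # 4 distinct "concepts"
tokens = np.repeat(base, 8, axis=0) + 0.01 * rng.normal(size=(32, 32))
merged = tofu_merge(tokens, tau=0.9)
print(len(tokens), "->", len(merged))  # 32 -> 4
```

Because fusion is greedy and order-dependent, the result can differ from a global clustering of the same tokens, which is exactly the trade-off noted in the limitations below.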
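NPP's parameter-free patch averaging reduces to a reshape and a mean:

```python
import numpy as np

def patchify_mean(x, K):
    """Group a token sequence into contiguous patches of size K and
    average each patch (Next Patch Prediction-style; no extra weights).

    x : (N, D) with N divisible by K; returns (N // K, D).
    """
    N, D = x.shape
    assert N % K == 0, "sequence length must be divisible by the patch size"
    return x.reshape(N // K, K, D).mean(axis=1)

x = np.arange(12, dtype=float).reshape(6, 2)  # 6 tokens, dim 2
print(patchify_mean(x, K=2))
# [[ 1.  2.]
#  [ 5.  6.]
#  [ 9. 10.]]
```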
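DPAR's entropy-driven patchification can be approximated as follows. The paper merges each patch via cross-attention with a patch query; that step is simplified to mean pooling here, so this is only a sketch of the boundary rule:

```python
import numpy as np

def entropy_patchify(tokens, entropy, thresh):
    """Split a token sequence into variable-length contiguous patches,
    opening a new patch whenever per-token predictive entropy exceeds
    a threshold (DPAR-style boundary rule, mean-pooled merge).

    tokens  : (N, D) embeddings
    entropy : (N,) per-token entropies from a pretrained AR model
    Returns a list of merged patch embeddings.
    """
    patches, start = [], 0
    for i in range(1, len(tokens)):
        if entropy[i] > thresh:          # high surprise -> patch boundary
            patches.append(tokens[start:i].mean(axis=0))
            start = i
    patches.append(tokens[start:].mean(axis=0))
    return patches

rng = np.random.default_rng(2)
tokens = rng.normal(size=(10, 4))
entropy = np.array([0.1, 0.2, 1.5, 0.3, 0.2, 1.8, 0.1, 0.1, 0.2, 1.6])
patches = entropy_patchify(tokens, entropy, thresh=1.0)
print(len(patches))  # boundaries before tokens 2, 5, 9 -> 4 patches
```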
2. Architectural Integration and Workflow
Multi-Patch-One-Token projection modules are introduced at different network stages, depending on architectural constraints and application:
- Mid-Transformer: PatchMerger is inserted between Transformer encoder blocks, typically near the middle of the network. The token-count transition from $N$ to $M$ (with $M \ll N$) reduces the computational burden on all subsequent blocks (Renggli et al., 2022).
- Post-Adapter/Encoder: ToFu is agnostic to the visual encoder backbone and operates after the visual adapter, prior to fusion with text or downstream LLM modules (Pippi et al., 6 Mar 2025).
- Input Preprocessing and Gradual Annealing: Next Patch Prediction applies patch averaging before the Transformer backbone, annealing the patch size down to 1 by the end of training to maintain compatibility with standard token-level autoregressive inference (Pang et al., 2024).
- Per-Layer Summarization: Multi-label class-incremental learning models apply summarization at every transformer block, allowing dynamic, per-task token reduction and focusing of "patch selectors" to semantic or region-specific content (Min et al., 2024).
- MIL Bag Aggregation: In MIL, all slide patches are projected to a compact global representation, enabling scalable attention modules in whole-slide learning (Lolos et al., 16 Dec 2025).
- Auxiliary Token Compression: Time series models extract patches at multiple resolutions and perform per-resolution projection prior to concatenating other static/covariate tokens (Peršak et al., 2024).
- Dynamic Patchification: DPAR introduces a patchification stage where token-level entropy from a pretrained autoregressive model determines the patch boundaries, resulting in a variable-length sequence of aggregated embeddings for the Transformer (Srivastava et al., 26 Dec 2025).
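The coarse-to-fine annealing used by NPP can be expressed as a staged schedule. The stage boundaries below are hypothetical and only illustrate the mechanism of stepping the patch size down to 1:

```python
def patch_size_schedule(step, total_steps, stages=(4, 2, 1)):
    """Hypothetical coarse-to-fine annealing: split training into equal
    phases and step the patch size down so that the final model matches
    standard token-level autoregressive inference. (The actual NPP
    schedule may differ; this only illustrates the idea.)
    """
    phase = min(int(step / total_steps * len(stages)), len(stages) - 1)
    return stages[phase]

for s in (0, 450, 899):
    print(s, patch_size_schedule(s, 900))  # 4 at the start, 2 mid-run, 1 at the end
```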
3. Computational Complexity and Efficiency Gains
Multi-Patch-One-Token modules principally reduce computational complexity in self-attention blocks by lowering sequence length:
| Method | Input Tokens (N) | Output Tokens (M) | Key FLOPs Reduction Features |
|---|---|---|---|
| PatchMerger | N (fixed patch grid) | M learned queries (M ≪ N) | Drops per-layer self-attention from O(N²) to O(M²) post-merger; 48% FLOPs, 36% runtime saved in large ViT (Renggli et al., 2022) |
| ToFu | N visual tokens | data-dependent (≈0.41N reported) | Reduces visual tokens by 59%, halves attention FLOPs, cuts GPU memory by 66% in LMMs (Pippi et al., 6 Mar 2025) |
| NPP | N | N/K during training (K annealed to 1) | Sequences shortened during coarse-to-fine training; up to 40% cost reduction, zero overhead at inference (Pang et al., 2024) |
| MULTI-LANE | N | M per task | Trades quadratic ViT attention for O(NM) summarization, enabling per-task pathways without quadratic cost explosion (Min et al., 2024) |
| CAPRMIL | N (up to hundreds of thousands of patches) | a few global tokens | Reduces attention cost from quadratic in N to linear in N; cuts parameters by 48–92.8%, FLOPs by 52–99% (Lolos et al., 16 Dec 2025) |
| DPAR | N | ≈N/2 (dynamic) | Average patch length up to 2, roughly halving token count; 40% FLOPs savings (Srivastava et al., 26 Dec 2025) |
| Multiple-Resolution | depends on resolution | per-resolution summary tokens | Summarizes temporal sequences at many scales, compresses input for efficient forecasting (Peršak et al., 2024) |
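These quadratic savings follow directly from the attention cost model; a back-of-the-envelope check (projections and MLPs ignored, so the constants are only indicative):

```python
def attention_flops(n_tokens, dim):
    """Rough per-layer self-attention cost: the QK^T score matrix plus
    the attention-weighted value sum, each ~ n^2 * d multiply-adds
    (QKV projections and the MLP are omitted)."""
    return 2 * n_tokens**2 * dim

N, M, d = 1024, 256, 768  # illustrative sizes: 4x token reduction
before, after = attention_flops(N, d), attention_flops(M, d)
print(f"quadratic-term reduction: {before / after:.0f}x")  # 16x
```

A 4x reduction in token count thus yields a 16x reduction in the quadratic attention term, which is why end-to-end savings of 40–50% FLOPs are attainable even though embedding and MLP costs shrink only linearly.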
4. Empirical Results and Key Ablations
Rigorous empirical assessment confirms that Multi-Patch-One-Token projection delivers substantial efficiency gains without compromising predictive performance:
- PatchMerger: A ViT-H/14 model with PatchMerger matches the full model's JFT Prec@1 (56.56%) and its few-shot ImageNet-1K transfer score at 51.6% of the FLOPs; fine-tuned top-1 accuracy is 87.90% vs. 87.97% for the baseline (Renggli et al., 2022). Placement must not be too early; the best trade-off is near mid-network.
- ToFu: Reducing token count by 59% yields accuracy gains over both no reduction and random sampling in multi-image LMM tasks (e.g., 35.31% accuracy with ToFu vs. 33.79% baseline) (Pippi et al., 6 Mar 2025).
- NPP: Reduces training cost to roughly 0.6× with a 1.0 FID improvement on ImageNet 256×256; retains inference compatibility by annealing the patch size to 1 (Pang et al., 2024).
- MULTI-LANE: Achieves a 3–5 pp final mAP improvement over other rehearsal-free incremental learning baselines, matches or surpasses rehearsal-based methods with minimal added parameters, and saturates performance at around 20 summarized tokens (Min et al., 2024).
- CAPRMIL: Slide-level MIL on pathology yields AUC within 1 standard deviation of SOTA with a 2–30× parameter/FLOPs reduction; for example, AUC 0.975 on CAMELYON16 at 0.314M params vs. 0.987 at 0.66M for ABMIL (Lolos et al., 16 Dec 2025).
- DPAR: On ImageNet, shortens token sequences at both 256px and 384px resolutions, achieving up to 40% FLOPs savings and reducing FID by up to 27.1% compared to token-level baselines (Srivastava et al., 26 Dec 2025).
5. Domain-Specific Extensions and Limitations
Multi-Patch-One-Token projections have been customized for multiple domains:
- Autoregressive Generation: NPP and DPAR demonstrate token grouping strategies for efficient or adaptive image synthesis, leveraging deterministic (averaging) or data-driven (entropy-threshold) patchification, maintaining compatibility with generative APIs (Pang et al., 2024, Srivastava et al., 26 Dec 2025).
- Vision-Language and Multimodal LLMs: ToFu aggressively reduces visual tokens from multiple images or high-res data for efficient fusion with language, outperforming random selection and other token dropping methods (Pippi et al., 6 Mar 2025).
- Class-Incremental and Task-Prompted Learning: MULTI-LANE uses per-task patch summarization and small prompted heads, eliminating the combinatorial explosion of separate pathways while retaining task-specific focus (Min et al., 2024).
- Computational Pathology (MIL): CAPRMIL outperforms complex attention-based pooling aggregators by projecting hundreds of thousands of instance patches into a handful of context-enriched summary tokens before mean aggregation (Lolos et al., 16 Dec 2025).
- Forecasting: Multiple-Resolution patching in time series allows transformers to jointly leverage information at both fine and coarse temporal resolutions (Peršak et al., 2024).
Limitations observed include:
- Overaggressive patchification can underfit by discarding discriminative details (e.g., too few summary tokens or too large a patch size).
- Sequential token fusion (e.g., ToFu) is suboptimal compared to global clustering but is computationally simpler.
- Patch fusion designs may not preserve precise positional locality; some domain-specific tasks require custom integration of positional encodings.
- Non-differentiable grouping rules (e.g., entropy-based) do not propagate gradients; learned projections avoid this issue.
- ToFu’s threshold $\tau$ is empirically set, and inappropriate values can lead to under- or over-fusion.
6. Generalization, Practical Considerations, and Future Extensions
Multi-Patch-One-Token projection modules are encoder- and architecture-agnostic in many variants. They are compatible with any transformer-based model where quadratic scaling in token length is a bottleneck. Several designs permit training-free or plug-and-play integration (e.g., ToFu, NPP).
Practical deployment guidelines:
- Place the module at mid-network or immediately prior to costly quadratic-complexity layers for maximal efficiency gain.
- In multimodal or multi-task contexts, per-task selectors or fusion windows are effective for modular compression.
- When domain structure is present (grid, semantic grouping), hybrid selectors or initialization can bootstrap projection.
- Additional extensions proposed include learnable distinctiveness thresholds (ToFu), hierarchical or multi-stage merging, localized fusion windows, and combining token fusion with dynamic token dropping.
Future work includes more adaptive token fusion policies, joint text–vision fusion, and integration of dynamic selection criteria into the network’s learning signal.
In summary, Multi-Patch-One-Token projections provide a versatile, resource-efficient framework for compression and abstraction in high-dimensional transformer workloads, preserving or improving task accuracy while substantially reducing hardware and runtime demands across modalities and domains (Renggli et al., 2022, Pippi et al., 6 Mar 2025, Pang et al., 2024, Min et al., 2024, Peršak et al., 2024, Lolos et al., 16 Dec 2025, Srivastava et al., 26 Dec 2025).