Channel-wise Token Concatenation
- Channel-wise token concatenation is a fusion method that concatenates unimodal token features along the channel dimension to preserve individual characteristics while enabling cross-modal updates.
- The technique maintains a constant sequence length with increased feature granularity, facilitating efficient global attention in both vision-language and irregular time series models.
- Empirical studies, such as in Compound Tokens and MTM, demonstrate improved accuracy and reduced runtime, outperforming traditional fusion methods in multiple benchmarks.
Channel-wise token concatenation refers to a class of fusion mechanisms wherein distinct token representations, typically from separate modalities or channels, are combined along their feature/channel dimension rather than the sequence dimension. This operation increases feature granularity per token without expanding the sequence length, yielding improved cross-channel or cross-modal alignment. The approach is prominent in vision-language representation fusion ("Compound Tokens" (Aladago et al., 2022)) and in irregular time series modeling ("MTM: Multi-Scale Token Mixing Transformer" (Zhong et al., 22 Sep 2025)). Channel-wise concatenation offers improved representational capacity and enables more effective global attention over the fused tokens.
1. Formal Definitions and Fusion Equations
Let $N_1$ denote the number of tokens from channel (or modality) 1 and $N_2$ the number from channel 2, with target joint hidden dimension $d$. For channel-wise concatenation, each modality is first linearly projected to half the joint dimension, giving $X_1 \in \mathbb{R}^{N_1 \times d/2}$ and $X_2 \in \mathbb{R}^{N_2 \times d/2}$. Cross-attention modules induce cross-modal updates $\tilde{X}_1 = \mathrm{XAttn}(X_1, X_2)$ and $\tilde{X}_2 = \mathrm{XAttn}(X_2, X_1)$, which are then fused with the original query embeddings by concatenation along the channel dimension:

$$Z_1 = [\,X_1 \,\|\, \tilde{X}_1\,] \in \mathbb{R}^{N_1 \times d}, \qquad Z_2 = [\,X_2 \,\|\, \tilde{X}_2\,] \in \mathbb{R}^{N_2 \times d}.$$

The final fused sequence is concatenated along the token axis:

$$Z = [\,Z_1 ; Z_2\,] \in \mathbb{R}^{(N_1 + N_2) \times d}.$$

In time series settings, channel-wise mixing may instead concatenate pooled features (max and mean) per window $w$:

$$h_w = \big[\, \max_{t \in w} x_t \,\|\, \operatorname{mean}_{t \in w} x_t \,\big].$$

Channel-wise token mixing within a Token Mixing layer further leverages concatenation after pivotal token selection and attention-based context propagation (Zhong et al., 22 Sep 2025).
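The projection-plus-concatenation fusion above can be sketched in a few lines of NumPy. This is a minimal illustration, not either paper's implementation: the single-head attention uses no learned projections, and all shapes and names are assumptions.

```python
import numpy as np

def cross_attention(q, kv):
    """Single-head scaled dot-product cross-attention (no learned
    projections, for illustration only)."""
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)                       # (N_q, N_kv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ kv                                  # (N_q, d)

def channel_wise_fuse(x1, x2):
    """Concatenate each modality's tokens with its cross-modal update along
    the channel axis, then merge the two streams along the token axis."""
    z1 = np.concatenate([x1, cross_attention(x1, x2)], axis=-1)  # (N1, d)
    z2 = np.concatenate([x2, cross_attention(x2, x1)], axis=-1)  # (N2, d)
    return np.concatenate([z1, z2], axis=0)                      # (N1+N2, d)

rng = np.random.default_rng(0)
x1 = rng.standard_normal((5, 32))   # e.g. 5 visual tokens, dim d/2 = 32
x2 = rng.standard_normal((7, 32))   # e.g. 7 text tokens,   dim d/2 = 32
fused = channel_wise_fuse(x1, x2)
print(fused.shape)                  # (12, 64): N1+N2 tokens of dimension d
```

Note that each fused token keeps its original unimodal features in the first half of the channel dimension and the cross-modal update in the second half.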
2. Architectures Incorporating Channel-wise Concatenation
Vision-language fusion models such as Compound Tokens (Aladago et al., 2022) utilize the following stages:
- Image encoder (ResNet-50 or ViT-Base) projects visual features.
- Text encoder (T5-Base) produces text embeddings.
- Two symmetric cross-attention blocks: vision→text, text→vision.
- Channel-wise fusion: concatenation of query features with cross-modal updates.
- Sequence-level merge: fused compound tokens across both modalities concatenated.
- Multimodal transformer encoder with global self-attention.
- Task-specific decoder (auto-regressive T5, classifier for VQA).
For irregular multivariate time series (IMTS), MTM (Zhong et al., 22 Sep 2025) applies:
- Per-channel input embedding with channel and positional encodings.
- Masked concat pooling: windowed max/mean pooling per channel followed by concatenation and projection.
- Multi-block token mixing: temporal attention (per channel), pivotal-token cross-channel filling, second attention pass, and channel-wise self-attention across all channels per timepoint.
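The masked concat pooling stage above can be illustrated with a short sketch. The function name, the non-overlapping windowing scheme, and the zero-fill for fully-missing windows are assumptions made here, not details taken from the MTM paper.

```python
import numpy as np

def masked_concat_pool(x, mask, window):
    """Windowed max+mean pooling over the observed timesteps of one channel,
    with the two pooled vectors concatenated (MTM-style masked concat
    pooling, sketched under assumed conventions).

    x:    (T, d) per-channel token features
    mask: (T,)   1 where a value was observed, 0 where missing
    """
    pooled = []
    for start in range(0, x.shape[0], window):
        xw = x[start:start + window]
        mw = mask[start:start + window].astype(bool)
        if mw.any():
            mx = xw[mw].max(axis=0)       # max over observed steps
            mn = xw[mw].mean(axis=0)      # mean over observed steps
        else:
            mx = mn = np.zeros(x.shape[1])  # fully-missing window: zeros
        pooled.append(np.concatenate([mx, mn]))   # (2*d,)
    return np.stack(pooled)                        # (ceil(T/window), 2*d)

rng = np.random.default_rng(1)
x = rng.standard_normal((10, 4))
mask = rng.integers(0, 2, size=10)
out = masked_concat_pool(x, mask, window=5)
print(out.shape)   # (2, 8): coarser time scale, doubled channel dimension
```

A projection back to the model dimension would follow in practice; it is omitted here to keep the pooling logic visible.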
3. Functional Advantages of Channel-wise Token Concatenation
Channel-wise fusion provides several notable benefits:
- Token-level multimodal alignment: Each token preserves its unimodal signature and an explicit cross-modal residual, enhancing contextual integration.
- Constant sequence length: The number of tokens remains $N_1 + N_2$, with the feature dimension expanded to $d$, avoiding the quadratic cost in token count typical of sequence concatenation.
- Improved gradient flow: Fusion via concatenation foregrounds both original and cross-modal features side by side, empirically outperforming additive or weighted alternatives (Aladago et al., 2022).
- Efficient global attention: After fusion, a standard transformer encoder operates with global self-attention over all compound tokens from both channels.
This suggests channel-wise concatenation is more computationally efficient and supports richer representation learning than sequence-level merging or repeated cross-attention operations.
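A back-of-envelope comparison makes the efficiency claim concrete: self-attention cost grows quadratically in sequence length but only linearly in feature dimension, so widening channels is cheaper than lengthening the sequence at the same total capacity. The token counts below are arbitrary, and the FLOP formula is a rough approximation.

```python
def attn_flops(n_tokens, d):
    """Rough self-attention cost: QK^T plus the weighted value sum,
    ~2 * n^2 * d multiply-accumulates (projections ignored)."""
    return 2 * n_tokens**2 * d

n1, n2, d = 100, 50, 64   # illustrative token counts and joint dimension

# Channel-wise fusion: N1+N2 tokens of dimension d.
channel_wise = attn_flops(n1 + n2, d)

# Alternative at equal total capacity: twice the tokens at half the dim
# (e.g. keeping cross-modal updates as extra tokens).
token_doubling = attn_flops(2 * (n1 + n2), d // 2)

print(channel_wise, token_doubling, token_doubling / channel_wise)
# The token-doubling variant costs 2x despite identical total features.
```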
4. Channel-wise Token Mixing for Irregular Time Series
In MTM (Zhong et al., 22 Sep 2025), channel-wise token mixing addresses cross-channel asynchrony:
- Masked concat pooling down-samples IMTS and pools per-channel tokens via max and mean, concatenated for robust representation at coarser time scales.
- Token mixing layer propagates the most salient per-timepoint tokens (chosen via attention scores from per-channel CLS tokens) across all channels, filling missing positions and enabling channel-attention at each temporal slice.
- Channel-wise attention updates per-timepoint, per-channel features via context aggregation over observed channels.
A plausible implication is that this mechanism reconstructs the joint channel structure at each timepoint even when the input observations are highly asynchronous or largely missing.
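The per-timepoint channel attention described above can be sketched as follows, assuming a simple binary observation mask and no learned projections; function and variable names are illustrative rather than drawn from the paper.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def channel_attention_per_timepoint(x, obs_mask):
    """At each timepoint, every channel attends over the channels observed
    there (simplified sketch of channel-wise attention in MTM).

    x:        (T, C, d) per-timepoint, per-channel features
    obs_mask: (T, C)    1 where channel c is observed at time t
    """
    T, C, d = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        scores = x[t] @ x[t].T / np.sqrt(d)      # (C, C) channel affinities
        scores[:, obs_mask[t] == 0] = -1e9       # mask unobserved channels
        out[t] = softmax(scores) @ x[t]          # aggregate observed context
    return out

rng = np.random.default_rng(2)
x = rng.standard_normal((3, 4, 8))   # T=3 timepoints, C=4 channels, d=8
obs = np.ones((3, 4), dtype=int)
obs[1, 1:] = 0                       # at t=1, only channel 0 is observed
mixed = channel_attention_per_timepoint(x, obs)
print(mixed.shape)                   # (3, 4, 8)
```

With only channel 0 observed at t=1, every channel's update at that timepoint collapses to channel 0's features, which is the missing-position filling behavior the text describes.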
5. Empirical Results and Comparative Analyses
Compound Tokens with channel-wise concatenation demonstrate increased effectiveness across multiple benchmarks (Aladago et al., 2022):
- SNLI-VE: improved accuracy without vision-language pretraining (VLP), rising to 81.49 with VLP.
- GQA: gains without VLP, reaching 80.45 with VLP.
- VQA (open-vocab): +4.18% with VLP.
- Outperforms closed-vocabulary state of the art on both SNLI-VE and GQA.
Fusion ablations indicate that channel-wise concatenation consistently beats learnable weighting, summation, and element-wise product under equal computational budgets.
MTM (Zhong et al., 22 Sep 2025) improves AUPRC on IMTS classification benchmarks (up to +3.8%) while reducing runtime by 20–50% relative to previous Transformer approaches; this substantiates direct performance and efficiency gains attributable to channel-wise pooling and mixing.
| Method | Task/Benchmark | Relative Gain |
|---|---|---|
| Compound Tokens | SNLI-VE | +2.26% |
| Compound Tokens | GQA | +8.83% |
| Compound Tokens | VQA-Open | +4.18% |
| MTM | IMTS AUPRC | +3.8% |
6. Limitations and Open Questions
Identified constraints and future directions include:
- Decoder sensitivity: Open-vocab VQA performance with encoder-decoder structures lags; switching to encoder-only classification is required (Aladago et al., 2022).
- Fixed split design: Halving input channel dimensions prior to fusion may restrict representational expressiveness; dynamic channel splitting or gating is a potential extension.
- Single-step fusion: The current protocol utilizes two cross-attention modules only; iterative or multi-layered channel-wise fusion remains unexplored.
- Generalization to additional modalities: The procedure could be extended to domains beyond vision and language (audio, 3D), likely necessitating new cross-attention scheduling schemes.
- Fusion layer depth: The impact of performing channel-wise fusion at varying depths/stages of the transformer stack has not been systematically studied.
These open questions highlight the developmental trajectory for channel-wise token concatenation, particularly regarding adaptability, scalability, and modality extensibility.
7. Contextualization Across Modalities and Model Classes
While channel-wise token concatenation was introduced for vision-language fusion (Aladago et al., 2022), related concepts underpin new architectures for time series modeling (Zhong et al., 22 Sep 2025). The underlying principle—fusing unimodal and cross-modal features as contiguous per-token vectors—transcends specific model types. This suggests broader applicability to multi-modal sequence processing tasks, including potential cross-domain and hierarchical fusion architectures.
The approach’s empirical success in both vision-language QA and IMTS classification indicates that channel-wise concatenation unlocks superior token-level alignment and robust learning under both regular and irregular sequence observation constraints.