
Input-Dependent Scale-Mixer

Updated 5 January 2026
  • Input-Dependent Scale-Mixer is an adaptive mechanism that generates input-specific convolution kernels and attention weights to enhance local and global feature aggregation.
  • The Dual Dynamic Token Mixer (D-Mixer) architecture fuses overlapping spatial reduction attention with input-dependent depthwise convolutions for improved representational capacity.
  • Empirical evaluations on benchmarks like ImageNet-1K demonstrate that balanced mixing delivers competitive accuracy with reduced computational overhead compared to traditional static aggregators.

An Input-Dependent Scale-Mixer is an architectural component designed to dynamically modulate feature aggregation operations in response to the specific input instance. Rather than utilizing fixed aggregation weights or convolutional kernels, Input-Dependent Scale-Mixers generate weights or filters adaptively based on the global or local context of the current input, enabling greater representational capacity and adaptability. The Dual Dynamic Token Mixer (D-Mixer) in TransXNet exemplifies this paradigm by fusing input-dependent depthwise convolutions with global attention mechanisms, creating a hybrid that captures both local, context-sensitive structures and non-local dependencies for high-performance visual recognition tasks (Lou et al., 2023).

1. Motivations and Limitations of Static Aggregators

The standard convolution operation in deep visual recognition networks provides a fixed, input-invariant local feature aggregation, determined solely by learned filter weights. In contrast, attention-based mechanisms compute input-dependent weights, enabling dynamic adaptation to the spatial or semantic context of each input. Fusing static convolutions with dynamic self-attention creates a representational mismatch that inhibits effective integration of previously attended features. Specifically, when stacking such hybrid token mixers, static convolution kernels cannot adapt to the global context captured by preceding self-attention layers, diminishing overall expressiveness (Lou et al., 2023). The Input-Dependent Scale-Mixer approach addresses this limitation by enabling both aggregation branches (local and global) to respond adaptively to the current input, allowing more effective feature fusion.

2. Dual Dynamic Token Mixer (D-Mixer) Architecture

D-Mixer operates on a feature-map tensor $X \in \mathbb{R}^{C\times H\times W}$ and splits it channel-wise into two equal partitions, $X_1$ and $X_2$, for parallel global and local dynamic processing:

  • Branch A (Global mixer): Employs the Overlapping Spatial Reduction Attention (OSRA) module. OSRA compresses spatial tokens using a strided, overlapping depthwise convolution, projects features to queries, keys, and values through linear layers, applies scaled dot-product attention with a relative positional bias, and aggregates results via softmax-weighted summation.
  • Branch B (Local mixer): Utilizes Input-dependent Depthwise Convolution (IDConv), dynamically generating a unique $K\times K$ kernel for each channel. The kernel weights are a function of the pooled context of $X_2$, computed through a sequence of pointwise convolutions and softmax-normalized group-wise weighting of a set of learnable basis kernels.
  • Fusion: The outputs of the two branches, $Y_1$ (global) and $Y_2$ (local), are concatenated along the channel axis and may be further refined by a lightweight Squeezed Token Enhancer (STE).

This split-mix-refine workflow endows the network with input dependency across both aggregation paths, balancing local inductive bias and global receptive field (Lou et al., 2023).
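The split-mix-refine workflow can be sketched in PyTorch. This is an illustrative skeleton only: the `DMixerSketch` name is ours, and the two branch modules are simple placeholder convolutions standing in for OSRA and IDConv so the channel split and fusion logic can be shown end to end.

```python
import torch
import torch.nn as nn

class DMixerSketch(nn.Module):
    """Sketch of the D-Mixer split-mix-refine workflow.

    The branch modules below are placeholders (assumption): in the
    actual D-Mixer they are OSRA attention and input-dependent
    depthwise convolution, respectively.
    """
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0, "channels are split into two equal halves"
        half = channels // 2
        self.global_mixer = nn.Conv2d(half, half, kernel_size=1)
        self.local_mixer = nn.Conv2d(half, half, kernel_size=3,
                                     padding=1, groups=half)
        # Lightweight 1x1 fusion standing in for the Squeezed Token Enhancer.
        self.ste = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.chunk(x, 2, dim=1)   # channel-wise split
        y1 = self.global_mixer(x1)          # global (attention) path
        y2 = self.local_mixer(x2)           # local (dynamic conv) path
        z = torch.cat([y1, y2], dim=1)      # fuse along channels
        return self.ste(z)
```

The key structural point is that both halves see the same spatial resolution, so concatenation restores the original $C\times H\times W$ shape before refinement.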

3. Mathematical Formalization of Input-Dependent Mixing

3.1 IDConv Kernel Synthesis

Given $X_2 \in \mathbb{R}^{C'\times H\times W}$ (with $C' = \tfrac{C}{2}$), the IDConv pipeline proceeds:

  1. Context pooling: $u = \mathrm{AdaptivePool}_{K\times K}(X_2) \in \mathbb{R}^{C'\times K\times K}$, compressing the spatial context of $X_2$ to the kernel resolution.
  2. Projection: $z = \mathrm{Conv}_{1\times 1}^{C'\to C'/r}(u)$.
  3. Grouped activation: $A' = \mathrm{Conv}_{1\times 1}^{C'/r\to G\,C'}(z) \in \mathbb{R}^{(G\,C')\times K\times K}$, reshaped to $\hat{A} \in \mathbb{R}^{G\times C'\times K^2}$.
  4. Softmax normalization over the group axis:

$$A_{i,c,:} = \frac{\exp(\hat{A}_{i,c,:})}{\sum_{j=1}^{G} \exp(\hat{A}_{j,c,:})}, \quad A \in \mathbb{R}^{G\times C'\times K^2}$$

  5. Synthesis of per-channel kernels via an element-wise weighted sum over $G$ learnable basis kernels $P \in \mathbb{R}^{G\times C'\times K^2}$:

$$W_{c,:} = \sum_{i=1}^{G} A_{i,c,:} \odot P_{i,c,:}, \quad W \in \mathbb{R}^{C'\times K^2}$$

  6. Depthwise convolution with the synthesized kernels: $Y_2 = \mathrm{DWConv}(X_2; W)$.
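The kernel-synthesis steps above can be sketched as a PyTorch module. `IDConvSketch` and its internal choices (ReLU between the projections, initialization scale) are assumptions for illustration; per-sample depthwise convolution is realized by the standard trick of folding the batch dimension into the convolution groups.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDConvSketch(nn.Module):
    """Sketch of input-dependent depthwise convolution (IDConv)."""
    def __init__(self, channels: int, K: int = 7, groups: int = 4, r: int = 4):
        super().__init__()
        self.C, self.K, self.G = channels, K, groups
        self.pool = nn.AdaptiveAvgPool2d(K)                  # context -> K x K
        self.reduce = nn.Conv2d(channels, channels // r, 1)  # step 2
        self.expand = nn.Conv2d(channels // r, groups * channels, 1)  # step 3
        # G learnable basis kernels P in R^{G x C x K^2}
        self.P = nn.Parameter(torch.randn(groups, channels, K * K) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        a = self.expand(F.relu(self.reduce(self.pool(x))))   # (B, G*C, K, K)
        a = a.reshape(B, self.G, C, self.K * self.K)
        A = a.softmax(dim=1)                 # step 4: softmax over groups
        # Step 5: per-channel kernels as weighted sum of basis kernels
        W_dyn = (A * self.P).sum(dim=1)      # (B, C, K^2)
        W_dyn = W_dyn.reshape(B * C, 1, self.K, self.K)
        # Step 6: per-sample depthwise conv via batch folding (groups = B*C)
        y = F.conv2d(x.reshape(1, B * C, H, W), W_dyn,
                     padding=self.K // 2, groups=B * C)
        return y.reshape(B, C, H, W)
```

Because the softmax is taken over the group axis, each channel's synthesized kernel is a convex combination of its $G$ basis kernels, keeping the dynamic weights bounded.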

3.2 OSRA Attention Computation

For $X_1 \in \mathbb{R}^{C'\times H\times W}$:

  1. Spatial reduction via a strided, overlapping DWConv: $Y_r = \mathrm{OSR}(X_1)$.
  2. Linear refinement and projection: $Q = W_q X_1$, $[K, V] = \mathrm{Split}(W_{kv}(Y_r + \mathrm{LR}(Y_r)))$.
  3. Attention allocation with relative position bias:

$$A = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right), \quad Y_1 = AV$$

  4. Final concatenation and enhancement:

$$Z = \mathrm{Concat}(Y_1, Y_2) \in \mathbb{R}^{C\times H\times W}, \quad Y = \mathrm{STE}(Z)$$

This scheme implements input-dependent mixing in both aggregation paths, with each branch's scale-mixer weights modulated by the sample-specific context.
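A simplified sketch of the OSRA computation is below. This is not the exact TransXNet module: the overlapping reduction is modelled as a strided depthwise convolution whose kernel exceeds its stride, and the relative position bias $B$ and local refinement $\mathrm{LR}$ are omitted for brevity (assumptions for illustration).

```python
import torch
import torch.nn as nn

class OSRASketch(nn.Module):
    """Simplified overlapping spatial-reduction attention.

    Assumptions: kernel_size = stride + 3 gives overlapping reduction
    windows; position bias and local refinement are omitted."""
    def __init__(self, dim: int, heads: int = 4, stride: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.osr = nn.Conv2d(dim, dim, kernel_size=stride + 3, stride=stride,
                             padding=(stride + 3) // 2, groups=dim)
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        q = self.q(x.flatten(2).transpose(1, 2))     # queries from all tokens
        xr = self.osr(x).flatten(2).transpose(1, 2)  # reduced token set
        k, v = self.kv(xr).chunk(2, dim=-1)          # keys/values from it

        def split(t):  # (B, N, C) -> (B, heads, N, C/heads)
            return t.reshape(B, -1, self.heads, C // self.heads).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, H * W, C)
        return self.proj(y).transpose(1, 2).reshape(B, C, H, W)
```

The design point is that attention cost scales with the reduced token count rather than $HW$, while queries retain full resolution.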

4. Architectural Hyperparameters and Flexibility

The effectiveness and computational profile of the Input-Dependent Scale-Mixer are governed by several key hyperparameters:

| Hyperparameter | Typical values | Role |
|---|---|---|
| Split ratio | 1:1 | Equal channel allocation between branches |
| Kernel size $K$ | $7\times 7$ | Local receptive field size |
| Attention groups $G$ | $2$ (tiny) to $4$ (deep) | IDConv flexibility/cost |
| Reduction ratio $r$ | $4$–$8$ | MLP size for context compression |
| OSRA stride $S$ | variable | Controls spatial reduction |
| Number of heads $H$ | variable | Attention granularity |

Modifying these parameters mediates the trade-off between inductive bias, computational efficiency (FLOPs), parameter count, and effective receptive field (Lou et al., 2023).
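As a rough illustration of how $G$, $r$, and $K$ enter the parameter budget of the local branch, one can count the kernel-generating parameters against a static depthwise convolution. This is a back-of-envelope sketch under the layer layout of Section 3.1; biases, norms, and the attention branch are ignored.

```python
def static_dwconv_params(C: int, K: int) -> int:
    # A static depthwise conv learns one fixed K x K filter per channel.
    return C * K * K

def idconv_params(C: int, K: int, G: int, r: int) -> int:
    # Kernel-generating parameters only (assumption: biases/norms omitted).
    reduce_proj = C * (C // r)        # 1x1 conv, C -> C/r
    expand_proj = (C // r) * (G * C)  # 1x1 conv, C/r -> G*C
    basis = G * C * K * K             # G learnable basis kernels
    return reduce_proj + expand_proj + basis

# Example: C = 96 channels, K = 7, r = 4, varying the group count G.
for G in (2, 4):
    ratio = idconv_params(96, 7, G, r=4) / static_dwconv_params(96, 7)
    print(f"G={G}: IDConv/static parameter ratio = {ratio:.2f}")
```

Raising $G$ grows both the expansion projection and the basis-kernel bank linearly, while raising $r$ shrinks only the generator MLP, which is why the table pairs small $G$ with the tiny variants.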

5. Empirical Performance and Ablation Observations

Experimental benchmarks on the ImageNet-1K dataset demonstrate the efficiency and accuracy improvements conferred by Input-Dependent Scale-Mixing:

  • TransXNet-T (D-Mixer): 81.6% top-1 (1.8 GFLOPs, 12.8M params) vs. Swin-T: 81.3% (4.5 GFLOPs, 29M params)
  • TransXNet-S: 83.8% top-1 (4.5 GFLOPs, 26.9M) vs. InternImage-T: 83.5% (5.0 GFLOPs, 30M)
  • TransXNet-B: 84.6% top-1 (8.3 GFLOPs, 48M) (Lou et al., 2023)

Ablation analyses reveal:

  • Static DWConv: 80.3% top-1
  • DyConv: 80.7%
  • D-DWConv: 80.9%
  • IDConv: 80.9% (smaller parameter overhead, larger receptive field)

Comparison of mixer variants (equal cost regime) shows D-Mixer matches or exceeds alternatives at lower computational overhead:

  • DWConv-only: 76.9%
  • SRA-only: 77.4%
  • Swin-window: 78.2%
  • MixFormer: 78.9%
  • ACmix: 79.0%
  • D-Mixer: 79.0% (1.6 GFLOPs vs 2.0–2.1 GFLOPs) (Lou et al., 2023)

Empirically, a balanced (50:50) split between local and global branches yields optimal accuracy; allocating more channels to either attention or convolution produces diminishing returns.

6. Interpretability, Generalization, and Broader Context

Input-Dependent Scale-Mixers leverage input-conditioned modulation for feature aggregation, supporting both interpretability and adaptation to varying contexts. While D-Mixer is primarily instantiated for visual recognition, the input-dependent conception extends naturally to mutual-information-driven feature reduction in nonlinear, data-based physical modeling (Beneddine, 2022). In FeDis, the input-reduction mapping $E_A$ is optimized to capture nonlinear dependencies via mutual information, leading to interpretable exponents and analytic scaling laws. A plausible implication is that input-dependent scale-mixing strategies generalize across domains, unifying principles of dynamic aggregation for both vision and scientific modeling tasks.

The Input-Dependent Scale-Mixer paradigm is characterized by its dual strengths: (a) dynamically adapting local aggregation kernels to the specific input context, and (b) integrating non-local, globally attended features at every stage, resulting in enlarged receptive fields, strong task-specific inductive biases, and favorable computational scaling. This suggests future directions leveraging richer forms of input-dependent mixing (e.g., hierarchical or multi-scale modulation), potentially accelerating adaptation and generalization in deep learning architectures.
