Multi-Branch Selective Kernel Module
- Multi-Branch SK Module is a convolutional neural network block that dynamically fuses branches with varied receptive fields to capture multi-scale features efficiently.
- It employs a split-transform-fuse-select workflow using global pooling and softmax attention to adaptively weight convolution outputs per channel.
- Empirical studies show SK modules improve accuracy in image classification, segmentation, and speaker verification while maintaining low computational overhead.
A Multi-Branch Selective Kernel (SK) Module is a convolutional neural network (CNN) building block that adaptively combines multiple filter branches with different receptive fields via attention-based selection. Each output spatial location or channel can dynamically leverage different spatial contexts, enabling more effective multi-scale pattern extraction at minimal computational overhead. SK modules generalize plain convolution by introducing a split-transform-fuse-select workflow, where multiple parallel convolutions (of varying kernel/dilation) are fused via attention mechanisms—often implemented as global average pooling, multi-layer perceptrons, and softmax weighting. The selective kernel paradigm is extended and specialized in variants including deformable convolutions, asymmetric convolutions, and frequency/scale-aware gating.
1. Architectural Principles
A prototypical Multi-Branch SK Module consists of the following stages:
- Split: The input feature map $X \in \mathbb{R}^{H \times W \times C}$ is fed to $M$ parallel convolutional branches with distinct receptive field properties. The branches employ kernel size variation (e.g., $3 \times 3$ vs. $5 \times 5$), dilation, or other kernel generalizations (e.g., deformable or asymmetric kernels) (Li et al., 2019, Li et al., 2022, Zeng et al., 19 Jan 2026).
- Fuse: Branch outputs are fused, most commonly via element-wise summation but occasionally by concatenation (Fu et al., 2024).
- Global Descriptor: Through global average pooling, the fused representation is summarized as a descriptor $s \in \mathbb{R}^{C}$ (typically a channel-wise vector).
- Attention Generation: One or more fully connected layers, with optional nonlinearity and normalization (ReLU, BN), process the pooled descriptor $s$ to produce branch-wise attention logits. A softmax layer normalizes these scores across branches to yield a convex combination for each channel (Li et al., 2019).
- Select: Weighted summation of the branches, modulated by attention weights $a^{(m)}$, forms the module output: $V = \sum_{m=1}^{M} a^{(m)} \odot U^{(m)}$, where $a^{(m)} \in \mathbb{R}^{C}$ is broadcast spatially if computed per-channel.
Variants tailor this process by introducing deformable convolutions (FSKNet (Li et al., 2022)), asymmetric convolution (SKANet (Zeng et al., 19 Jan 2026)), or complex pooling/gating strategies (frequency-aware selection (Mun et al., 2022)).
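The fuse-select stages above can be sketched compactly in NumPy. This is a minimal illustration, not a reference implementation: the branch convolutions are assumed to have run already (their outputs are passed in), the weight shapes are hypothetical, and batch normalization is omitted for brevity.

```python
import numpy as np

def sk_select(branches, W_reduce, W_expand):
    """Fuse-select stages of an SK module (NumPy sketch).

    branches : list of M arrays, each (C, H, W) -- outputs of the
               parallel convolution branches (split/transform assumed done).
    W_reduce : (d, C) bottleneck weights for the global descriptor.
    W_expand : (M, C, d) per-branch weights producing attention logits.
    Returns (V, a): selected map V of shape (C, H, W), attention a of (M, C).
    """
    U = np.sum(branches, axis=0)                  # fuse: element-wise sum
    s = U.mean(axis=(1, 2))                       # global average pooling -> (C,)
    z = np.maximum(W_reduce @ s, 0.0)             # bottleneck + ReLU -> (d,)
    logits = np.einsum('mcd,d->mc', W_expand, z)  # branch-wise logits (M, C)
    logits -= logits.max(axis=0, keepdims=True)   # numerically stable softmax
    a = np.exp(logits)
    a /= a.sum(axis=0, keepdims=True)             # softmax across branches
    # select: per-channel convex combination, broadcast spatially
    V = np.sum(a[:, :, None, None] * np.stack(branches), axis=0)
    return V, a

# Toy usage with hypothetical sizes: M=2 branches, C=4 channels, 3x3 maps, d=2.
rng = np.random.default_rng(0)
branches = [rng.standard_normal((4, 3, 3)) for _ in range(2)]
V, a = sk_select(branches,
                 W_reduce=rng.standard_normal((2, 4)),
                 W_expand=rng.standard_normal((2, 4, 2)))
```

By construction, `a` sums to one over the branch axis for every channel, so the output is always a convex combination of the branch responses.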
2. Mathematical Formalism
The canonical SK attention process for two or more branches is as follows (Li et al., 2019, Li et al., 2022):
Branch computation:
$$U^{(m)} = F^{(m)}(X), \qquad m = 1, \dots, M,$$
where $F^{(m)}$ denotes convolution with a branch-specific kernel size, dilation, or other transform.
Fusion (summation):
$$U = \sum_{m=1}^{M} U^{(m)}$$
Global pooling:
$$s_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j)$$
Dimensionality reduction (bottleneck):
$$z = \delta\left(\mathrm{BN}(W_{\mathrm{fc}}\, s)\right), \qquad W_{\mathrm{fc}} \in \mathbb{R}^{d \times C},$$
with $\delta$ the ReLU and reduction ratio $r$ setting $d = \max(C/r, L)$.
Attention logits per branch:
$$g^{(m)}_c = A^{(m)}_c z, \qquad A^{(m)} \in \mathbb{R}^{C \times d}$$
Softmax over branches (for each channel $c$):
$$a^{(m)}_c = \frac{\exp(g^{(m)}_c)}{\sum_{n=1}^{M} \exp(g^{(n)}_c)}$$
Weighted selection:
$$V_c = \sum_{m=1}^{M} a^{(m)}_c\, U^{(m)}_c, \qquad \sum_{m=1}^{M} a^{(m)}_c = 1.$$
Specialized modules adjust this pattern by, e.g., generating selection weights from a spatial descriptor (Fu et al., 2024), integrating Squeeze-and-Excitation logic (Li et al., 2022, Wang et al., 2020), or dynamically modulating attention over frequency or channel axes (Mun et al., 2022).
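As a quick numerical check of the softmax step, the two-branch case reduces to sigmoid gating on the logit difference: $a_c = \sigma(g^{(1)}_c - g^{(2)}_c)$ and $b_c = 1 - a_c$. The logit values below are hypothetical, chosen only to illustrate the identity.

```python
import numpy as np

# Hypothetical attention logits for one channel of a two-branch SK unit.
g1, g2 = 1.3, -0.4
a = np.exp(g1) / (np.exp(g1) + np.exp(g2))   # softmax weight, branch 1
b = np.exp(g2) / (np.exp(g1) + np.exp(g2))   # softmax weight, branch 2
# Two-branch softmax equals the logistic sigmoid of the logit difference.
sigmoid = 1.0 / (1.0 + np.exp(-(g1 - g2)))
```

This equivalence is why two-branch SK attention is sometimes described interchangeably as softmax selection or sigmoid gating.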
3. Kernel Branch Design and Variants
- Kernel Size and Dilation: Typical SK modules employ $3 \times 3$ and $5 \times 5$ convolutions (either as explicit kernels or via dilation). Larger receptive fields (e.g., large kernels or progressive dilation) are implemented in very wide or spatially rich tasks (Fu et al., 2024). Deformable convolutions allow spatially adaptive sampling, further increasing the adaptive field (Li et al., 2022).
- Number of Branches: Two-branch SK units ($3 \times 3$ and $5 \times 5$) are most common; three-branch SK units (e.g., kernel sizes 3, 5, 7 or increasing dilation) have been explored for richer scale mixture (Li et al., 2019, Fu et al., 2024, Zeng et al., 19 Jan 2026).
- Specialized Convolution Types: SKANet introduces an Asymmetric Convolution Block (ACB) in each branch, summing $d \times d$, $1 \times d$, and $d \times 1$ convolutions (with shared channels), and applies different dilations for multi-scale context. During inference, the three kernels are fused into a single square kernel to avoid overhead (Zeng et al., 19 Jan 2026).
- Attention Mechanisms: The Squeeze-and-Excitation submodule plays a critical role: global pooling and MLP-based gating are used universally, with per-channel softmax gating as the primary selection mechanism (Li et al., 2019, Li et al., 2022, Wang et al., 2020). Frequency- and scale-selective attention is applied in speaker verification models (Mun et al., 2022), pooling along alternative axes.
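The inference-time kernel fusion used by asymmetric convolution blocks follows from the linearity of convolution: zero-padding the $1 \times 3$ and $3 \times 1$ kernels to $3 \times 3$ and summing them yields a single kernel with identical output. A small NumPy sketch (with a naive valid-mode cross-correlation, random kernels, and no per-branch batch norm, which real ACBs also fold into the fused kernel):

```python
import numpy as np

def conv2d(x, k):
    """Naive valid-mode 2-D cross-correlation (for illustration only)."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 6))
k_sq = rng.standard_normal((3, 3))   # square branch
k_h  = rng.standard_normal((1, 3))   # horizontal branch
k_v  = rng.standard_normal((3, 1))   # vertical branch

# Embed the asymmetric kernels in a 3x3 frame so all branch outputs align.
k_h_pad = np.zeros((3, 3)); k_h_pad[1, :] = k_h[0]
k_v_pad = np.zeros((3, 3)); k_v_pad[:, 1] = k_v[:, 0]
k_fused = k_sq + k_h_pad + k_v_pad   # single inference-time kernel

# Training-time view: sum of three branch responses.
y_branches = conv2d(x, k_sq) + conv2d(x, k_h_pad) + conv2d(x, k_v_pad)
# Inference-time view: one convolution with the fused kernel.
y_fused = conv2d(x, k_fused)
```

Because the fusion is exact, the three-branch training structure adds zero cost at inference, which is the overhead argument made for SKANet's ACB branches.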
4. Integration into Architectures and Workflows
SK modules are used as drop-in replacements for standard convolutions in both backbone and decoder/encoder blocks:
- Image Classification: SKNet substitutes every grouped $3 \times 3$ convolution in ResNeXt-style bottleneck blocks (Li et al., 2019).
- Semantic/U-net Style Segmentation: SK-Unet applies SK modules in decoder stages to improve spatial and channel selectivity for medical segmentation (Wang et al., 2020); LSKSANet introduces large depth-wise kernels and spatial mask gating in the decoder for remote-sensing segmentation (Fu et al., 2024).
- Hyperspectral Processing: FSKNet integrates a two-branch deformable SK module after its 3D-to-2D conversion, resulting in substantial parameter and FLOP savings for HSI classification (Indian Pines, UP, Salinas, Botswana) (Li et al., 2022).
- Time-Series and Speech: In speaker verification, SK modules are embedded in both the front end (channel/frequency SKA) and backbone (multi-scale SKA in Res2Net partitioned channels) to allow adaptive field selection in both channel and frequency axes (Mun et al., 2022).
- Compound GNSS Interference: SKANet applies a three-branch SK with ACBs at varying dilations for robust multi-scale GNSS jamming classification (Zeng et al., 19 Jan 2026).
5. Empirical Results and Computational Characteristics
SK modules have demonstrated consistent empirical advantages:
- Classification and Segmentation: On ImageNet-1K, SKNet-50 achieves lower top-1 error (20.79%) than SENet-50 (21.12%) and ResNeXt-50 (22.23%) with comparable parameters and FLOPs (Li et al., 2019). On medical segmentation, SK-Unet attains mean dice scores of 0.922 (LV), 0.827 (LVM), and 0.874 (RV), outperforming fixed-kernel architectures (Wang et al., 2020).
- Efficiency: FSKNet achieves an order-of-magnitude reduction in parameter count and FLOPs relative to dense 3D-CNNs, with only one SK module placed after the 3D-to-2D conversion (Li et al., 2022).
- Ablations: Removal of attention (replacing softmax by averaging) or deformable offsets leads to a decrease of at least $0.5$ percentage points in overall accuracy on hyperspectral tasks (Li et al., 2022).
- Speaker Verification: SKA modules provide error rate reductions in ECAPA-CNN-TDNN and Res2Net-based models, attaining EER as low as 0.78% on VoxCeleb1-O with minimal parameter increase (Mun et al., 2022).
- Remote Sensing: LSKSANet yields an mIoU improvement on the Vaihingen dataset compared to UNetFormer, with only $0.2$M extra parameters (Fu et al., 2024).
- GNSS Interference: SKANet attains 96.99% overall accuracy under low JNR conditions, demonstrating robustness for compound interference (Zeng et al., 19 Jan 2026).
6. Comparative Module Designs
| Module | Branch Type | Selection/Gating | Domains of Use |
|---|---|---|---|
| SKNet | Conv (3/5/7), dilated | Channel softmax, SE | Classification, Segmentation |
| FSKNet | Deformable Conv (3/5) | SE, 2-branch softmax | Hyperspectral Image Class. |
| SK-Unet | Conv (3/5) | SE, 2-branch softmax | Cardiac MR Segmentation |
| LSKSANet | DW Conv (5,7)+1x1 | Spatial mask (sigmoid) | Remote Sensing Segm. |
| SKANet | ACB, dilations (1/2/4) | Channel softmax | GNSS Interference Class. |
| SKA-SV | Conv2D (3/5) | Channel/frequency softmax | Speaker Verification |
SE = Squeeze-and-Excitation, DW = Depth-wise, ACB = Asymmetric Convolution Block
7. Significance and Theoretical Considerations
The Multi-Branch SK Module framework is characterized by several key features:
- Dynamic Receptive Field Adaptation: Neurons or channels adjust their spatial context on a per-instance and per-channel basis, as visualized in SKNet with shifting attention towards larger kernels for larger objects (Li et al., 2019).
- Resource Efficiency: SK modules enhance representational power with modest parameter/FLOP increase and in some settings (e.g., FSKNet) enable dramatic model compression while retaining accuracy.
- Generalizability: The SK paradigm has been successfully adapted across spatial and spectral domains, modalities (image, time-frequency), and convolutional primitives, underscoring its flexibility.
- Limitations and Extensions: Branched modules without dynamic selection (e.g., MixModule (Yu et al., 2019)) still benefit from parallel kernels but lack the adaptive capacity of SK modules. Attention-based branch selection remains crucial to the empirical advantage of true SK designs.
In summary, the Multi-Branch Selective Kernel module provides a principled and empirically validated methodology for multi-scale feature aggregation with minimal cost, adaptable across architectures and modalities, and widely adopted in recent state-of-the-art deep learning systems (Li et al., 2019, Li et al., 2022, Fu et al., 2024, Wang et al., 2020, Mun et al., 2022, Zeng et al., 19 Jan 2026).