SEVector: Adaptive Feature Fusion
- SEVector is an adaptive, learnable reweighting mechanism that interpolates between complementary pooling outputs.
- It is mathematically defined as y = a ⊙ x₁ + (1 - a) ⊙ x₂, enabling flexible fusion across channel, spatial, or segmental axes.
- Empirical results in speech quality, object detection, and image recognition confirm significant performance improvements with SEVectors.
A Squeeze-and-Excitation Vector (SEVector) is not explicitly defined as a canonical module in the existing literature as of 2026, but it is often referenced in the context of pooling and attention strategies that adaptively reweight, combine, or recalibrate features across multiple axes (channel, spatial, segmental, etc.) using parameterized mechanisms. SEVectors serve as adaptive gating or fusion vectors that interpolate between complementary information streams before final aggregation. The following sections review the mathematical construction, operational context, and empirical role of SEVectors within modern dual-pooling architectures, synthesizing their appearance across statistics pooling in speech quality modeling, multi-granularity fusion for object detection, and channel-spatial recalibration in computer vision.
1. Conceptual Definition and General Framework
SEVectors act as learnable or adaptive weightings—typically vector-valued and differentiable—by which a neural network can merge distinct features, often arising from separate pooling paths (e.g., global vs. local, max vs. average, cluster- vs. grid-based). The archetype is a per-feature scalar or per-channel vector a ∈ [0, 1]^C, trained via backpropagation to interpolate between pooled feature maps. This vector is either fed into elementwise multiplication gates or broadcast in linear combinations, supporting soft selection and fusion of information relevant to the downstream prediction objective.
A typical SEVector fusion appears as

y = a ⊙ x₁ + (1 − a) ⊙ x₂,

where ⊙ denotes elementwise multiplication, x₁ and x₂ are feature maps from different pooling pathways, and a is the SEVector learned to adaptively control the contribution of each path (see (Pan et al., 27 May 2025)).
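This interpolation can be sketched in a few lines of plain Python. This is a didactic illustration only; the function and variable names (`sevector_fuse`, `a`, `x1`, `x2`) are assumptions of this sketch, not identifiers from the cited papers.

```python
# Minimal sketch of SEVector fusion: a gate vector `a` (one entry per
# channel, values in [0, 1]) interpolates elementwise between two pooled
# feature vectors x1 and x2.
def sevector_fuse(a, x1, x2):
    """Elementwise y = a * x1 + (1 - a) * x2."""
    return [ai * v1 + (1.0 - ai) * v2 for ai, v1, v2 in zip(a, x1, x2)]

# a = 1 selects x1 entirely; a = 0 selects x2; a = 0.5 averages the two.
y = sevector_fuse([1.0, 0.0, 0.5], [2.0, 2.0, 2.0], [4.0, 4.0, 4.0])
# y == [2.0, 4.0, 3.0]
```

Because the gate acts per channel, the network can prefer one pooling path for some channels and the other path for the rest.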
2. Mathematical Formulation of SEVector Fusion
SEVectors are instantiated via several forms, depending on the architecture:
- Per-channel gates: a ∈ [0, 1]^C, controlling channel-specific contributions (e.g., Dual Pool Downscale Fusion, (Pan et al., 27 May 2025)).
- Scalar or vector combination weights: α, β ∈ ℝ, linearly interpolating pooled statistics (e.g., DRASP global and attentive statistics, (Yang et al., 29 Aug 2025)).
- Spatially adaptive gates: β(r) ∈ [0, 1] per spatial location, governing local emphasis between complementary pooling kernels (e.g., adaPool, (Stergiou et al., 2021)).
A representative formalism from (Pan et al., 27 May 2025) is:

y = a ⊙ MaxPool(x) + (1 − a) ⊙ AvgPool(x),

where MaxPool(x) and AvgPool(x) are outputs of the max-pooling and average-pooling branches, respectively, and a is a learnable fusion vector.
SEVectors need not be explicit network modules; rather, they arise as adaptive, parameterized vectors computed by lightweight neural layers (MLPs, convolutions, or direct parametrization) and may be initialized to encourage default behavior (e.g., initializing (α, β) = (1, 0) makes the system rely on the global statistics pool at the start of training, (Yang et al., 29 Aug 2025)).
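One common way to keep such a gate inside (0, 1) is to store a free parameter and squash it through a sigmoid; a minimal sketch, assuming this parameterization (the name `gate` and the specific initialization values are illustrative, not taken from any of the cited papers):

```python
import math

# A raw free parameter theta is squashed through a sigmoid so the gate
# stays in (0, 1). Initializing theta = 0 yields a neutral gate of 0.5;
# a large positive theta biases the gate toward the first branch,
# mimicking a task-informed initialization.
def gate(theta):
    return 1.0 / (1.0 + math.exp(-theta))

assert gate(0.0) == 0.5   # neutral 50/50 fusion at initialization
assert gate(6.0) > 0.99   # strongly prefers branch 1
```

The sigmoid keeps the gate differentiable everywhere, so the constraint costs nothing during backpropagation.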
3. SEVector Instantiations in Dual-Pooling and Attention Architectures
Dual Pool Downscale Fusion (DPDF)
In the YOLO-FireAD architecture (Pan et al., 27 May 2025), the DPDF block computes parallel max- and average-pooling over input features, processes each branch with shallow convolutions and attention, and merges them using a per-channel SEVector a. This parameter vector is initialized at 0.5 and refined during training. DPDF blocks demonstrate that maximally informative (e.g., edge or salient) features and contextually smooth features can be adaptively fused, alleviating the typical trade-off between detail preservation and noise robustness.
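The fusion step of such a block can be sketched as follows. This is a simplified stand-in, not the DPDF implementation: the real block also applies convolutions and attention inside each branch, which the sketch omits, and the names `dpdf_fuse`, `channels`, `a` are assumptions.

```python
# Per-channel fusion of a max-pooling branch (salient detail) and an
# average-pooling branch (smooth context), gated by a per-channel vector
# `a` initialized at 0.5.
def dpdf_fuse(channels, a):
    out = []
    for g, vals in zip(a, channels):
        mx = max(vals)               # max-pooling branch
        avg = sum(vals) / len(vals)  # average-pooling branch
        out.append(g * mx + (1.0 - g) * avg)
    return out

# Two channels, gates at the 0.5 initialization: each output is the
# midpoint of that channel's max and mean.
fused = dpdf_fuse([[1.0, 3.0], [0.0, 8.0]], [0.5, 0.5])
# fused == [2.5, 6.0]
```

During training the per-channel gates drift away from 0.5 wherever one branch is more useful, which is exactly the adaptive trade-off described above.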
Dual-Resolution Attentive Statistics Pooling
DRASP (Yang et al., 29 Aug 2025) employs scalar trainable coefficients α and β as squeeze-and-excitation weights to mix coarse-grained global statistics and fine-grained segmental attentive statistics:

s = α · s_global + β · s_att,

where s_global and s_att represent pooled statistics across the entire utterance and over salient segments, respectively. This allows the handling of both global structure and local detail in speech quality assessment.
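A minimal sketch of this scalar mixing, assuming the (1, 0) initialization described for DRASP (the function name `drasp_mix` and the toy statistics vectors are illustrative, not from the paper):

```python
# Scalar weights alpha and beta combine utterance-level global statistics
# with segment-level attentive statistics, elementwise over the pooled
# statistics vector.
def drasp_mix(alpha, beta, s_global, s_attentive):
    return [alpha * g + beta * s for g, s in zip(s_global, s_attentive)]

# With the (alpha, beta) = (1, 0) initialization, the output equals the
# global statistics, i.e. the model falls back to pure global pooling.
assert drasp_mix(1.0, 0.0, [0.2, 0.7], [0.9, 0.1]) == [0.2, 0.7]
```

As training proceeds, β grows wherever the segmental branch reduces the loss, smoothly shifting the pooled representation toward fine-grained detail.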
Adaptive Pooling (adaPool)
In the adaPool framework (Stergiou et al., 2021), a spatially varying scalar β(r) is used to combine two pooling kernels (exponentiated Dice–Sorensen coefficient and exponential maximum), yielding a smoothly adjustable pooling behavior:

y(r) = β(r) · y_DSC(r) + (1 − β(r)) · y_EM(r),

β(r) is either parameterized directly or via a sigmoid and updated to enhance local information retention or focus.
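The spatial gating pattern can be sketched as below. Note this is a simplification: the sketch substitutes average and max pooling for adaPool's actual DSC-based and exponential-maximum kernels, and the name `adapool_like` flags that it is only an analogue.

```python
# Per-region spatial gating: each pooling region r gets its own scalar
# beta[r] in [0, 1] that mixes a "sharp" and a "smooth" kernel output.
def adapool_like(regions, beta):
    out = []
    for b, vals in zip(beta, regions):
        smooth = sum(vals) / len(vals)  # stand-in for the DSC-based kernel
        sharp = max(vals)               # stand-in for the exp-max kernel
        out.append(b * sharp + (1.0 - b) * smooth)
    return out

# One region emphasizes the sharp kernel, the other the smooth one.
pooled = adapool_like([[1.0, 5.0], [1.0, 5.0]], [1.0, 0.0])
# pooled == [5.0, 3.0]
```

The key difference from the per-channel case is that the gate varies over spatial locations, so textured regions and flat regions can be pooled differently within the same feature map.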
The following table summarizes representative SEVector forms:
| Architecture | SEVector Type | Role |
|---|---|---|
| DPDF (Pan et al., 27 May 2025) | Per-channel vector a ∈ [0, 1]^C | Fusion of max- and average-pooling branches |
| DRASP (Yang et al., 29 Aug 2025) | Scalar weights α, β | Global–salient interpolation of statistics |
| adaPool (Stergiou et al., 2021) | Spatial mask β(r) | Per-region adaptive fusion of pooling kernels |
4. Training and Optimization of SEVectors
SEVectors are typically parameterized as free weights, possibly with constraints (sigmoid, softmax, or ReLU activations) as needed by the architecture. They are initialized to neutral or task-informed values and learned during regular supervised training by gradient descent methods (AdamW, SGD). In DPDF (Pan et al., 27 May 2025), the fusion vector a is a free per-channel weight updated jointly with the rest of the model. For DRASP (Yang et al., 29 Aug 2025), α and β are initialized as (1, 0) to prioritize fallback to global pooling and are gradually tuned as the model learns to benefit from fine-grained segmental information.
During backpropagation, SEVectors receive gradients that reflect their contribution to the loss relative to the alternative pooling branches, effectively gating their relative importance in an end-to-end fashion.
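For a scalar gate this gradient can be derived by hand; a worked example under a squared-error loss (the setup and names are mine, chosen only to make the gating effect concrete):

```python
# For L = (y - t)^2 with y = a*x1 + (1 - a)*x2, the chain rule gives
# dL/da = 2*(y - t)*(x1 - x2): the gate is pushed toward whichever
# branch reduces the error.
def gate_gradient(a, x1, x2, t):
    y = a * x1 + (1.0 - a) * x2
    return 2.0 * (y - t) * (x1 - x2)

# If the target equals branch x1's value and x1 > x2, the gradient at a
# neutral gate a = 0.5 is negative, so gradient descent increases a
# (moving the fusion toward x1).
g = gate_gradient(0.5, 4.0, 2.0, 4.0)
# g == 2 * (3 - 4) * (4 - 2) == -4.0
assert g < 0
```

The (x1 − x2) factor also shows that when the two branches agree, the gate receives no gradient at all, which is consistent with its role as a tie-breaker between complementary pooling paths.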
5. Practical Applications and Architectural Integration
SEVectors are integrated at various locations within modern deep architectures. In YOLO-FireAD (Pan et al., 27 May 2025), DPDF blocks with SEVectors replace standard stride-2 convolutions for downscaling at multiple backbone stages and are also deployed before feature pyramids in the network neck. In DRASP (Yang et al., 29 Aug 2025), SEVectors control the fusion of global and segmental statistics, serving as drop-in replacements for any fixed pooling layer. PVAFN (Li et al., 2024) merges cluster- and pyramid-pooled features by a learned weighting; the paper does not use the term SEVector, but the mechanism is functionally equivalent.
The training of SEVectors is not separated from the main loss; their adaptive role emerges as the model optimizes the end-task objective (e.g., mean opinion score, detection accuracy).
6. Empirical Performance and Observed Benefits
SEVector-based fusion yields consistent and measurable improvements across a wide range of evaluation tasks:
- Speech quality MOS prediction (DRASP): A +10.39% relative improvement in SRCC over average pooling, with ablation confirming the necessity of both global and segmental branches (Yang et al., 29 Aug 2025).
- Object detection (YOLO-FireAD): DPDF (with SEVector fusion) raises mAP by 1.7pp while reducing both parameters and computational cost by ~15% (Pan et al., 27 May 2025).
- Adaptive Pooling (adaPool): Top-1 accuracy improvements range from 1.2% to 2.5% across classification and action recognition, with clear gains in AP for object detection and PSNR for super-resolution (Stergiou et al., 2021).
A plausible implication is that SEVectors facilitate more flexible information routing in neural feature hierarchies, allowing models to dynamically adjust feature aggregation to match input characteristics, task demands, or data scale. This adaptability is especially effective in heterogeneous or multi-granular pooling scenarios.
7. Significance and Future Directions
The emergence of SEVectors formalizes a general principle of adaptive feature fusion in neural architectures. By enabling differentiable, data-driven selection or calibration between complementary information sources, SEVectors support robust generalization and improved information preservation without significant parameter or inference overhead.
Future research may explore hierarchical or multi-level SEVectors, spatially nonuniform variants, and integration into transformer or graph-based networks. The demonstrated empirical benefits across audio, vision, and multimodal domains support adaptation of SEVectors as a standard design pattern in pooling and attention mechanisms, particularly where balancing detail and context is critical.