MSFA: Multi-Scale Fusion Adapter
- The paper introduces MSFA, an ODE-based multi-step fusion module that improves U-Net decoders through adaptive predictor–corrector schemes.
- MSFA redefines feature fusion by treating encoder outputs as IVP samples and applying linear multistep ODE solvers for enhanced multi-scale interactions.
- Empirical results demonstrate that FuseUNet with MSFA achieves reduced parameters and GFLOPS while maintaining or improving Dice scores across medical imaging benchmarks.
The Multi-Scale Fusion Adapter (MSFA) is a decoder-stage architectural module designed for U-Net and U-like networks to facilitate explicit, high-order, multi-scale feature integration within the decoder. Unlike standard skip connections based on concatenation or addition, MSFA treats the sequence of encoder features as samples in an initial value problem (IVP), employing numerical ordinary differential equation (ODE) solvers—specifically linear multistep methods—to propagate fused information across decoder stages. This approach, exemplified in FuseUNet, leads to improved utilization of multi-scale features, reduced network parameter counts, and maintained or improved segmentation performance across diverse medical imaging benchmarks (He et al., 6 Jun 2025).
1. Motivation and Limitations of Conventional Skip Connections
In conventional U-Net architectures, skip connections provide only same-scale feature fusion at each decoder stage $i$, using the current encoder feature $x_i$ and the previous decoder state $d_{i+1}$:

$$d_i = \mathrm{Up}(d_{i+1}) \oplus x_i,$$

where $\oplus$ denotes addition or channel concatenation.
These connections are typically implemented by simple addition or concatenation, which can be mathematically interpreted as a first-order explicit Euler method. Such limited “accuracy” in integrating hierarchical features leads to several drawbacks:
- Incomplete Multi-Scale Information Utilization: No cross-scale interaction is performed, causing under-utilization of contextual information from coarser or finer encoder levels.
- Parameter Inefficiency: Redundant context must be re-learned within the decoder due to incomplete feature integration.

The MSFA directly addresses these deficiencies by framing feature fusion as a multi-step process governed by ODE-based modeling (He et al., 6 Jun 2025).
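The first-order interpretation above can be made concrete: an additive skip connection is exactly one explicit Euler step with unit step size. A minimal NumPy sketch (shapes and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def euler_step(y, f_val, h=1.0):
    """One explicit Euler step: y_{n+1} = y_n + h * f(t_n, y_n)."""
    return y + h * f_val

rng = np.random.default_rng(0)
decoder_state = rng.standard_normal((1, 64, 32, 32))  # d_{i+1}, already upsampled
encoder_feat = rng.standard_normal((1, 64, 32, 32))   # x_i, same scale

# An additive skip connection ...
fused_add = decoder_state + encoder_feat
# ... is identical to an Euler step with h = 1 and the encoder feature as the "derivative".
fused_euler = euler_step(decoder_state, encoder_feat, h=1.0)
assert np.allclose(fused_add, fused_euler)
```

This equivalence is what motivates replacing the fusion rule with higher-order solvers: Euler is the lowest-accuracy member of the multistep family.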
2. Theoretical Foundation: IVP and Linear Multistep Methods
MSFA reconceptualizes decoder computation as the temporal evolution of latent states in an IVP, using multi-scale encoder outputs as temporal samples. The continuous form is:

$$\frac{dy(t)}{dt} = f\big(t, y(t)\big), \qquad y(t_0) = y_0.$$

The instantiated MSFA ODE is:

$$\frac{dy(t)}{dt} = -y(t) + f\big(y(t) + g(x_t)\big),$$

where $g(\cdot)$ aligns encoder features to the memory space and $f(\cdot)$ is an adaptable nonlinear operator (e.g., conv-block, Transformer block, or Mamba block). Time discretization is achieved by a $k$-step linear multistep method with step size $h$:

$$y_{n+1} = y_n + h \sum_{j=-1}^{k-1} \beta_j\, f_{n-j},$$

where $\beta_{-1} = 0$ yields an explicit scheme and $\beta_{-1} \neq 0$ an implicit one.
Two classical methods are used:
- Adams–Bashforth (AB-$k$): explicit, with $\beta_{-1} = 0$, using only past derivative evaluations $f_n, \dots, f_{n-k+1}$.
- Adams–Moulton (AM-$k$): implicit, with $\beta_{-1} \neq 0$, additionally using the new evaluation $f_{n+1}$.

The MSFA includes an adaptive predictor–corrector scheme combining AB and AM methods per stage, with order up to 4 for stability (He et al., 6 Jun 2025).
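As a concrete illustration of the predictor–corrector idea, the following sketch applies the standard textbook AB-4/AM-3 coefficients to a scalar test ODE (rather than the paper's feature maps; the RK4 bootstrap for the first steps is a common convention, not necessarily the paper's):

```python
import math

# Standard Adams-Bashforth-4 (explicit predictor) and
# Adams-Moulton-3 (implicit corrector, evaluated in predict-evaluate-correct mode).
AB4 = [55/24, -59/24, 37/24, -9/24]  # weights on f_n, f_{n-1}, f_{n-2}, f_{n-3}
AM3 = [9/24, 19/24, -5/24, 1/24]     # weights on f_{n+1}, f_n, f_{n-1}, f_{n-2}

def solve(f, y0, t0, h, steps):
    """Integrate dy/dt = f(t, y) with an AB4 predictor / AM3 corrector.
    The first three steps are bootstrapped with classical RK4."""
    ts, ys = [t0], [y0]
    fs = [f(t0, y0)]
    for _ in range(steps):
        t, y = ts[-1], ys[-1]
        if len(fs) < 4:  # bootstrap until 4 history points exist
            k1 = f(t, y)
            k2 = f(t + h/2, y + h/2 * k1)
            k3 = f(t + h/2, y + h/2 * k2)
            k4 = f(t + h, y + h * k3)
            y_next = y + h/6 * (k1 + 2*k2 + 2*k3 + k4)
        else:
            # Predict with AB-4 over the stored derivative history ...
            y_pred = y + h * sum(c * fv for c, fv in zip(AB4, fs[::-1][:4]))
            # ... then correct with AM-3 using f at the predicted point.
            f_pred = f(t + h, y_pred)
            hist = [f_pred] + fs[::-1][:3]
            y_next = y + h * sum(c * fv for c, fv in zip(AM3, hist))
        ts.append(t + h)
        ys.append(y_next)
        fs.append(f(ts[-1], y_next))
    return ts, ys

# dy/dt = -y with y(0) = 1; exact solution is exp(-t).
ts, ys = solve(lambda t, y: -y, 1.0, 0.0, 0.05, 40)
assert abs(ys[-1] - math.exp(-ts[-1])) < 1e-5
```

The fourth-order pair tracks the exact solution far more tightly than a single Euler step would, which is the "accuracy" argument behind replacing first-order skip fusion.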
3. MSFA Module Architecture
At each decoder level $i$, the MSFA operates as follows:
- Inputs: Encoder feature $x_i$ and previous memory state $y_{i+1}$ (with the memory state initialized to zero at the deepest stage).
- Alignment ($g$-function): Project $x_i$ to the memory channel width $C_m$ with a $1\times1$ convolution and upsample/downsample spatially to match $y$.
- Nonlinear Update ($f$-function): Typically two sequential Conv$3\times3$–BatchNorm–ReLU blocks (or a Transformer/Mamba block for non-CNN backbones), with an optional residual connection.
- nmODEs Block: Computes $\dot{y} = -y + f\big(y + g(x_i)\big)$.
- Predictor–Corrector Fusion (per Algorithm 1 in (He et al., 6 Jun 2025)):
  - For early stages with fewer than four stored derivative samples, use an AB predictor and AM corrector of matching lower order.
  - Otherwise, use AB-4 as predictor and AM-3 as corrector.
- Final Decoder Output: The top stage applies an explicit AB step to produce the final memory state.
- Segmentation Head: A $1\times1$ convolution maps the $C_m$ memory channels to class logits, followed by softmax.
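Putting the pieces together, a single MSFA-style fusion pass might look like the NumPy sketch below. The $1\times1$ projection, the Conv–BN–ReLU stack, and the solver order are all simplified stand-ins (a lone ReLU for $f$, AB-2 instead of the full AB-4/AM-3 pair, and spatial resampling omitted); names and shapes are illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def g_align(x, w):
    """g-function: 1x1 convolution as a channel projection (C_in -> C_m).
    Spatial up/downsampling is omitted; all features share one resolution here."""
    return np.einsum("mc,chw->mhw", w, x)  # x: (C_in, H, W), w: (C_m, C_in)

def f_nonlinear(z):
    """f-function stand-in: a single ReLU instead of two Conv3x3-BN-ReLU blocks."""
    return np.maximum(z, 0.0)

def nmode_derivative(y, x_aligned):
    """nmODE block: dy/dt = -y + f(y + g(x))."""
    return -y + f_nonlinear(y + x_aligned)

def ab2_step(y, f_hist, h=1.0):
    """Two-step Adams-Bashforth update: y + h*(3/2 f_n - 1/2 f_{n-1})."""
    return y + h * (1.5 * f_hist[-1] - 0.5 * f_hist[-2])

C_in, C_m, H, W = 32, 16, 8, 8
w = rng.standard_normal((C_m, C_in)) * 0.1
features = [rng.standard_normal((C_in, H, W)) for _ in range(3)]  # encoder outputs

y = np.zeros((C_m, H, W))  # memory state, zero-initialized at the deepest stage
f_hist = []
for x_i in features:
    f_hist.append(nmode_derivative(y, g_align(x_i, w)))
    if len(f_hist) == 1:
        y = y + 1.0 * f_hist[-1]   # AB-1 (Euler) bootstrap
    else:
        y = ab2_step(y, f_hist)    # multistep fusion once history exists
assert y.shape == (C_m, H, W)
```

The key structural point survives the simplification: each stage contributes one derivative sample, and the memory state is advanced by a multistep rule over the stored samples rather than by a fresh concatenation.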
4. Integration into U-Net and Encoder-Agnostic Design
MSFA replaces only the skip connections and decoder logic; it remains agnostic to the encoder architecture (CNN, Transformer, or Mamba). Channel alignment is ensured by mapping each encoder feature to the memory channel width before fusion. Integration requires storing up to four past $y$ and $f$ states for multi-step fusion, which may increase VRAM overhead. Training protocols and learning rates typically match the encoder backbone, though a roughly $2\times$ higher decoder learning rate is recommended if the decoder is substantially smaller in parameter count (He et al., 6 Jun 2025).
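Because the multistep update only ever needs a bounded history, the extra memory can be capped with a fixed-length buffer. A small sketch of this bookkeeping (a hypothetical helper, not the paper's implementation):

```python
from collections import deque

class MultistepHistory:
    """Keeps at most k past derivative evaluations for a k-step solver,
    so the storage overhead stays bounded regardless of network depth."""

    def __init__(self, k=4):
        self.buf = deque(maxlen=k)  # oldest entries are evicted automatically

    def push(self, f_val):
        self.buf.append(f_val)

    def order(self):
        # Usable solver order this step is limited by the available history.
        return len(self.buf)

hist = MultistepHistory(k=4)
for step in range(6):
    hist.push(f"f_{step}")  # placeholder for a derivative tensor
assert hist.order() == 4            # capped at k even after 6 pushes
assert list(hist.buf)[-1] == "f_5"  # only the most recent k samples remain
```

In a real network each buffered entry is a full feature map, which is the source of the VRAM overhead noted above; the cap at $k=4$ keeps it constant per stage.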
5. Empirical Results, Benchmarks, and Ablations
Empirical evaluation across various datasets and backbones demonstrates that MSFA enables parameter- and compute-efficient segmentation without loss of accuracy. Representative metrics are summarized below (Dice averages for key 3D datasets):
| Dataset | Model | Params | GFLOPS | Dice Avg |
|---|---|---|---|---|
| ACDC | nn-UNet | 31.2 M | 402.6 | 91.54 |
| ACDC | FuseUNet (MSFA) | 14.0 M | 264.9 | 91.57 |
| KiTS2023 | nn-UNet | 31.2 M | 402.6 | 86.04 |
| KiTS2023 | FuseUNet (MSFA) | 14.0 M | 264.9 | 86.19 |
| MSD brain tumor | UNETR | 103.7 M | 40.3 | 71.1 |
| MSD brain tumor | FuseUNet (MSFA) | 89.2 M | 20.1 | 72.6 |
Ablation studies reveal:
- Increasing the solver order from 1 to 4 yields a consistent Dice gain.
- An intermediate memory channel width gives the best accuracy–efficiency trade-off.
- Paired t-tests show no significant degradation versus the backbone network.
6. Practical Implementation Details and Recommendations
- Step Size: A fixed step size $h$ is used between adjacent encoder stages.
- Order Hyperparameter: Use order $k \le 4$ at each stage to balance stability and memory.
- Parameter Overhead: Each stage introduces one $1\times1$ conv, two $3\times3$ convs, their batchnorms, and the weight-free summations of the multistep solvers.
- Training: Loss combines Dice and cross-entropy. Optimizers follow backbone defaults (SGD with momentum or AdamW for Transformers).
- Extensions: Gating or attention mechanisms can be inserted into the $f$-function for further enhancement. A plausible implication is that ODE-style, multi-stage fusion can benefit architectures beyond medical image segmentation, wherever multi-scale hierarchical feature integration is relevant (He et al., 6 Jun 2025).
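For the training loss noted above, a minimal NumPy sketch of the Dice + cross-entropy combination (a standard formulation; the equal weighting and smoothing constant are illustrative choices, not values from the paper):

```python
import numpy as np

def dice_ce_loss(probs, target_onehot, eps=1e-6, w_dice=0.5):
    """Combined soft-Dice + cross-entropy loss.
    probs: (N, C) softmax outputs; target_onehot: (N, C) one-hot labels."""
    # Cross-entropy term, averaged over samples.
    ce = -np.mean(np.sum(target_onehot * np.log(probs + eps), axis=1))
    # Soft-Dice term, averaged over classes.
    inter = np.sum(probs * target_onehot, axis=0)
    denom = np.sum(probs, axis=0) + np.sum(target_onehot, axis=0)
    dice = np.mean((2 * inter + eps) / (denom + eps))
    return w_dice * (1 - dice) + (1 - w_dice) * ce

probs = np.array([[0.9, 0.1], [0.2, 0.8]])
target = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = dice_ce_loss(probs, target)
assert 0.0 < loss < 1.0
```

Both terms shrink as predictions sharpen toward the labels; the Dice term directly counteracts class imbalance, which is why the combination is the default in most medical-segmentation pipelines.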
7. Comparative Analysis and Significance
Classical fusion approaches in hybrid CNN-Transformer designs typically limit interactions to either one-off concatenation or naive addition, lacking explicit multi-step, multi-scale feature propagation. MSFA provides adaptive, high-order integration throughout the decoder, enabling richer and more efficient aggregation of hierarchical information. This methodological advance yields substantial reductions in parameters (e.g., 54.9% fewer for nn-UNet) and GFLOPS (34.3% fewer), with no loss—and sometimes improvement—in target performance metrics. These outcomes substantiate the utility and transferability of MSFA modules for contemporary U-Net-like segmentation frameworks (He et al., 6 Jun 2025).