
MSE-Optimal Layer Fusion

Updated 21 January 2026
  • MSE-optimal layer fusion is a method that fuses neural network layers by minimizing the mean squared error between fused and reference outputs.
  • It employs closed-form solutions such as LOT merging and OT-based alignment to merge linear, elementwise, and bias layers efficiently.
  • Empirical results demonstrate enhanced performance in multi-task, multi-modal, and network compression settings compared to conventional averaging techniques.

MSE-optimal layer fusion refers to the family of algorithms, structural results, and analytic formulations that produce a single neural network layer (or set of layers) to minimize the mean squared error (MSE) between the network’s output—at a specific intermediate, embedding, or final layer—and the reference outputs of several networks, modules, or signal sources. Recent research has clarified both the structural optimality and the computational tractability of such fusion procedures for a variety of operations, architectures, and learning contexts including multi-task model merging, model compression, weight-space initialization, and multi-modal integration.

1. Formal Definitions and Mathematical Framework

The general problem of MSE-optimal layer fusion can be formulated as follows: given a collection of $K$ networks or models $\{W_k\}_{k=1}^K$ (each possibly trained on a separate task, dataset, or modality) and a prescribed layerwise decomposition, seek a new set of weights $\widehat{W}^{(\ell)}$ for each layer $\ell$ such that the aggregate mean squared error of the transformed activations (or outputs) is minimized across models and input data distributions.

Canonical objective: For input $x$ and a feature function $f_k^{\ell}(W; x)$ (the representation at layer $\ell$ within expert $k$), define:

$$\Delta f_k^{\ell} = f_k^{\ell}(W_{\mathrm{fused}}; X_k^{\ell}) - f_k^{\ell}(W_k; X_k^{\ell})$$

where $X_k^{\ell}$ is a matrix of calibration/example inputs at layer $\ell$. The total drift to be minimized is

$$J^{\ell}(T^{\ell}) = \sum_{k=1}^K \left\| \Delta f_k^{\ell} \right\|_F^2 = \sum_{k=1}^K \left\| f_k^{\ell}(W_{\mathrm{fused}}; X_k^{\ell}) - f_k^{\ell}(W_k; X_k^{\ell}) \right\|_F^2$$

This yields a convex, quadratic problem per layer, typically with closed-form solutions for standard layer types, as demonstrated in (Sun et al., 29 May 2025; Ghods et al., 2020; Chen et al., 22 Aug 2025).

2. MSE-Optimal Fusion in Model Merging and Knowledge Integration

Layer-wise Optimal Task Vector Merging (LOT Merging): For model merging, especially in multi-task consolidation, the LOT Merging strategy proceeds by minimizing feature drift at each layer:

  • For linear (matrix-multiply) layers: The function is $f_k^{\ell}(W) = X_k^{\ell} W^{\ell}$, and the optimal update $T^{\ell *}$ to the task vector is

$$T^{\ell *} = \Big( \sum_k (X_k^{\ell})^\top X_k^{\ell} \Big)^{\dagger} \sum_k (X_k^{\ell})^\top X_k^{\ell} T_k^{\ell}$$

where $T_k^{\ell}$ are the task vectors from the fine-tuned experts and $\dagger$ denotes the Moore-Penrose pseudoinverse.
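The linear closed form above can be sketched in NumPy on synthetic stand-ins for the calibration activations and task vectors; `drift` below is the per-layer sum-of-squared-drift objective being minimized:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, d_in, d_out = 3, 64, 8, 5  # toy sizes, for illustration only

# synthetic calibration activations X_k and per-expert task vectors T_k
Xs = [rng.normal(size=(n, d_in)) for _ in range(K)]
Ts = [rng.normal(size=(d_in, d_out)) for _ in range(K)]

G = sum(X.T @ X for X in Xs)                    # summed Gram matrices
rhs = sum(X.T @ X @ T for X, T in zip(Xs, Ts))  # Gram-weighted task vectors
T_star = np.linalg.pinv(G) @ rhs                # closed-form minimizer

# the per-layer objective: sum_k || X_k T - X_k T_k ||_F^2
def drift(T):
    return sum(np.linalg.norm(X @ T - X @ Tk, "fro") ** 2 for X, Tk in zip(Xs, Ts))

# sanity check: the closed form beats plain task-vector averaging
T_avg = sum(Ts) / K
assert drift(T_star) <= drift(T_avg) + 1e-9
```

Since the objective is convex and quadratic in $T$, setting its gradient to zero gives exactly the normal equations that `T_star` solves.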

  • For elementwise scale (Hadamard) layers (e.g., LayerNorm $\gamma$):

$$T^{\ell *}[d] = \frac{ \sum_{k, x} x[d]^2 \, T_k^{\ell}[d] }{ \sum_{k, x} x[d]^2 }$$

  • For bias (shift) layers:

$$T^{\ell *} = \frac{1}{K} \sum_{k=1}^K T_k^{\ell}$$
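A minimal NumPy sketch of the elementwise-scale and bias closed forms, again on synthetic stand-ins for the calibration activations and expert task vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, d = 3, 64, 10  # toy sizes, for illustration only

Xs = [rng.normal(size=(n, d)) for _ in range(K)]  # calibration activations
Ts = [rng.normal(size=d) for _ in range(K)]       # per-expert scale task vectors

# elementwise-scale closed form: a per-dimension average of the T_k[d],
# weighted by the summed squared activations in that dimension
num = sum((X ** 2 * Tk).sum(axis=0) for X, Tk in zip(Xs, Ts))
den = sum((X ** 2).sum(axis=0) for X in Xs)
T_scale = num / den

# bias closed form: a plain average of the expert shift vectors
Bs = [rng.normal(size=d) for _ in range(K)]
T_bias = sum(Bs) / K

# sanity check: the weighted average beats naive averaging on the scale drift
def drift(t):
    return sum(((X * t - X * Tk) ** 2).sum() for X, Tk in zip(Xs, Ts))

assert drift(T_scale) <= drift(sum(Ts) / K) + 1e-9
```

For the Hadamard case the objective decouples across dimensions, which is why the minimizer is a per-dimension weighted average rather than a matrix solve.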

This procedure, combining layerwise analytics and cheap matrix operations, enables efficient consolidation of fine-tuned models without retraining. Its formal convexity and solution structure are detailed in (Sun et al., 29 May 2025).

Connection to Theoretical Optimality

LOT Merging is MSE-optimal under an additive-drift model, which assumes that layerwise mean squared drift dominates multi-task loss transfer and that the loss surface locally preserves Lipschitz continuity in the feature drift. The per-layer minimization guarantees the unique minimizer for the sum-squared-drift objective (Sun et al., 29 May 2025).

3. Fusion via Optimal Transport and Neuron/Filter Alignment

An alternative and influential approach involves neuron- or channel-alignment via optimal transport (OT) before parameter averaging (Singh et al., 2019). This method is crucial in arbitrary-width networks and networks with permutation-symmetric layers.

  • Problem formulation: At layer $\ell$, for each model $k$ with weights $W^{(k)}$, find soft assignment matrices $T^{(k)}$ (relaxing permutation matrices to the set of doubly stochastic matrices via entropic regularization). Then compute the MSE-optimal fused weights as

$$\widehat{W}^{(\ell)} = \frac{1}{K} \sum_{k=1}^K W^{(k)} T^{(k)}$$

  • Sinkhorn Algorithm: Soft alignments are found via the Sinkhorn-Knopp iterative algorithm over a cost matrix based on parameter distance or activation dissimilarity. Once the assignments are computed, the aligned weights are averaged.

This layerwise OT-based fusion is MSE-optimal among all matched/soft-aligned reparameterizations (Singh et al., 2019).
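The alignment step can be illustrated with a small Sinkhorn-Knopp routine over a squared-distance cost between neuron weight vectors; the regularization strength, iteration count, and toy dimensions below are illustrative choices, not values from the cited work:

```python
import numpy as np

def sinkhorn(cost, reg=0.01, n_iter=200):
    """Sinkhorn-Knopp iterations: entropically regularized transport plan
    between uniform marginals over the two models' neurons."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# toy example: model B is a permuted, slightly perturbed copy of model A
rng = np.random.default_rng(0)
W_a = rng.normal(size=(4, 8))                   # 4 neurons, 8 inputs each
perm = rng.permutation(4)
W_b = W_a[perm] + 0.01 * rng.normal(size=(4, 8))

# cost matrix: squared distances between neuron weight vectors
cost = ((W_b[:, None, :] - W_a[None, :, :]) ** 2).sum(axis=-1)
P = sinkhorn(cost)
T = P / P.sum(axis=0, keepdims=True)            # column-normalized soft assignment
W_b_aligned = T.T @ W_b                         # B's neurons in A's ordering
W_fused = 0.5 * (W_a + W_b_aligned)             # average after alignment
```

Because the permuted near-copy makes the cost matrix almost a permutation structure, the recovered plan concentrates on the correct matches and the aligned weights land close to model A's, which is exactly why alignment-then-averaging succeeds where vanilla averaging fails.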

4. Fusion in Multi-Modal, Multi-Embedding, and Multi-Layer Settings

MSE-optimal fusion extends beyond classical parameter merging to scenarios involving modality integration or structural compression.

Multi-layer Steerable Embedding Fusion (MSEF): In time-series forecasting with LLMs, MSEF injects layer-specific “steering vectors” and time-series embeddings at every Transformer block, constructing the per-layer fused input by concatenation:

$$F^{(\ell)} = \big[\, s^{(\ell)} ;\ e_{TS}^{(\ell)} ;\ h^{(\ell-1)} \,\big]$$

Only these layerwise steering vectors and the terminal output layer are trained, with all backbone weights frozen. The objective is MSE on predicted series, and repeated multi-layer fusion is shown to dramatically improve preservation of time-series information throughout the model (Chen et al., 22 Aug 2025).
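As a shape-level illustration only (all dimensions hypothetical), the per-layer fused input is a feature-wise concatenation; in MSEF only the steering vectors and the output head would receive gradients, while the backbone hidden states come from frozen weights:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, d_ts = 32, 16, 8  # hypothetical sequence length and widths

s = rng.normal(size=(d_model,))           # layer-specific steering vector (trainable)
e_ts = rng.normal(size=(seq, d_ts))       # time-series embedding at this layer
h_prev = rng.normal(size=(seq, d_model))  # hidden state from the frozen block below

# F^(l) = [ s^(l) ; e_TS^(l) ; h^(l-1) ]: broadcast s over time, then concatenate
F = np.concatenate(
    [np.broadcast_to(s, (seq, d_model)), e_ts, h_prev], axis=-1
)
assert F.shape == (seq, d_model + d_ts + d_model)
```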

5. MSE-Optimal Layer Fusion for Network Compression and Initialization

The layer-fusion framework also appears in network shrinking, where two adjacent layers are replaced by a single layer mapping, minimizing MSE over a finite dataset (Ghods et al., 2020):

  • Dense $\to$ Dense: Fuse two FC layers, $H_1(x)$ followed by $W_2, b_2$, into a single FC layer, with

$$W^* = W_2 \, C_{a_1, x} \, C_x^{-1}$$

$$b^* = W_2 \mu_1 + b_2 - W^* \mu_0$$

where $C_{a_1,x}$ is the cross-covariance between post- and pre-activation features, $C_x$ is the input covariance, and $\mu_0, \mu_1$ are the empirical means of the inputs and post-activation features, respectively.

  • Convolutional variants: Analogous closed forms exist for conv $\to$ conv or conv $\to$ dense fusion, using cross-covariance contractions and channel-wise Toeplitz matrices.
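The dense-to-dense closed form can be checked numerically on toy data. Since the fused layer is the best affine fit to the two-layer map, its empirical MSE is guaranteed to be no worse than naively composing the two weight matrices while ignoring the nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_h, d_out = 2000, 6, 10, 3  # toy sizes, chosen for illustration

# two FC layers with a ReLU in between: x -> a1 = relu(W1 x + b1) -> y = W2 a1 + b2
W1, b1 = rng.normal(size=(d_h, d_in)), rng.normal(size=d_h)
W2, b2 = rng.normal(size=(d_out, d_h)), rng.normal(size=d_out)

X = rng.normal(size=(n, d_in))
A1 = np.maximum(X @ W1.T + b1, 0.0)   # post-activation features
Y = A1 @ W2.T + b2                    # reference two-layer outputs

mu0, mu1 = X.mean(axis=0), A1.mean(axis=0)
C_x = (X - mu0).T @ (X - mu0) / n     # input covariance
C_a1x = (A1 - mu1).T @ (X - mu0) / n  # cross-covariance of a1 and x

W_star = W2 @ C_a1x @ np.linalg.pinv(C_x)  # W* = W2 C_{a1,x} C_x^{-1}
b_star = W2 @ mu1 + b2 - W_star @ mu0      # b* = W2 mu1 + b2 - W* mu0

mse_fused = ((X @ W_star.T + b_star - Y) ** 2).mean()
mse_naive = ((X @ (W2 @ W1).T + (W2 @ b1 + b2) - Y) ** 2).mean()
assert mse_fused <= mse_naive + 1e-9
```

The residual `mse_fused` is not zero: the ReLU makes the two-layer map nonlinear, and the closed form returns the MSE-optimal affine approximation on the given data, which is precisely the framing used by FuseInit before fine-tuning.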

When used as an initialization (e.g., FuseInit), this approach yields shallower networks that attain or surpass the performance of their deeper teachers after modest fine-tuning (Ghods et al., 2020).

6. Empirical Performance and Computational Considerations

Across vision (ViT-B/32, ViT-L/14), vision-language (BLIP), and time-series tasks, MSE-optimal layer fusion substantially improves performance versus previous non-training or naive averaging baselines:

  • LOT Merging: On ViT-B/32, average accuracy of 82.7% versus 78.3% for the best baseline; on ViT-L/14, 90.5% versus 89.6% (Sun et al., 29 May 2025).
  • OT Fusion: On CIFAR10 with VGG11, OT fusion achieves 86.0% accuracy one-shot, compared to 17.0% for vanilla averaging and 91.34% for ensembles (Singh et al., 2019).
  • MSEF: Mean MSE reduction of 31.8% versus the best external TSF model on long-term forecasting (Chen et al., 22 Aug 2025).

These methods are also computationally efficient: for LOT Merging, merging eight ViT-B/32 task vectors takes 4 seconds on a single RTX 3090 (Sun et al., 29 May 2025).

7. Limitations and Extensions

While the theoretical guarantees of MSE-optimal layer fusion are rigorous under specific data/model assumptions (convexity, channel independence, data representativity), practical limitations include:

  • Representational mismatch: Fusion is optimal for the empirical, calibration-set MSE, not necessarily for downstream task loss.
  • Covariance estimation: Some methods require well-conditioned covariance matrices, enforceable via regularization.
  • Architectural restrictions: Extension to highly non-linear layers or non-standard modules may be non-trivial.
  • Dependence on calibration/sample diversity: The quality of fusion is sensitive to how representative the calibration set or support batch is for the data domain.

Nonetheless, these layerwise, MSE-optimal procedures constitute a principled, scalable, and theoretically sound toolkit for multi-model, multi-task, and multi-modal integration, with extensive demonstration on state-of-the-art deep architectures (Sun et al., 29 May 2025, Singh et al., 2019, Chen et al., 22 Aug 2025, Ghods et al., 2020).
