
MSE-Optimal Layer Fusion

Updated 21 January 2026
  • MSE-optimal layer fusion is a method that fuses neural network layers by minimizing the mean squared error between fused and reference outputs.
  • It employs closed-form solutions such as LOT merging and OT-based alignment to merge linear, elementwise, and bias layers efficiently.
  • Empirical results demonstrate enhanced performance in multi-task, multi-modal, and network compression settings compared to conventional averaging techniques.

MSE-optimal layer fusion refers to the family of algorithms, structural results, and analytic formulations that produce a single neural network layer (or set of layers) to minimize the mean squared error (MSE) between the network’s output—at a specific intermediate, embedding, or final layer—and the reference outputs of several networks, modules, or signal sources. Recent research has clarified both the structural optimality and the computational tractability of such fusion procedures for a variety of operations, architectures, and learning contexts including multi-task model merging, model compression, weight-space initialization, and multi-modal integration.

1. Formal Definitions and Mathematical Framework

The general problem of MSE-optimal layer fusion can be formulated as follows: given a collection of $K$ networks or models $\{W_k\}_{k=1}^K$ (each possibly trained on a separate task, dataset, or modality) and a prescribed layerwise decomposition, seek a new set of weights $\widehat{W}^{(\ell)}$ for each layer $\ell$ such that the aggregate mean squared error of the transformed activations (or outputs) is minimized across models and input data distributions.

Canonical objective: For input $x$ and a feature function $f_k^{\ell}(W; x)$ (the representation at layer $\ell$ within expert $k$), define:

$$\Delta f_k^{\ell} = f_k^{\ell}(W_{\mathrm{fused}}; X_k^{\ell}) - f_k^{\ell}(W_k; X_k^{\ell})$$

where $X_k^{\ell}$ is a matrix of calibration/example inputs at layer $\ell$. The total drift to be minimized is

$$J^{\ell}(T^{\ell}) = \sum_{k=1}^K \left\| \Delta f_k^{\ell} \right\|_F^2 = \sum_{k=1}^K \left\| f_k^{\ell}(W_{\mathrm{fused}}; X_k^{\ell}) - f_k^{\ell}(W_k; X_k^{\ell}) \right\|_F^2$$

This yields a convex, quadratic problem per layer, typically with closed-form solutions for standard layer types, as demonstrated in (Sun et al., 29 May 2025; Ghods et al., 2020; Chen et al., 22 Aug 2025).

2. MSE-Optimal Fusion in Model Merging and Knowledge Integration

Layer-wise Optimal Task Vector Merging (LOT Merging): For model merging, especially in multi-task consolidation, the LOT Merging strategy proceeds by minimizing feature drift at each layer:

  • For linear (matrix-multiply) layers: The function is $f_k^{\ell}(W) = X_k^{\ell} W^{\ell}$, and the optimal update $T^{\ell *}$ to the task vector is

$$T^{\ell *} = \Big( \sum_k (X_k^{\ell})^\top X_k^{\ell} \Big)^{\dagger} \sum_k (X_k^{\ell})^\top X_k^{\ell} T_k^{\ell}$$

where $T_k^{\ell}$ are the task vectors from the fine-tuned experts and $\dagger$ denotes the Moore-Penrose pseudoinverse.
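The linear closed form above can be sketched in NumPy on synthetic stand-ins for the calibration activations and task vectors; `drift` below is the per-layer sum-of-squared-drift objective being minimized:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, d_in, d_out = 3, 64, 8, 5  # toy sizes, for illustration only

# synthetic calibration activations X_k and per-expert task vectors T_k
Xs = [rng.normal(size=(n, d_in)) for _ in range(K)]
Ts = [rng.normal(size=(d_in, d_out)) for _ in range(K)]

G = sum(X.T @ X for X in Xs)                    # summed Gram matrices
rhs = sum(X.T @ X @ T for X, T in zip(Xs, Ts))  # Gram-weighted task vectors
T_star = np.linalg.pinv(G) @ rhs                # closed-form minimizer

# the per-layer objective: sum_k || X_k T - X_k T_k ||_F^2
def drift(T):
    return sum(np.linalg.norm(X @ T - X @ Tk, "fro") ** 2 for X, Tk in zip(Xs, Ts))

# sanity check: the closed form beats plain task-vector averaging
T_avg = sum(Ts) / K
assert drift(T_star) <= drift(T_avg) + 1e-9
```

Since the objective is convex and quadratic in $T$, setting its gradient to zero gives exactly the normal equations that `T_star` solves.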

  • For elementwise scale (Hadamard) layers (e.g., LayerNorm $\gamma$):

$$T^{\ell *}[d] = \frac{ \sum_{k, x} x[d]^2 \, T_k^{\ell}[d] }{ \sum_{k, x} x[d]^2 }$$

  • For bias (shift) layers:

$$T^{\ell *} = \frac{1}{K} \sum_{k=1}^K T_k^{\ell}$$
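A minimal NumPy sketch of the elementwise-scale and bias closed forms, again on synthetic stand-ins for the calibration activations and expert task vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, d = 3, 64, 10  # toy sizes, for illustration only

Xs = [rng.normal(size=(n, d)) for _ in range(K)]  # calibration activations
Ts = [rng.normal(size=d) for _ in range(K)]       # per-expert scale task vectors

# elementwise-scale closed form: a per-dimension average of the T_k[d],
# weighted by the summed squared activations in that dimension
num = sum((X ** 2 * Tk).sum(axis=0) for X, Tk in zip(Xs, Ts))
den = sum((X ** 2).sum(axis=0) for X in Xs)
T_scale = num / den

# bias closed form: a plain average of the expert shift vectors
Bs = [rng.normal(size=d) for _ in range(K)]
T_bias = sum(Bs) / K

# sanity check: the weighted average beats naive averaging on the scale drift
def drift(t):
    return sum(((X * t - X * Tk) ** 2).sum() for X, Tk in zip(Xs, Ts))

assert drift(T_scale) <= drift(sum(Ts) / K) + 1e-9
```

For the Hadamard case the objective decouples across dimensions, which is why the minimizer is a per-dimension weighted average rather than a matrix solve.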

This procedure, combining layerwise analytics and cheap matrix operations, enables efficient consolidation of fine-tuned models without retraining. Its formal convexity and solution structure are detailed in (Sun et al., 29 May 2025).

Connection to Theoretical Optimality

LOT Merging is MSE-optimal under an additive-drift model, which assumes that layerwise mean squared drift dominates multi-task loss transfer and that the loss surface locally preserves Lipschitz continuity in the feature drift. The per-layer minimization guarantees the unique minimizer for the sum-squared-drift objective (Sun et al., 29 May 2025).

3. Fusion via Optimal Transport and Neuron/Filter Alignment

An alternative and influential approach involves neuron- or channel-alignment via optimal transport (OT) before parameter averaging (Singh et al., 2019). This method is crucial in arbitrary-width networks and networks with permutation-symmetric layers.

  • Problem formulation: At layer $\ell$, for each model $k$ with weights $W^{(k)}$, find soft assignment matrices $T^{(k)}$ (relaxing permutation matrices to the set of doubly stochastic matrices via entropic regularization). Then compute the MSE-optimal fused weights as

$$\widehat{W}^{(\ell)} = \frac{1}{K} \sum_{k=1}^K W^{(k)} T^{(k)}$$

  • Sinkhorn Algorithm: Soft alignments are found via the Sinkhorn-Knopp iterative algorithm over a cost matrix based on parameter distance or activation dissimilarity. Once the assignments are computed, the aligned weights are averaged.

This layerwise OT-based fusion is MSE-optimal among all matched/soft-aligned reparameterizations (Singh et al., 2019).
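The alignment step can be illustrated with a small Sinkhorn-Knopp routine over a squared-distance cost between neuron weight vectors; the regularization strength, iteration count, and toy dimensions below are illustrative choices, not values from the cited work:

```python
import numpy as np

def sinkhorn(cost, reg=0.01, n_iter=200):
    """Sinkhorn-Knopp iterations: entropically regularized transport plan
    between uniform marginals over the two models' neurons."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# toy example: model B is a permuted, slightly perturbed copy of model A
rng = np.random.default_rng(0)
W_a = rng.normal(size=(4, 8))                   # 4 neurons, 8 inputs each
perm = rng.permutation(4)
W_b = W_a[perm] + 0.01 * rng.normal(size=(4, 8))

# cost matrix: squared distances between neuron weight vectors
cost = ((W_b[:, None, :] - W_a[None, :, :]) ** 2).sum(axis=-1)
P = sinkhorn(cost)
T = P / P.sum(axis=0, keepdims=True)            # column-normalized soft assignment
W_b_aligned = T.T @ W_b                         # B's neurons in A's ordering
W_fused = 0.5 * (W_a + W_b_aligned)             # average after alignment
```

Because the permuted near-copy makes the cost matrix almost a permutation structure, the recovered plan concentrates on the correct matches and the aligned weights land close to model A's, which is exactly why alignment-then-averaging succeeds where vanilla averaging fails.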

4. Fusion in Multi-Modal, Multi-Embedding, and Multi-Layer Settings

MSE-optimal fusion extends beyond classical parameter merging to scenarios involving modality integration or structural compression.

Multi-layer Steerable Embedding Fusion (MSEF): In time-series forecasting with LLMs, MSEF injects layer-specific “steering vectors” and time-series embeddings at every Transformer block, constructing the per-layer fused input by concatenation:

$$F^{(\ell)} = \big[\, s^{(\ell)} ;\ e_{TS}^{(\ell)} ;\ h^{(\ell-1)} \,\big]$$

Only these layerwise steering vectors and the terminal output layer are trained, with all backbone weights frozen. The objective is MSE on predicted series, and repeated multi-layer fusion is shown to dramatically improve preservation of time-series information throughout the model (Chen et al., 22 Aug 2025).
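As a shape-level illustration only (all dimensions hypothetical), the per-layer fused input is a feature-wise concatenation; in MSEF only the steering vectors and the output head would receive gradients, while the backbone hidden states come from frozen weights:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, d_ts = 32, 16, 8  # hypothetical sequence length and widths

s = rng.normal(size=(d_model,))           # layer-specific steering vector (trainable)
e_ts = rng.normal(size=(seq, d_ts))       # time-series embedding at this layer
h_prev = rng.normal(size=(seq, d_model))  # hidden state from the frozen block below

# F^(l) = [ s^(l) ; e_TS^(l) ; h^(l-1) ]: broadcast s over time, then concatenate
F = np.concatenate(
    [np.broadcast_to(s, (seq, d_model)), e_ts, h_prev], axis=-1
)
assert F.shape == (seq, d_model + d_ts + d_model)
```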

5. MSE-Optimal Layer Fusion for Network Compression and Initialization

The layer-fusion framework also appears in network shrinking, where two adjacent layers are replaced by a single layer mapping, minimizing MSE over a finite dataset (Ghods et al., 2020):

  • Dense $\to$ Dense: Fuse two FC layers, $H_1(x)$ followed by $W_2, b_2$, into a single FC layer, with

$$W^* = W_2 \, C_{a_1, x} \, C_x^{-1}$$

$$b^* = W_2 \mu_1 + b_2 - W^* \mu_0$$

where $C_{a_1,x}$ is the cross-covariance between post- and pre-activation features, $C_x$ is the input covariance, and $\mu_0, \mu_1$ are the empirical means of the inputs and post-activation features, respectively.

  • Convolutional variants: Analogous closed forms exist for conv $\to$ conv or conv $\to$ dense fusion, using cross-covariance contractions and channel-wise Toeplitz matrices.
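The dense-to-dense closed form can be checked numerically on toy data. Since the fused layer is the best affine fit to the two-layer map, its empirical MSE is guaranteed to be no worse than naively composing the two weight matrices while ignoring the nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_h, d_out = 2000, 6, 10, 3  # toy sizes, chosen for illustration

# two FC layers with a ReLU in between: x -> a1 = relu(W1 x + b1) -> y = W2 a1 + b2
W1, b1 = rng.normal(size=(d_h, d_in)), rng.normal(size=d_h)
W2, b2 = rng.normal(size=(d_out, d_h)), rng.normal(size=d_out)

X = rng.normal(size=(n, d_in))
A1 = np.maximum(X @ W1.T + b1, 0.0)   # post-activation features
Y = A1 @ W2.T + b2                    # reference two-layer outputs

mu0, mu1 = X.mean(axis=0), A1.mean(axis=0)
C_x = (X - mu0).T @ (X - mu0) / n     # input covariance
C_a1x = (A1 - mu1).T @ (X - mu0) / n  # cross-covariance of a1 and x

W_star = W2 @ C_a1x @ np.linalg.pinv(C_x)  # W* = W2 C_{a1,x} C_x^{-1}
b_star = W2 @ mu1 + b2 - W_star @ mu0      # b* = W2 mu1 + b2 - W* mu0

mse_fused = ((X @ W_star.T + b_star - Y) ** 2).mean()
mse_naive = ((X @ (W2 @ W1).T + (W2 @ b1 + b2) - Y) ** 2).mean()
assert mse_fused <= mse_naive + 1e-9
```

The residual `mse_fused` is not zero: the ReLU makes the two-layer map nonlinear, and the closed form returns the MSE-optimal affine approximation on the given data, which is precisely the framing used by FuseInit before fine-tuning.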

When used as an initialization (e.g., FuseInit), this approach yields shallower networks that attain or surpass the performance of their deeper teachers after modest fine-tuning (Ghods et al., 2020).

6. Empirical Performance and Computational Considerations

Across vision (ViT-B/32, ViT-L/14), vision-language (BLIP), and time-series tasks, MSE-optimal layer fusion substantially improves performance versus previous non-training or naive averaging baselines:

  • LOT Merging: On ViT-B/32, average accuracy of 82.7% versus 78.3% for the best baseline; on ViT-L/14, 90.5% versus 89.6% (Sun et al., 29 May 2025).
  • OT Fusion: On CIFAR10 with VGG11, OT fusion achieves 86.0% accuracy one-shot, compared to 17.0% for vanilla averaging and 91.34% for ensembles (Singh et al., 2019).
  • MSEF: Mean MSE reduction of 31.8% versus the best external TSF model on long-term forecasting (Chen et al., 22 Aug 2025).

These methods are also computationally efficient: for LOT Merging, merging eight ViT-B/32 task vectors takes 4 seconds on a single RTX 3090 (Sun et al., 29 May 2025).

7. Limitations and Extensions

While the theoretical guarantees of MSE-optimal layer fusion are rigorous under specific data/model assumptions (convexity, channel independence, data representativity), practical limitations include:

  • Representational mismatch: Fusion is optimal for the empirical, calibration-set MSE, not necessarily for downstream task loss.
  • Covariance estimation: Some methods require well-conditioned covariance matrices, enforceable via regularization.
  • Architectural restrictions: Extension to highly non-linear layers or non-standard modules may be non-trivial.
  • Dependence on calibration/sample diversity: The quality of fusion is sensitive to how representative the calibration set or support batch is for the data domain.

Nonetheless, these layerwise, MSE-optimal procedures constitute a principled, scalable, and theoretically sound toolkit for multi-model, multi-task, and multi-modal integration, with extensive demonstration on state-of-the-art deep architectures (Sun et al., 29 May 2025, Singh et al., 2019, Chen et al., 22 Aug 2025, Ghods et al., 2020).
