MSE-Optimal Layer Fusion
- MSE-optimal layer fusion is a method that fuses neural network layers by minimizing the mean squared error between fused and reference outputs.
- It employs closed-form solutions such as LOT Merging, alongside OT-based neuron alignment, to merge linear, elementwise, and bias layers efficiently.
- Empirical results demonstrate enhanced performance in multi-task, multi-modal, and network compression settings compared to conventional averaging techniques.
MSE-optimal layer fusion refers to the family of algorithms, structural results, and analytic formulations that produce a single neural network layer (or set of layers) to minimize the mean squared error (MSE) between the network’s output—at a specific intermediate, embedding, or final layer—and the reference outputs of several networks, modules, or signal sources. Recent research has clarified both the structural optimality and the computational tractability of such fusion procedures for a variety of operations, architectures, and learning contexts including multi-task model merging, model compression, weight-space initialization, and multi-modal integration.
1. Formal Definitions and Mathematical Framework
The general problem of MSE-optimal layer fusion can be formulated as follows: given a collection of networks or models, each possibly trained on a separate task, dataset, or modality, and a prescribed layerwise decomposition, seek a new set of weights for each layer such that the aggregate mean squared error of the transformed activations (or outputs) is minimized across models and input data distributions.
Canonical objective: For calibration inputs $X_\ell$ and feature functions $f_\ell^{(k)}$ (the representation at layer $\ell$ within expert $k$), define the per-layer drift of a candidate fused layer $f_\ell$:

$$\delta_\ell^{(k)} \;=\; \bigl\| f_\ell(X_\ell) - f_\ell^{(k)}(X_\ell) \bigr\|_F^2,$$

where $X_\ell$ is a matrix of calibration/example inputs at layer $\ell$. The total drift to be minimized is

$$\min_{\theta_\ell}\; \sum_{k=1}^{K} \delta_\ell^{(k)}.$$
This yields a convex, quadratic problem per layer, typically with closed-form solutions for standard layer types, as demonstrated in (Sun et al., 29 May 2025, Ghods et al., 2020), and (Chen et al., 22 Aug 2025).
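To make the closed-form structure concrete, here is a minimal NumPy sketch of the per-layer least-squares problem for a linear layer (the function name and shapes are illustrative, not from any of the cited papers): minimizing $\sum_k \|W X - W_k X\|_F^2$ over $W$ yields the pseudoinverse solution below.

```python
import numpy as np

def fuse_linear_layer(X, expert_weights):
    """Closed-form MSE-optimal fused weight for a linear layer.

    Minimizes sum_k ||W X - W_k X||_F^2 over W, where X holds
    calibration inputs (d_in x n) and W_k are the expert weights.
    Setting the gradient to zero gives W* = (sum_k W_k G)(K G)^+,
    with G = X X^T the calibration Gram matrix.
    """
    G = X @ X.T                           # Gram matrix of calibration inputs
    K = len(expert_weights)
    rhs = sum(Wk @ G for Wk in expert_weights)
    return rhs @ np.linalg.pinv(K * G)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 32))              # 4 input dims, 32 calibration samples
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(3, 4))

# With a single expert and full-rank calibration data, fusion recovers it.
W_fused = fuse_linear_layer(X, [W1])
print(np.allclose(W_fused, W1))           # True

# With a shared calibration set, the minimizer is the plain average.
W_pair = fuse_linear_layer(X, [W1, W2])
print(np.allclose(W_pair, 0.5 * (W1 + W2)))  # True
```

Note that when each expert supplies its own calibration activations, the Gram matrices differ per expert and the solution becomes a Gram-weighted combination rather than a plain average.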
2. MSE-Optimal Fusion in Model Merging and Knowledge Integration
Layer-wise Optimal Task Vector Merging (LOT Merging): For model merging, especially in multi-task consolidation, the LOT Merging strategy proceeds by minimizing feature drift at each layer:
- For linear (matrix-multiply) layers: the map is $f(x) = (W_0 + \Delta W)x$, and the MSE-optimal merged task vector is the least-squares solution
$$\Delta W^\star \;=\; \Bigl(\sum_{k=1}^{K} \Delta W_k\, X_k X_k^\top\Bigr)\Bigl(\sum_{k=1}^{K} X_k X_k^\top\Bigr)^{+},$$
where $\Delta W_k = W_k - W_0$ are the task vectors from the fine-tuned experts, $X_k$ are the calibration activations of expert $k$, and $(\cdot)^{+}$ denotes the Moore-Penrose pseudoinverse.
- For elementwise scale (Hadamard) layers (e.g., the LayerNorm gain $\gamma$): the objective separates per coordinate, and the minimizer is an average of the experts' scales weighted by the calibration energy of each coordinate.
- For bias (shift) layers: the drift is input-independent, and the minimizer is the plain average $b^\star = \frac{1}{K}\sum_{k} b_k$.
This procedure, combining layerwise analytics and cheap matrix operations, enables efficient consolidation of fine-tuned models without retraining. Its formal convexity and solution structure are detailed in (Sun et al., 29 May 2025).
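The three per-layer closed forms above can be sketched in a few lines of NumPy (function names and shapes are illustrative; this follows the least-squares objective stated above rather than the paper's exact code):

```python
import numpy as np

def merge_task_vectors_linear(delta_Ws, Xs):
    """Least-squares minimizer of sum_k ||(dW - dW_k) X_k||_F^2:
    dW* = (sum_k dW_k X_k X_k^T)(sum_k X_k X_k^T)^+."""
    num = sum(dW @ (X @ X.T) for dW, X in zip(delta_Ws, Xs))
    den = sum(X @ X.T for X in Xs)
    return num @ np.linalg.pinv(den)

def merge_scales(gammas, Xs):
    """Elementwise-scale layers: per-coordinate average of the experts'
    scales, weighted by each coordinate's calibration energy."""
    energies = [np.sum(X ** 2, axis=1) for X in Xs]  # one value per feature
    num = sum(g * e for g, e in zip(gammas, energies))
    den = sum(energies)
    return num / den

def merge_biases(biases):
    """Shift layers: the drift is input-independent, so the
    minimizer is the plain average."""
    return np.mean(biases, axis=0)
```

As a sanity check, when all experts share identical calibration activations, each of these formulas reduces to simple averaging of the experts' parameters.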
Connection to Theoretical Optimality
LOT Merging is MSE-optimal under an additive-drift model, which assumes that layerwise mean squared drift dominates multi-task loss transfer and that the loss surface locally preserves Lipschitz continuity in the feature drift. The per-layer minimization guarantees the unique minimizer for the sum-squared-drift objective (Sun et al., 29 May 2025).
3. Fusion via Optimal Transport and Neuron/Filter Alignment
An alternative and influential approach aligns neurons or channels via optimal transport (OT) before parameter averaging (Singh et al., 2019). This alignment step is crucial for fusing networks of differing widths and for handling the permutation symmetry of hidden layers.
- Problem formulation: At layer $\ell$, for each model $k$ with weight $W_\ell^{(k)}$, find soft assignment (transport) matrices $T_\ell^{(k)}$, relaxing permutation matrices to the set of doubly stochastic matrices via entropic regularization. The fused weights are then the average of the aligned weights, schematically
$$W_\ell \;=\; \frac{1}{K}\sum_{k=1}^{K} \tilde T_\ell^{(k)}\, W_\ell^{(k)}\, \tilde T_{\ell-1}^{(k)\top},$$
where $\tilde T$ denotes the transport plan rescaled by its marginals, so that both the outgoing neurons of layer $\ell$ and the incoming edges from layer $\ell-1$ are aligned before averaging.
- Sinkhorn Algorithm: Soft alignments are found via the Sinkhorn-Knopp iterative algorithm over a cost matrix based on parameter distance or activation dissimilarity. Once the assignments are computed, the aligned weights are averaged.
This layerwise OT-based fusion is MSE-optimal among all matched/soft-aligned reparameterizations (Singh et al., 2019).
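A minimal single-layer sketch of this pipeline is below (function names are illustrative; incoming-edge alignment from earlier layers is omitted for brevity, and the regularization constant assumes the cost scale of the example):

```python
import numpy as np

def sinkhorn(cost, reg=0.05, iters=300):
    """Entropic-regularized OT between uniform marginals via
    Sinkhorn-Knopp iterations; returns a transport plan whose rows
    and columns sum to 1/n and 1/m. reg must be small relative to
    the cost magnitude for the plan to concentrate."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a = np.full(n, 1.0 / n)            # uniform source marginal
    b = np.full(m, 1.0 / m)            # uniform target marginal
    v = np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_fuse(W_a, W_b, reg=0.05):
    """Align the rows (neurons) of W_b to W_a using a transport plan
    built from pairwise weight distances, then average the aligned
    weights: a one-layer sketch of OT fusion."""
    cost = ((W_a[:, None, :] - W_b[None, :, :]) ** 2).sum(-1)
    T = sinkhorn(cost, reg)
    n = W_a.shape[0]
    W_b_aligned = n * T @ W_b          # barycentric projection of B's neurons
    return 0.5 * (W_a + W_b_aligned)

# Permutation-invariance check: fusing a model with a row-permuted
# copy of itself should recover the original weights.
W_a = 3.0 * np.eye(4, 6)               # well-separated neuron weights
perm = np.array([2, 0, 3, 1])
W_b = W_a[perm]
print(np.allclose(ot_fuse(W_a, W_b), W_a))  # True
```

The permutation check illustrates why alignment matters: vanilla averaging of `W_a` and `W_b` would mix unrelated neurons, while the transport plan undoes the permutation before averaging.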
4. Fusion in Multi-Modal, Multi-Embedding, and Multi-Layer Settings
MSE-optimal fusion extends beyond classical parameter merging to scenarios involving modality integration or structural compression.
Multi-layer Steerable Embedding Fusion (MSEF): In time-series forecasting with LLMs, MSEF injects layer-specific “steering vectors” and time-series embeddings at every Transformer block, constructing the per-layer fused input by concatenating the frozen backbone's hidden state with the steered embedding, schematically $\tilde h_\ell = \mathrm{Concat}\bigl(h_\ell,\; e + s_\ell\bigr)$, where $h_\ell$ is the block-$\ell$ hidden state, $e$ the time-series embedding, and $s_\ell$ the layer-specific steering vector.
Only these layerwise steering vectors and the terminal output layer are trained, with all backbone weights frozen. The objective is MSE on predicted series, and repeated multi-layer fusion is shown to dramatically improve preservation of time-series information throughout the model (Chen et al., 22 Aug 2025).
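A schematic NumPy sketch of the fusion step (the notation and shapes are assumptions for illustration, not the paper's exact formulation):

```python
import numpy as np

# Per-block fusion of a frozen backbone state h_l with a shared
# time-series embedding e, steered by a trainable per-layer vector s_l.
rng = np.random.default_rng(0)
d, num_layers = 16, 4
e = rng.normal(size=d)                          # time-series embedding
steering = [np.zeros(d) for _ in range(num_layers)]  # trainable; init at zero

def fuse_block_input(h_l, s_l):
    """Concatenate the backbone hidden state with the steered embedding;
    only s_l (and the output head) would receive gradients."""
    return np.concatenate([h_l, e + s_l])

h = rng.normal(size=d)                          # hidden state from block 0
fused = fuse_block_input(h, steering[0])
print(fused.shape)                              # (32,)
```

Because the backbone stays frozen, the trainable parameter count is just `num_layers * d` steering entries plus the output layer, which is what makes the approach cheap to fit under the MSE forecasting objective.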
5. MSE-Optimal Layer Fusion for Network Compression and Initialization
The layer-fusion framework also appears in network shrinking, where two adjacent layers are replaced by a single layer mapping, minimizing MSE over a finite dataset (Ghods et al., 2020):
- Dense → Dense: Fuse two FC layers $x \mapsto W_1 x + b_1$ then $h \mapsto W_2\,\sigma(h) + b_2$ into a single FC layer $x \mapsto Wx + b$, with the linear-regression closed form
$$W^\star = C_{zx}\,C_{xx}^{+}, \qquad b^\star = \mu_z - W^\star \mu_x,$$
where $C_{zx}$ is the cross-covariance between post-activation (output) and pre-activation (input) features, $C_{xx}$ is the input covariance, and $\mu_z, \mu_x$ are empirical means.
- Convolutional variants: Analogous closed forms exist for conv → conv or conv → dense fusion, using cross-covariance contractions and channel-wise Toeplitz matrices.
When used as an initialization (e.g., FuseInit), this approach yields shallower networks that attain or surpass the performance of their deeper teachers after modest fine-tuning (Ghods et al., 2020).
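The dense → dense case is short enough to write out directly; the sketch below (illustrative names, ReLU chosen as the activation) fits the single affine map to the two-layer outputs over a calibration batch using the regression closed form above:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fuse_two_dense(W1, b1, W2, b2, X):
    """Replace x -> W2 relu(W1 x + b1) + b2 with a single affine map
    W x + b that is MSE-optimal over the columns of X, via the
    linear-regression closed form W = C_zx C_xx^+, b = mu_z - W mu_x."""
    Z = W2 @ relu(W1 @ X + b1[:, None]) + b2[:, None]   # reference outputs
    mu_x = X.mean(axis=1, keepdims=True)
    mu_z = Z.mean(axis=1, keepdims=True)
    Xc, Zc = X - mu_x, Z - mu_z
    n = X.shape[1]
    C_xx = Xc @ Xc.T / n                                 # input covariance
    C_zx = Zc @ Xc.T / n                                 # cross-covariance
    W = C_zx @ np.linalg.pinv(C_xx)
    b = (mu_z - W @ mu_x).ravel()
    return W, b
```

A useful sanity check: if the nonlinearity is inactive on the calibration data (e.g., a large positive bias keeps every pre-activation in the ReLU's linear region), the fused layer recovers the exact composition $W = W_2 W_1$, $b = W_2 b_1 + b_2$.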
6. Empirical Performance and Computational Considerations
Across vision (ViT-B/32, ViT-L/14), vision-language (BLIP), and time-series tasks, MSE-optimal layer fusion substantially improves performance over training-free and naive averaging baselines:
- LOT Merging: On both ViT-B/32 and ViT-L/14 multi-task benchmarks, LOT Merging achieves higher average accuracy than the best training-free merging baseline (Sun et al., 29 May 2025).
- OT Fusion: On CIFAR10 with VGG11, one-shot OT fusion achieves substantially higher accuracy than vanilla (unaligned) averaging, with the full ensemble serving as an upper reference point (Singh et al., 2019).
- MSEF: Consistent mean-MSE reductions versus the best external TSF model on long-term forecasting benchmarks (Chen et al., 22 Aug 2025).
These methods are also computationally efficient—e.g., for LOT merging, merging eight ViT-B/32 task vectors takes 44 seconds on a single RTX 3090 (Sun et al., 29 May 2025).
7. Limitations and Extensions
While the theoretical guarantees of MSE-optimal layer fusion are rigorous under specific data/model assumptions (convexity, channel independence, data representativity), practical limitations include:
- Representational mismatch: Fusion is optimal for the empirical, calibration-set MSE, not necessarily for downstream task loss.
- Covariance estimation: Some methods require well-conditioned covariance matrices, enforceable via regularization.
- Architectural restrictions: Extension to highly non-linear layers or non-standard modules may be non-trivial.
- Dependence on calibration/sample diversity: The quality of fusion is sensitive to how representative the calibration set or support batch is for the data domain.
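The covariance-conditioning point admits a standard remedy: Tikhonov (ridge) regularization of the calibration Gram matrix. The sketch below is an assumption-level illustration of that fix applied to linear-layer fusion, not a specific paper's recipe:

```python
import numpy as np

def fuse_linear_ridge(X, expert_weights, lam=1e-3):
    """Least-squares layer fusion with Tikhonov regularization:
    adding lam * I to the calibration Gram matrix keeps the solve
    well conditioned when X is rank-deficient or nearly so."""
    d = X.shape[0]
    G = X @ X.T
    K = len(expert_weights)
    rhs = sum(Wk @ G for Wk in expert_weights)
    return rhs @ np.linalg.inv(K * (G + lam * np.eye(d)))
```

As `lam` shrinks toward zero on well-conditioned calibration data, the regularized solution converges to the unregularized pseudoinverse solution; larger `lam` trades calibration-set fidelity for stability.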
Nonetheless, these layerwise, MSE-optimal procedures constitute a principled, scalable, and theoretically sound toolkit for multi-model, multi-task, and multi-modal integration, with extensive demonstration on state-of-the-art deep architectures (Sun et al., 29 May 2025, Singh et al., 2019, Chen et al., 22 Aug 2025, Ghods et al., 2020).