Model Merging via Multi-Teacher Knowledge Distillation
Abstract: Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing such theoretical guarantees is non-trivial, as the merging process typically precludes access to the original training data and involves combining fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This dependence is most critical in coefficient scaling, the weighting factors that modulate the magnitude of each fine-tuned model's contribution to the shared parameters. Without a principled objective to guide their selection, these methods exhibit brittle performance and high sensitivity to scaling initialization. We address this gap by (i) establishing a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting. This analysis introduces a "cross-task heterogeneity" term that formally captures the mismatch between diverse fine-tuned model priors and the target multi-task distributions. Guided by this theoretical insight, (ii) we frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data. We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk. Building on the flatness-aware bound, (iii) we operationalize this objective via SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima. Empirically, SAMerging establishes a new state of the art across vision and NLP benchmarks while requiring as few as 16 unlabeled samples per task. The code is available at https://github.com/arshandalili/SAMerging.
Explain it Like I'm 14
Overview
This paper is about a smart way to combine several trained AI models into one model that can do many tasks well. Instead of retraining a huge model on all tasks together (which needs lots of data, time, and computing), the authors show how to “merge” existing task-specific models so the final model behaves like a good all-rounder. They also explain, with theory, why their merging method should work and how to make it reliable.
Key Objectives
The paper asks three simple questions:
- How can we safely merge different fine-tuned models into one multi-task model without needing the original training data?
- What rules or goals should we use when merging so the final model performs well on all tasks?
- Can we design a method that is both accurate and data-efficient, using only a small number of unlabeled samples per task?
Methods and Approach
The main idea in everyday terms
- Think of each task-specific model as a “teacher” and the merged model as a “student.” The student should learn to imitate each teacher on that teacher’s task.
- The authors use a technique called knowledge distillation. The student looks at the teachers’ “soft” predictions (how likely each answer is) and tries to match them. This matching is measured by KL divergence, which is a way to check how different two sets of predictions are. You can think of it like comparing two pie charts: KL divergence is larger if the slices (probabilities) are very different.
- They also use an optimizer called SAM (Sharpness-Aware Minimization). Imagine the loss surface as a bowl around the model’s parameters. A “sharp” bowl is like a steep, narrow valley—small changes can make performance drop quickly. A “flat” bowl is wide and gentle—small changes don’t hurt much. SAM nudges training toward flatter bowls, which often means better generalization to new data.
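The pie-chart comparison can be made concrete in a few lines of Python. This is a toy illustration (the class probabilities below are made up, and this is not the paper's implementation):

```python
import math

def kl_divergence(p, q):
    """KL(p || q): how much q's 'pie chart' deviates from p's."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.7, 0.2, 0.1]          # teacher's soft prediction over 3 classes
student_close = [0.6, 0.25, 0.15]  # slices similar to the teacher's
student_far = [0.1, 0.2, 0.7]      # slices very different from the teacher's

# Matching predictions give zero divergence; mismatched ones give a large value.
assert kl_divergence(teacher, teacher) == 0.0
assert kl_divergence(teacher, student_close) < kl_divergence(teacher, student_far)
```

Training the student to lower this number on each teacher's task is exactly the "imitate the teacher" step described above.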
Why theory matters here
- Merging models is tricky because the models were trained on different data and may not perfectly agree. The authors give a theoretical guarantee (a PAC-Bayes bound) that connects generalization (doing well on new data) to two things: 1) how flat the merged model's loss bowl is, and 2) how different the tasks are from each other (called cross-task heterogeneity).
- This bound says: if we make the merged model sit in a flat region and we align it well with each teacher, we can expect lower error across tasks.
- They also show that directly minimizing the student–teacher KL divergence reduces the merged model’s excess risk (extra errors beyond the best possible), giving a clear, principled target to optimize.
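For readers curious what such a guarantee looks like, the classic PAC-Bayes bound that this family of analyses builds on has the following textbook (McAllester-style) form; the paper's actual theorem additionally carries flatness and cross-task heterogeneity terms not shown here:

```latex
% With probability at least 1 - \delta over an i.i.d. sample S of size n,
% simultaneously for every posterior Q over parameters and a fixed prior P:
L_{\mathcal{D}}(Q) \;\le\; \widehat{L}_{S}(Q)
  \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

Intuitively, test error is bounded by training error plus a complexity penalty that shrinks when the posterior Q stays close to the prior P, which is what concentrating in a broad, flat low-loss region achieves.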
How the method works in practice
- Use a few unlabeled examples from each task (as few as 16 per task).
- For each task, feed these samples into the teacher model and the student (merged) model.
- Minimize the KL divergence between the student’s predictions and the teacher’s predictions.
- Apply SAM during optimization to find flatter solutions, improving stability and generalization.
- The final merged model has no extra parts or overhead at test time—it’s just one model that can handle all tasks.
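The steps above can be mocked up end to end. The following is a toy numpy sketch, not the paper's implementation: linear maps stand in for networks, finite differences stand in for autograd, and the shapes, learning rate, and SAM radius rho are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # Mean forward KL(p || q) over a batch of rows.
    return np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1))

# Toy stand-ins: linear "models" over 4 classes instead of full networks.
d, k, T = 8, 4, 2
theta_pre = rng.normal(size=(d, k))                       # pretrained weights
task_vecs = [rng.normal(scale=0.1, size=(d, k)) for _ in range(T)]
teachers = [theta_pre + tv for tv in task_vecs]           # fine-tuned teachers
data = [rng.normal(size=(16, d)) for _ in range(T)]       # 16 unlabeled samples/task

coeffs = np.full(T, 0.3)  # merging coefficients to calibrate

def distill_loss(c):
    # Student = pretrained weights + coefficient-weighted sum of task vectors.
    student = theta_pre + sum(ci * tv for ci, tv in zip(c, task_vecs))
    return np.mean([kl(softmax(data[t] @ teachers[t]), softmax(data[t] @ student))
                    for t in range(T)])

def grad(c, eps=1e-4):
    # Finite differences stand in for autograd in this toy.
    g = np.zeros_like(c)
    for i in range(len(c)):
        e = np.zeros_like(c); e[i] = eps
        g[i] = (distill_loss(c + e) - distill_loss(c - e)) / (2 * eps)
    return g

rho, lr = 0.05, 0.5
loss0 = distill_loss(coeffs)
for _ in range(50):
    g = grad(coeffs)
    adv = coeffs + rho * g / (np.linalg.norm(g) + 1e-12)  # SAM: ascend to worst case
    coeffs = coeffs - lr * grad(adv)                      # descend using that gradient
```

The two-step update is the essence of SAM: first perturb the coefficients in the direction that worsens the loss most (within radius rho), then apply the gradient computed at that perturbed point, so the solution ends up in a flatter bowl.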
Main Findings
Here are the main results and why they matter:
- The proposed method, called SAMerging, achieves state-of-the-art performance across image and language benchmarks compared to previous merging methods.
- It needs far fewer unlabeled examples per task than other approaches (for example, 16 instead of thousands), making it very data-efficient.
- It is robust to tricky details like how you scale or initialize merging coefficients—its performance is stable.
- It works across different backbones (like CLIP ViT for vision and GPT-2/DeBERTa for language), showing it’s broadly useful.
- The loss landscape visualizations show SAMerging finds flatter “bowls,” supporting the theoretical claim that flatter solutions generalize better.
Implications and Impact
- For teams that have multiple specialized models and want one unified model without retraining on all data, SAMerging offers a practical, low-cost solution.
- Because it only needs a small set of unlabeled samples per task, it fits well in settings where data sharing is restricted (privacy or legal concerns), like federated or on-device learning.
- The method avoids extra test-time costs—no separate adapters or per-task heads—so it’s suitable for deployment on resource-limited devices.
- The theory provides a roadmap: aim for flat minima and reduce student–teacher divergence to improve generalization in model merging.
- Limitations remain in very hard cases, such as tasks that strongly interfere with each other or have conflicting label spaces, and the theory relies on local linear approximations. Still, the approach is a strong step toward reliable, data-efficient model merging.
Knowledge Gaps
Below is a single, focused list of what remains missing, uncertain, or unexplored in the paper, phrased to be concrete and actionable for follow-up research.
- Validate the PAC-Bayes/NTK-based generalization guarantees beyond local linearization: quantify how far the merged parameters can move from the pretrained point before NTK assumptions break, and derive bounds that remain non-vacuous without NTK or with weaker smoothness/convexity requirements on the loss.
- Tighten or refine the cross-task heterogeneity term: provide practical estimators for the heterogeneity quantity on unlabeled calibration data, and develop diagnostics that predict when merging will fail (e.g., thresholds to exclude tasks or downweight teachers).
- Prior selection and posterior modeling: study how the choice of data-free prior P (mean and covariance) and Gaussian posterior assumptions affect bound tightness and algorithm behavior; propose ways to learn or approximate these distributions from available checkpoints.
- Bridge mixture-posterior analysis to deterministic merged weights: quantify the mismatch between the theoretical mixture posterior and the actual deterministic, layer-wise parameter combination; develop bounds or algorithms that directly cover the deterministic merging procedure.
- Coefficient learning and alignment with evaluation mixture: investigate methods to learn or estimate task weights α (evaluation mixture) from calibration data and to adapt β (analysis/merging mixture) accordingly, especially under unknown or shifting task prevalences.
- Robustness to poor or miscalibrated teachers: design teacher-weighting or teacher-selection strategies that downweight unreliable fine-tuned models, and analyze how teacher error (the second term in Theorem 3) can be reduced or bounded in practice.
- Alternatives to forward KL for multi-teacher distillation: compare divergences (reverse KL, Jensen–Shannon, α-divergences, MMD), temperature scaling, label smoothing, and confidence calibration to reduce brittleness when teachers disagree or are miscalibrated.
- Sample efficiency and active calibration: derive sample-complexity guarantees linking the number of unlabeled examples per task to excess-risk reduction; explore active selection or data synthesis (e.g., generative augmentation) to optimize calibration under severe data scarcity.
- Hyperparameter sensitivity and stability: systematically study the impact of SAM radius ρ, layer-wise coefficient initialization, learning rates, and temperature; provide principled tuning rules or adaptive schemes tied to the theoretical bound’s terms.
- Flatness proxies at calibration-time: evaluate lighter-weight flatness proxies (e.g., Hessian trace/diagonal approximations, stochastic sharpness measures) that reduce SAMerging’s calibration-time cost, and quantify their trade-offs in bound-tightness and accuracy.
- Handling extreme task heterogeneity: extend and stress-test SAMerging under heavy domain shift, conflicting or overlapping label spaces, multi-label classification, and imbalanced task difficulties; propose task-specific constraints or gating mechanisms to mitigate negative transfer.
- Multi-modal and cross-architecture merging: assess generalization when merging models across modalities (e.g., vision–language–audio) or different architectures/backbones; define conditions under which cross-architecture merging is theoretically and empirically feasible.
- Beyond classification: extend the framework to generative, sequence-to-sequence, detection, segmentation, and structured prediction tasks; adapt the distillation objective and bounds to non-0/1 losses and non-softmax output distributions.
- Runtime and memory characterization: provide detailed measurements of calibration-time overhead, inference latency, and memory footprint across hardware; identify bottlenecks and propose engineering optimizations for practical deployment.
- Layer-wise vs. block-/module-wise merging: compare granularity of coefficient learning (layer, block, attention module, LoRA adapter) and quantify how granularity affects heterogeneity terms, flatness, and empirical performance.
- Weight-space symmetry and permutation issues: investigate whether permutation alignment (e.g., neuron matching) prior to merging further reduces heterogeneity and improves flatness, and formalize its effect on the bound.
- Safety, robustness, and fairness: evaluate adversarial robustness, uncertainty calibration, OOD performance, and fairness across tasks post-merging; study whether flatter minima from SAMerging improve these properties or require additional regularization.
- Adaptive task inclusion/exclusion: develop algorithms that automatically exclude or softly gate tasks that harm the merged model (based on heterogeneity estimates or validation losses), and provide theoretical guarantees for such gating.
- Integration with parameter-efficient fine-tuning: quantify interactions between SAMerging and LoRA/adapters; assess whether post-hoc merging of PEFT modules can retain O(1) inference cost and still satisfy the bound’s assumptions.
- Theoretical constants and margin conditions: replace coarse constants in the excess-risk bounds (Theorem 3) with tighter, margin-aware analyses (e.g., Tsybakov conditions), and study when these refinements yield actionable improvements in practice.
- Task-weight learning under deployment dynamics: propose online estimation or reweighting of α as task arrival rates and domain distributions change over time, with guarantees on regret or stability in continual/streaming settings.
- Diagnostics for NTK-faithful neighborhoods: provide practical tools to measure dispersion and kernel-weighted penalties highlighted in Theorem 2 on real models, guiding when to enforce distance-to-pretrain constraints or coefficient regularization.
- Comparative evaluation breadth: expand baselines to include strong adapter/ensemble approaches with careful accounting of inference overhead; test larger LLMs and more diverse datasets to assess scalability and generality of SAMerging.
Practical Applications
Immediate Applications
The following applications can be deployed now using the paper’s method (SAMerging) and insights, given access to fine-tuned checkpoints and small unlabeled calibration sets per task.
- Healthcare: Consolidate multiple departmental classifiers (e.g., radiology modality identification, pathology slide triage, dermatology condition screening) into a single on-prem model to cut inference latency and memory on hospital GPUs.
- Sector: healthcare
- Tools/products/workflows: “Federated Model-Merger” service that ingests task-specific checkpoints and performs SAMerging on a small unlabeled buffer; hospital MLOps pipeline adds a “Merge & Distill” stage before deployment; risk report using the cross-task heterogeneity score
- Assumptions/dependencies: access to fine-tuned checkpoints; 16–1600 unlabeled samples per task; current scope is classification; label spaces per task shouldn’t conflict; licenses allow weight consolidation
- Content moderation and creator platforms: Replace ensembles of CLIP-based classifiers (nudity, violence, hate symbols, spam thumbnails) with a single merged model to reduce costs and improve responsiveness.
- Sector: software/media
- Tools/products/workflows: Hugging Face integration that wraps SAMerging as a “multitask consolidator” for CLIP variants; batch calibration on unlabeled platform data; automatic coefficient learning with SAM
- Assumptions/dependencies: small unlabeled samples represent platform distribution; tasks are classification; SAM adds one-time calibration compute
- E-commerce catalog intelligence: Merge vision and text classifiers for attribute tagging (color, material, style) and quality filters into a unified model that runs at ingestion time.
- Sector: retail/e-commerce
- Tools/products/workflows: “Edge Multi-Task SDK” for warehouse devices and ingestion pipelines; CI/CD merge job; periodic re-calibration with fresh unlabeled samples
- Assumptions/dependencies: consistent encoder backbones across tasks; tasks are compatible classification problems; sufficient unlabeled item images/text
- Manufacturing quality inspection: Unify multiple line-specific defect detectors into a single model to run on embedded cameras with O(1) memory/inference cost.
- Sector: industrial/robotics
- Tools/products/workflows: on-device SAMerging using small unlabeled buffers from each line; coefficient regularization to stay near the pretrained region (NTK-faithful)
- Assumptions/dependencies: compatible backbones; unlabeled frames per line; performance depends on flatness of fine-tuned models and low cross-task heterogeneity
- Security and surveillance: Merge object, action, and scene classifiers into one camera model for analytics on NVRs or edge gateways.
- Sector: public safety/IoT
- Tools/products/workflows: “Surveillance Merger” appliance that periodically calibrates on unlabeled footage; deployment of a single backbone per device
- Assumptions/dependencies: classification scope; unlabeled footage availability; legal compliance for model combination and calibration data use
- Back-office NLP: Consolidate intent detection, sentiment analysis, spam filtering, topic routing into a single GPT-2/DeBERTa-based classifier for email/helpdesk queues.
- Sector: enterprise software
- Tools/products/workflows: MLOps plugin providing layer-wise merging; KD loss with task-wise batches; monitoring of the heterogeneity term to gate merges
- Assumptions/dependencies: similar model architecture across tasks; unlabeled text buffers; tasks not generative; domain drift monitored
- Finance operations: Merge document classifiers (KYC doc type, AML risk flags), ticket triage, and complaint sentiment into one model for back-office automation.
- Sector: finance
- Tools/products/workflows: “Model Consolidation Service” in bank MLOps; excess-risk summaries (from KL fit term) for internal audit
- Assumptions/dependencies: classification; strict license/compliance checks; calibration on unlabeled internal corpora; risk acceptance based on bound reports
- Edge devices and mobile: Ship a single on-device model that handles multiple perception tasks (scene type, quality score, blur detection, OCR triage) in camera apps, reducing app memory and battery drain.
- Sector: consumer/daily life
- Tools/products/workflows: SDK offering SAMerging as an offline step with user-opt-in local photos for unlabeled calibration; dynamic coefficient initialization that minimizes sensitivity
- Assumptions/dependencies: small local unlabeled buffers; privacy-preserving on-device calibration; tasks are classification; compute headroom for one-time merge
- Federated learning aggregators: Merge client-specific fine-tuned models without sharing raw data, using only small unlabeled calibration batches from the federation’s target domains.
- Sector: privacy/federated AI
- Tools/products/workflows: “Federated Merger” server role combining client checkpoints via multi-teacher KD; coefficient assignment that down-weights sharp or outlier models
- Assumptions/dependencies: client model compatibility; unlabeled target-domain samples centrally available; governance for client model licensing and privacy
- MLOps and research infrastructure: Introduce a heterogeneity-aware merge stage that auto-tunes layer-wise coefficients with SAM and produces bound-derived diagnostics (flatness proxies, mixture gaps).
- Sector: academia/industry research
- Tools/products/workflows: PAC-Bayes “Heterogeneity Monitor” to predict mergeability; visualization of loss landscape stability; ablation reports (SAM vs no-SAM; KL vs entropy)
- Assumptions/dependencies: shared pretrained base; local NTK approximation reasonable (merge stays near base); reproducible access to task vectors/checkpoints
Long-Term Applications
These applications require further research, scaling, or development beyond the current classification-focused scope and/or assumptions in the paper.
- Generative model merging (LLMs and diffusion): Extend SAMerging and the excess-risk guarantees to autoregressive and generative tasks (summarization, code generation, image synthesis), enabling a single model to retain specialized generative capabilities across domains.
- Sector: software/creative tools/education
- Potential product/workflow: “Generative Merger” with sequence-level KD and flatness-aware objectives; task-aware prompts and adapters replaced by unified backbone
- Assumptions/dependencies: new theory for generative losses; careful handling of label spaces and prompts; evaluation beyond classification; compute for calibration
- Multi-label and conflicting label-space merging: Develop methods to handle overlapping or conflicting label sets and multi-label classification where current bounds and objectives may need adaptation.
- Sector: healthcare/media/security
- Potential product/workflow: label-space alignment tools; constraint-aware coefficient learning to mitigate negative transfer
- Assumptions/dependencies: extended PAC-Bayes terms for multi-label; robust strategies for domain shift and label conflicts
- Dynamic, inference-time coefficient adaptation: Adjust merge coefficients online using small unlabeled buffers to counter domain drift (e.g., seasonal catalog changes, new content trends).
- Sector: e-commerce/media
- Potential product/workflow: “Adaptive Merger Runtime” with periodic SAM-enhanced micro-calibrations; drift detectors trigger re-merge
- Assumptions/dependencies: budgeted on-device/server compute; stable safety/latency under frequent merges; robust monitoring of heterogeneity
- Continual and personalized federated learning: Periodically merge personalized client models (per user/site) into a global backbone, maintaining privacy while improving global generalization.
- Sector: privacy/edge AI/healthcare
- Potential product/workflow: federated orchestration that alternates local fine-tuning with central SAMerging; client-side personalization through small local KD
- Assumptions/dependencies: fairness and personalization metrics; licenses permitting checkpoint sharing; secure handling of unlabeled calibration buffers
- Cross-modal and multimodal assistants (MLLMs): Merge specialized models across text, image, and audio tasks into one multimodal assistant backbone with O(1) inference overhead.
- Sector: robotics/education/enterprise support
- Potential product/workflow: modality-aware KD loss; cross-modal flatness proxies; unified inference pipeline replacing adapters/heads
- Assumptions/dependencies: compatible multimodal backbones; new bounds capturing cross-modal heterogeneity; larger-scale calibration datasets
- Safety, fairness, and regulatory auditing: Use the paper’s heterogeneity term and bound components as quantitative signals in AI risk assessments (e.g., to justify consolidation without data sharing).
- Sector: policy/regulation
- Potential product/workflow: standardized “Merge Risk Report” for audits summarizing flatness, dispersion, mixture gaps; policy templates encouraging data-minimization via unlabeled calibration
- Assumptions/dependencies: accepted methodologies for bound reporting; governance for model provenance and licensing; sector-specific compliance criteria
- Carbon and cost accounting for AI: Institutionalize energy and cost savings from replacing n models with a single merged backbone, informing green AI procurement and SLAs.
- Sector: energy/datacenter operations
- Potential product/workflow: “Merge-to-Green” calculators correlating O(1) inference cost with kWh and dollars saved; procurement policies favoring consolidation
- Assumptions/dependencies: reliable measurements of workload characteristics; agreement on reporting standards; validation under varying task loads
- Adapter-less enterprise AI platform: Offer a managed service that ingests diverse fine-tuned models from different teams and emits one audited, heterogeneity-aware merged model.
- Sector: enterprise software
- Potential product/workflow: multi-tenant model registry with merge governance; pre-merge simulation using bound-driven criteria; rollback and A/B tools
- Assumptions/dependencies: inter-team alignment on backbones and licenses; strong observability; escalation paths when heterogeneity is high
- On-device personalization for end-users: Merge vendor models with per-user fine-tunes locally using only unlabeled personal data (e.g., camera gallery, notes), producing a single personalized backbone.
- Sector: consumer/daily life
- Potential product/workflow: “Personal Merge” app; privacy-preserving local KD; periodic lightweight SAM calibration
- Assumptions/dependencies: on-device compute headroom; secure storage; user consent; safeguards to avoid overfitting or bias
- Research extensions and curricula: Formalize mergeability diagnostics, coefficient regularizers, and lighter flatness proxies that reduce SAM overhead, and integrate into ML courses and toolkits.
- Sector: academia
- Potential product/workflow: educational modules on PAC-Bayes and model merging; benchmark suites expanding beyond classification; open-source libraries implementing heterogeneity-aware merges
- Assumptions/dependencies: community adoption; reproducibility across tasks and backbones; broader empirical validation beyond NTK-local regimes
Glossary
- AdaMerging: A data-dependent model merging method that learns per-layer or per-task coefficients by optimizing a confidence heuristic. "AdaMerging learns (layer-/task-wise) merge coefficients by minimizing entropy [Yang et al., 2023]."
- Bayes optimal risk: The minimum possible classification error achievable by any classifier given the true conditional label distribution. "and the Bayes optimal risk by R^{0-1,*} := E_{(x,y)~D}[1 - max_y s(y | x)]."
- Cross-task heterogeneity: A term quantifying how much a fine-tuned model’s performance deteriorates when applied to a different task’s distribution. "This analysis introduces a "cross-task heterogeneity" term that formally captures the mismatch between diverse fine-tuned model priors and the target multi-task distributions."
- Entropy minimization: An optimization heuristic that encourages a model to produce confident (low-entropy) predictions, not necessarily aligned with correctness. "methods such as AdaMerging [Yang et al., 2023] use entropy minimization without an explicit excess-risk guarantee."
- Evaluation mixture: A weighted combination of task distributions used to evaluate multi-task performance. "Evaluation mixture. Let α ∈ Δ^{T-1} denote the weights of the evaluation mixture across tasks."
- Excess risk: The gap between a model’s error and the Bayes optimal error; used to assess how much worse a model is than the best possible. "We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk."
- Fisher Information Matrix (FIM): A matrix capturing the curvature of the loss landscape with respect to model parameters; used to weight parameter merging. "Fisher Merging estimates Fisher information from gradients on unlabeled data to make the Fisher Information Matrix (FIM) [Matena and Raffel, 2022]"
- Fisher Merging: A model merging technique that uses the Fisher Information Matrix to weight parameters during averaging. "Fisher Merging estimates Fisher information from gradients on unlabeled data to make the Fisher Information Matrix (FIM) [Matena and Raffel, 2022] and uses that as the weight for merging"
- Gaussian posterior: A Bayesian posterior distribution over model parameters modeled as a Gaussian. "for all Gaussian posteriors Q_t = N(μ_t, Σ_t)"
- Inductive bias: Prior assumptions embedded in a learning algorithm that guide generalization across tasks. "enables knowledge transfer with inductive bias and shared representations [Caruana, 1997, Baxter, 2000, Wu et al., 2023]."
- Isotropic Merging: A data-free merging approach that aims to balance parameter contributions uniformly across directions. "Isotropic Merging [Marczak et al., 2025]"
- Kullback-Leibler (KL) divergence: A measure of how one probability distribution diverges from another; used to align student and teacher distributions. "We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk."
- Label-free calibration: Adapting a merged model using unlabeled data to improve multi-task performance without incurring inference overhead. "We thus trade minimal label-free calibration for zero inference overhead."
- Loss basin: A region in parameter space around a solution with similar loss values; flatter basins correlate with better generalization. "flatness of its loss basin"
- Loss landscape: The shape of the loss function over parameter space, including minima and sharpness properties. "Flatter loss landscapes have long been associated with better generalization"
- Mixture-of-experts: An architecture that routes inputs to specialized expert subnetworks and combines their outputs. "mixture-of-experts to learn which experts to share per task [Hazimeh et al., 2021, Tang et al., 2020]."
- Mixture posterior: A randomized predictor formed by mixing task-specific posteriors according to weights. "We analyze a mixture posterior Q_merge := Σ_t β_t Q_t,"
- Model merging: The process of combining multiple task-specific fine-tuned models into a single multi-task model without joint training. "Model merging seeks to combine fine- tuned models into a single model"
- Model soups: Simple weight-averaging of multiple fine-tuned models to improve accuracy without extra inference cost. "simple weight averaging/model soups [Wortsman et al., 2022]"
- Multi-task learning (MTL): Training a single model on multiple tasks simultaneously to leverage shared representations. "Joint training for multi-task learning (MTL) aggregates data from different tasks to learn them jointly."
- Multi-teacher knowledge distillation: Compressing several teacher models into a single student by matching their output distributions. "We cast model merging as multi-teacher knowledge distillation"
- Neural Tangent Kernel (NTK): A kernel that approximates neural network behavior via linearization around a reference point. "NTK approximation as done in Jacot et al. [2020]"
- Non-vacuous bounds: Generalization bounds that provide meaningful (non-trivial) guarantees rather than loose or uninformative ones. "yielding non-vacuous bounds [Neyshabur et al., 2017, Petzka et al., 2021, Dziugaite and Roy, 2017]."
- PAC-Bayes generalization bound: A probabilistic bound on generalization error leveraging Bayesian priors/posteriors over parameters. "we establish a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting."
- Pareto-optimal trade-offs: Solutions in multi-objective optimization where improving one objective would worsen another. "seek Pareto-optimal trade-offs with convergence guarantees and controllable preferences [Lin et al., 2019, Shamsian et al., 2023]."
- Posterior concentration: The phenomenon of a posterior distribution placing most mass in a region of low loss, aiding tighter bounds. "such that when the weight posterior concentrates in a broad low-loss region, complexity terms shrink, yielding non-vacuous bounds"
- Prior: A distribution over parameters before observing data, used in PAC-Bayes analysis. "Let P = N(μ_P, Σ_P) be a data-free prior on Θ"
- Posterior: A distribution over parameters after observing data; task-dependent in model merging. "for each task t let Q_t = N(μ_t, Σ_t) be a task-dependent posterior."
- RegMean: A data-dependent merging method that regularizes averaging using feature inner-product Gram matrices. "RegMean/RegMean++ compute feature inner-product Gram matrices to regularize averaging"
- RegMean++: An enhanced version of RegMean with improved regularization for merging. "RegMean/RegMean++ compute feature inner-product Gram to regularize averaging"
- Sharpness-Aware Minimization (SAM): An optimization method that penalizes worst-case loss in a local neighborhood to find flatter minima. "Sharpness-Aware Minimization (SAM) [Foret et al., 2021] achieves flatter minima by penalizing the worst-case loss in a neighborhood"
- SAMerging: The proposed method that merges models via multi-teacher distillation and SAM to seek flat minima. "SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima."
- Task arithmetic: A merging approach that treats fine-tuned model offsets from pretrained weights as vectors that can be scaled and summed. "One line of methods in model merging is based on the notion of "task arithmetic" [Ilharco et al., 2023], which treats each fine-tuned model's offset from the pretrained weights as a task vector."
- Task vector: The parameter offset of a fine-tuned model from the pretrained weights, used as a direction in merging. "treats each fine-tuned model's offset from the pretrained weights as a task vector."
- TIES-Merging: A merging method designed to resolve interference when combining models. "TIES-Merging [Yadav et al., 2023]"
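Several of the entries above (task vector, task arithmetic, model soups) boil down to simple arithmetic on weight tensors. A minimal sketch with made-up shapes, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_pre = rng.normal(size=(4, 4))                        # pretrained weights
theta_a = theta_pre + rng.normal(scale=0.1, size=(4, 4))   # fine-tuned on task A
theta_b = theta_pre + rng.normal(scale=0.1, size=(4, 4))   # fine-tuned on task B

# Task vectors: offsets of each fine-tuned model from the pretrained point.
tau_a, tau_b = theta_a - theta_pre, theta_b - theta_pre

# Model soup: plain weight averaging of the fine-tuned models.
soup = (theta_a + theta_b) / 2

# Task arithmetic: scale and sum task vectors (lam is a merge coefficient).
lam = 0.5
merged = theta_pre + lam * (tau_a + tau_b)

# With lam = 0.5 and two models, the two constructions coincide.
assert np.allclose(soup, merged)
```

Methods like TIES-Merging and AdaMerging refine this template by pruning interfering task-vector entries or learning the coefficients, rather than fixing lam by hand.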