Model Merging via Multi-Teacher Knowledge Distillation
Abstract: Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing such theoretical guarantees is non-trivial, as the merging process typically precludes access to the original training data and involves combining fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This dependence is most critical in coefficient scaling, the weighting factors that modulate the magnitude of each fine-tuned model's contribution to the shared parameters. Without a principled objective to guide their selection, these methods exhibit brittle performance and high sensitivity to scaling initialization. We address this gap by (i) establishing a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting. This analysis introduces a "cross-task heterogeneity" term that formally captures the mismatch between diverse fine-tuned model priors and the target multi-task distributions. Guided by this theoretical insight, (ii) we frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data. We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk. Building on the flatness-aware bound, (iii) we operationalize this objective via SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima. Empirically, SAMerging establishes a new state of the art across vision and NLP benchmarks while requiring as few as 16 unlabeled samples per task. The code is available at https://github.com/arshandalili/SAMerging.
Explain it Like I'm 14
Overview
This paper is about a smart way to combine several trained AI models into one model that can do many tasks well. Instead of retraining a huge model on all tasks together (which needs lots of data, time, and computing), the authors show how to “merge” existing task-specific models so the final model behaves like a good all-rounder. They also explain, with theory, why their merging method should work and how to make it reliable.
Key Objectives
The paper asks three simple questions:
- How can we safely merge different fine-tuned models into one multi-task model without needing the original training data?
- What rules or goals should we use when merging so the final model performs well on all tasks?
- Can we design a method that is both accurate and data-efficient, using only a small number of unlabeled samples per task?
Methods and Approach
The main idea in everyday terms
- Think of each task-specific model as a “teacher” and the merged model as a “student.” The student should learn to imitate each teacher on that teacher’s task.
- The authors use a technique called knowledge distillation. The student looks at the teachers’ “soft” predictions (how likely each answer is) and tries to match them. This matching is measured by KL divergence, which is a way to check how different two sets of predictions are. You can think of it like comparing two pie charts: KL divergence is larger if the slices (probabilities) are very different.
- They also use an optimizer called SAM (Sharpness-Aware Minimization). Imagine the loss surface as a bowl around the model’s parameters. A “sharp” bowl is like a steep, narrow valley—small changes can make performance drop quickly. A “flat” bowl is wide and gentle—small changes don’t hurt much. SAM nudges training toward flatter bowls, which often means better generalization to new data.
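The pie-chart comparison can be made concrete in a few lines of Python. This is a toy illustration (the class probabilities below are made up, and this is not the paper's implementation):

```python
import math

def kl_divergence(p, q):
    """KL(p || q): how much q's 'pie chart' deviates from p's."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.7, 0.2, 0.1]          # teacher's soft prediction over 3 classes
student_close = [0.6, 0.25, 0.15]  # slices similar to the teacher's
student_far = [0.1, 0.2, 0.7]      # slices very different from the teacher's

# Matching predictions give zero divergence; mismatched ones give a large value.
assert kl_divergence(teacher, teacher) == 0.0
assert kl_divergence(teacher, student_close) < kl_divergence(teacher, student_far)
```

Training the student to lower this number on each teacher's task is exactly the "imitate the teacher" step described above.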
Why theory matters here
- Merging models is tricky because the models were trained on different data and may not perfectly agree. The authors give a theoretical guarantee (a PAC-Bayes bound) that connects generalization (doing well on new data) to two things: 1) how flat the merged model's loss bowl is, and 2) how different the tasks are from each other (called cross-task heterogeneity).
- This bound says: if we make the merged model sit in a flat region and we align it well with each teacher, we can expect lower error across tasks.
- They also show that directly minimizing the student–teacher KL divergence reduces the merged model’s excess risk (extra errors beyond the best possible), giving a clear, principled target to optimize.
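For readers curious what such a guarantee looks like, the classic PAC-Bayes bound that this family of analyses builds on has the following textbook (McAllester-style) form; the paper's actual theorem additionally carries flatness and cross-task heterogeneity terms not shown here:

```latex
% With probability at least 1 - \delta over an i.i.d. sample S of size n,
% simultaneously for every posterior Q over parameters and a fixed prior P:
L_{\mathcal{D}}(Q) \;\le\; \widehat{L}_{S}(Q)
  \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

Intuitively, test error is bounded by training error plus a complexity penalty that shrinks when the posterior Q stays close to the prior P, which is what concentrating in a broad, flat low-loss region achieves.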
How the method works in practice
- Use a few unlabeled examples from each task (as few as 16 per task).
- For each task, feed these samples into the teacher model and the student (merged) model.
- Minimize the KL divergence between the student’s predictions and the teacher’s predictions.
- Apply SAM during optimization to find flatter solutions, improving stability and generalization.
- The final merged model has no extra parts or overhead at test time—it’s just one model that can handle all tasks.
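The steps above can be mocked up end to end. The following is a toy numpy sketch, not the paper's implementation: linear maps stand in for networks, finite differences stand in for autograd, and the shapes, learning rate, and SAM radius rho are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # Mean forward KL(p || q) over a batch of rows.
    return np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1))

# Toy stand-ins: linear "models" over 4 classes instead of full networks.
d, k, T = 8, 4, 2
theta_pre = rng.normal(size=(d, k))                       # pretrained weights
task_vecs = [rng.normal(scale=0.1, size=(d, k)) for _ in range(T)]
teachers = [theta_pre + tv for tv in task_vecs]           # fine-tuned teachers
data = [rng.normal(size=(16, d)) for _ in range(T)]       # 16 unlabeled samples/task

coeffs = np.full(T, 0.3)  # merging coefficients to calibrate

def distill_loss(c):
    # Student = pretrained weights + coefficient-weighted sum of task vectors.
    student = theta_pre + sum(ci * tv for ci, tv in zip(c, task_vecs))
    return np.mean([kl(softmax(data[t] @ teachers[t]), softmax(data[t] @ student))
                    for t in range(T)])

def grad(c, eps=1e-4):
    # Finite differences stand in for autograd in this toy.
    g = np.zeros_like(c)
    for i in range(len(c)):
        e = np.zeros_like(c); e[i] = eps
        g[i] = (distill_loss(c + e) - distill_loss(c - e)) / (2 * eps)
    return g

rho, lr = 0.05, 0.5
loss0 = distill_loss(coeffs)
for _ in range(50):
    g = grad(coeffs)
    adv = coeffs + rho * g / (np.linalg.norm(g) + 1e-12)  # SAM: ascend to worst case
    coeffs = coeffs - lr * grad(adv)                      # descend using that gradient
```

The two-step update is the essence of SAM: first perturb the coefficients in the direction that worsens the loss most (within radius rho), then apply the gradient computed at that perturbed point, so the solution ends up in a flatter bowl.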
Main Findings
Here are the main results and why they matter:
- The proposed method, called SAMerging, achieves state-of-the-art performance across image and language benchmarks compared to previous merging methods.
- It needs far fewer unlabeled examples per task than other approaches (for example, 16 instead of thousands), making it very data-efficient.
- It is robust to tricky details like how you scale or initialize merging coefficients—its performance is stable.
- It works across different backbones (like CLIP ViT for vision and GPT-2/DeBERTa for language), showing it’s broadly useful.
- The loss landscape visualizations show SAMerging finds flatter “bowls,” supporting the theoretical claim that flatter solutions generalize better.
Implications and Impact
- For teams that have multiple specialized models and want one unified model without retraining on all data, SAMerging offers a practical, low-cost solution.
- Because it only needs a small set of unlabeled samples per task, it fits well in settings where data sharing is restricted (privacy or legal concerns), like federated or on-device learning.
- The method avoids extra test-time costs—no separate adapters or per-task heads—so it’s suitable for deployment on resource-limited devices.
- The theory provides a roadmap: aim for flat minima and reduce student–teacher divergence to improve generalization in model merging.
- Limitations remain in very hard cases, such as tasks that strongly interfere with each other or have conflicting label spaces, and the theory relies on local linear approximations. Still, the approach is a strong step toward reliable, data-efficient model merging.
Knowledge Gaps
Below is a single, focused list of what remains missing, uncertain, or unexplored in the paper, phrased to be concrete and actionable for follow-up research.
- Validate the PAC-Bayes/NTK-based generalization guarantees beyond local linearization: quantify how far the merged parameters can move from the pretrained point before NTK assumptions break, and derive bounds that remain non-vacuous without NTK or with weaker smoothness/convexity requirements on the loss.
- Tighten or refine the cross-task heterogeneity term: provide practical estimators for the heterogeneity quantity on unlabeled calibration data, and develop diagnostics that predict when merging will fail (e.g., thresholds to exclude tasks or downweight teachers).
- Prior selection and posterior modeling: study how the choice of data-free prior P (mean and covariance) and Gaussian posterior assumptions affect bound tightness and algorithm behavior; propose ways to learn or approximate these distributions from available checkpoints.
- Bridge mixture-posterior analysis to deterministic merged weights: quantify the mismatch between the theoretical mixture posterior and the actual deterministic, layer-wise parameter combination; develop bounds or algorithms that directly cover the deterministic merging procedure.
- Coefficient learning and alignment with evaluation mixture: investigate methods to learn or estimate task weights α (evaluation mixture) from calibration data and to adapt β (analysis/merging mixture) accordingly, especially under unknown or shifting task prevalences.
- Robustness to poor or miscalibrated teachers: design teacher-weighting or teacher-selection strategies that downweight unreliable fine-tuned models, and analyze how teacher error (the second term in Theorem 3) can be reduced or bounded in practice.
- Alternatives to forward KL for multi-teacher distillation: compare divergences (reverse KL, Jensen–Shannon, α-divergences, MMD), temperature scaling, label smoothing, and confidence calibration to reduce brittleness when teachers disagree or are miscalibrated.
- Sample efficiency and active calibration: derive sample-complexity guarantees linking the number of unlabeled examples per task to excess-risk reduction; explore active selection or data synthesis (e.g., generative augmentation) to optimize calibration under severe data scarcity.
- Hyperparameter sensitivity and stability: systematically study the impact of SAM radius ρ, layer-wise coefficient initialization, learning rates, and temperature; provide principled tuning rules or adaptive schemes tied to the theoretical bound’s terms.
- Flatness proxies at calibration-time: evaluate lighter-weight flatness proxies (e.g., Hessian trace/diagonal approximations, stochastic sharpness measures) that reduce SAMerging’s calibration-time cost, and quantify their trade-offs in bound-tightness and accuracy.
- Handling extreme task heterogeneity: extend and stress-test SAMerging under heavy domain shift, conflicting or overlapping label spaces, multi-label classification, and imbalanced task difficulties; propose task-specific constraints or gating mechanisms to mitigate negative transfer.
- Multi-modal and cross-architecture merging: assess generalization when merging models across modalities (e.g., vision–language–audio) or different architectures/backbones; define conditions under which cross-architecture merging is theoretically and empirically feasible.
- Beyond classification: extend the framework to generative, sequence-to-sequence, detection, segmentation, and structured prediction tasks; adapt the distillation objective and bounds to non-0/1 losses and non-softmax output distributions.
- Runtime and memory characterization: provide detailed measurements of calibration-time overhead, inference latency, and memory footprint across hardware; identify bottlenecks and propose engineering optimizations for practical deployment.
- Layer-wise vs. block-/module-wise merging: compare granularity of coefficient learning (layer, block, attention module, LoRA adapter) and quantify how granularity affects heterogeneity terms, flatness, and empirical performance.
- Weight-space symmetry and permutation issues: investigate whether permutation alignment (e.g., neuron matching) prior to merging further reduces heterogeneity and improves flatness, and formalize its effect on the bound.
- Safety, robustness, and fairness: evaluate adversarial robustness, uncertainty calibration, OOD performance, and fairness across tasks post-merging; study whether flatter minima from SAMerging improve these properties or require additional regularization.
- Adaptive task inclusion/exclusion: develop algorithms that automatically exclude or softly gate tasks that harm the merged model (based on heterogeneity estimates or validation losses), and provide theoretical guarantees for such gating.
- Integration with parameter-efficient fine-tuning: quantify interactions between SAMerging and LoRA/adapters; assess whether post-hoc merging of PEFT modules can retain O(1) inference cost and still satisfy the bound’s assumptions.
- Theoretical constants and margin conditions: replace coarse constants in the excess-risk bounds (Theorem 3) with tighter, margin-aware analyses (e.g., Tsybakov conditions), and study when these refinements yield actionable improvements in practice.
- Task-weight learning under deployment dynamics: propose online estimation or reweighting of α as task arrival rates and domain distributions change over time, with guarantees on regret or stability in continual/streaming settings.
- Diagnostics for NTK-faithful neighborhoods: provide practical tools to measure dispersion and kernel-weighted penalties highlighted in Theorem 2 on real models, guiding when to enforce distance-to-pretrain constraints or coefficient regularization.
- Comparative evaluation breadth: expand baselines to include strong adapter/ensemble approaches with careful accounting of inference overhead; test larger LLMs and more diverse datasets to assess scalability and generality of SAMerging.
Practical Applications
Immediate Applications
The following applications can be deployed now using the paper’s method (SAMerging) and insights, given access to fine-tuned checkpoints and small unlabeled calibration sets per task.
- Healthcare: Consolidate multiple departmental classifiers (e.g., radiology modality identification, pathology slide triage, dermatology condition screening) into a single on-prem model to cut inference latency and memory on hospital GPUs.
- Sector: healthcare
- Tools/products/workflows: “Federated Model-Merger” service that ingests task-specific checkpoints and performs SAMerging on a small unlabeled buffer; hospital MLOps pipeline adds a “Merge & Distill” stage before deployment; risk report using the cross-task heterogeneity score
- Assumptions/dependencies: access to fine-tuned checkpoints; 16–1600 unlabeled samples per task; current scope is classification; label spaces per task shouldn’t conflict; licenses allow weight consolidation
- Content moderation and creator platforms: Replace ensembles of CLIP-based classifiers (nudity, violence, hate symbols, spam thumbnails) with a single merged model to reduce costs and improve responsiveness.
- Sector: software/media
- Tools/products/workflows: Hugging Face integration that wraps SAMerging as a “multitask consolidator” for CLIP variants; batch calibration on unlabeled platform data; automatic coefficient learning with SAM
- Assumptions/dependencies: small unlabeled samples represent platform distribution; tasks are classification; SAM adds one-time calibration compute
- E-commerce catalog intelligence: Merge vision and text classifiers for attribute tagging (color, material, style) and quality filters into a unified model that runs at ingestion time.
- Sector: retail/e-commerce
- Tools/products/workflows: “Edge Multi-Task SDK” for warehouse devices and ingestion pipelines; CI/CD merge job; periodic re-calibration with fresh unlabeled samples
- Assumptions/dependencies: consistent encoder backbones across tasks; tasks are compatible classification problems; sufficient unlabeled item images/text
- Manufacturing quality inspection: Unify multiple line-specific defect detectors into a single model to run on embedded cameras with O(1) memory/inference cost.
- Sector: industrial/robotics
- Tools/products/workflows: on-device SAMerging using small unlabeled buffers from each line; coefficient regularization to stay near the pretrained region (NTK-faithful)
- Assumptions/dependencies: compatible backbones; unlabeled frames per line; performance depends on flatness of fine-tuned models and low cross-task heterogeneity
- Security and surveillance: Merge object, action, and scene classifiers into one camera model for analytics on NVRs or edge gateways.
- Sector: public safety/IoT
- Tools/products/workflows: “Surveillance Merger” appliance that periodically calibrates on unlabeled footage; deployment of a single backbone per device
- Assumptions/dependencies: classification scope; unlabeled footage availability; legal compliance for model combination and calibration data use
- Back-office NLP: Consolidate intent detection, sentiment analysis, spam filtering, topic routing into a single GPT-2/DeBERTa-based classifier for email/helpdesk queues.
- Sector: enterprise software
- Tools/products/workflows: MLOps plugin providing layer-wise merging; KD loss with task-wise batches; monitoring of the heterogeneity term to gate merges
- Assumptions/dependencies: similar model architecture across tasks; unlabeled text buffers; tasks not generative; domain drift monitored
- Finance operations: Merge document classifiers (KYC doc type, AML risk flags), ticket triage, and complaint sentiment into one model for back-office automation.
- Sector: finance
- Tools/products/workflows: “Model Consolidation Service” in bank MLOps; excess-risk summaries (from KL fit term) for internal audit
- Assumptions/dependencies: classification; strict license/compliance checks; calibration on unlabeled internal corpora; risk acceptance based on bound reports
- Edge devices and mobile: Ship a single on-device model that handles multiple perception tasks (scene type, quality score, blur detection, OCR triage) in camera apps, reducing app memory and battery drain.
- Sector: consumer/daily life
- Tools/products/workflows: SDK offering SAMerging as an offline step with user-opt-in local photos for unlabeled calibration; dynamic coefficient initialization that minimizes sensitivity
- Assumptions/dependencies: small local unlabeled buffers; privacy-preserving on-device calibration; tasks are classification; compute headroom for one-time merge
- Federated learning aggregators: Merge client-specific fine-tuned models without sharing raw data, using only small unlabeled calibration batches from the federation’s target domains.
- Sector: privacy/federated AI
- Tools/products/workflows: “Federated Merger” server role combining client checkpoints via multi-teacher KD; coefficient assignment that down-weights sharp or outlier models
- Assumptions/dependencies: client model compatibility; unlabeled target-domain samples centrally available; governance for client model licensing and privacy
- MLOps and research infrastructure: Introduce a heterogeneity-aware merge stage that auto-tunes layer-wise coefficients with SAM and produces bound-derived diagnostics (flatness proxies, mixture gaps).
- Sector: academia/industry research
- Tools/products/workflows: PAC-Bayes “Heterogeneity Monitor” to predict mergeability; visualization of loss landscape stability; ablation reports (SAM vs no-SAM; KL vs entropy)
- Assumptions/dependencies: shared pretrained base; local NTK approximation reasonable (merge stays near base); reproducible access to task vectors/checkpoints
Long-Term Applications
These applications require further research, scaling, or development beyond the current classification-focused scope and/or assumptions in the paper.
- Generative model merging (LLMs and diffusion): Extend SAMerging and the excess-risk guarantees to autoregressive and generative tasks (summarization, code generation, image synthesis), enabling a single model to retain specialized generative capabilities across domains.
- Sector: software/creative tools/education
- Potential product/workflow: “Generative Merger” with sequence-level KD and flatness-aware objectives; task-aware prompts and adapters replaced by unified backbone
- Assumptions/dependencies: new theory for generative losses; careful handling of label spaces and prompts; evaluation beyond classification; compute for calibration
- Multi-label and conflicting label-space merging: Develop methods to handle overlapping or conflicting label sets and multi-label classification where current bounds and objectives may need adaptation.
- Sector: healthcare/media/security
- Potential product/workflow: label-space alignment tools; constraint-aware coefficient learning to mitigate negative transfer
- Assumptions/dependencies: extended PAC-Bayes terms for multi-label; robust strategies for domain shift and label conflicts
- Dynamic, inference-time coefficient adaptation: Adjust merge coefficients online using small unlabeled buffers to counter domain drift (e.g., seasonal catalog changes, new content trends).
- Sector: e-commerce/media
- Potential product/workflow: “Adaptive Merger Runtime” with periodic SAM-enhanced micro-calibrations; drift detectors trigger re-merge
- Assumptions/dependencies: budgeted on-device/server compute; stable safety/latency under frequent merges; robust monitoring of heterogeneity
- Continual and personalized federated learning: Periodically merge personalized client models (per user/site) into a global backbone, maintaining privacy while improving global generalization.
- Sector: privacy/edge AI/healthcare
- Potential product/workflow: federated orchestration that alternates local fine-tuning with central SAMerging; client-side personalization through small local KD
- Assumptions/dependencies: fairness and personalization metrics; licenses permitting checkpoint sharing; secure handling of unlabeled calibration buffers
- Cross-modal and multimodal assistants (MLLMs): Merge specialized models across text, image, and audio tasks into one multimodal assistant backbone with O(1) inference overhead.
- Sector: robotics/education/enterprise support
- Potential product/workflow: modality-aware KD loss; cross-modal flatness proxies; unified inference pipeline replacing adapters/heads
- Assumptions/dependencies: compatible multimodal backbones; new bounds capturing cross-modal heterogeneity; larger-scale calibration datasets
- Safety, fairness, and regulatory auditing: Use the paper’s heterogeneity term and bound components as quantitative signals in AI risk assessments (e.g., to justify consolidation without data sharing).
- Sector: policy/regulation
- Potential product/workflow: standardized “Merge Risk Report” for audits summarizing flatness, dispersion, mixture gaps; policy templates encouraging data-minimization via unlabeled calibration
- Assumptions/dependencies: accepted methodologies for bound reporting; governance for model provenance and licensing; sector-specific compliance criteria
- Carbon and cost accounting for AI: Institutionalize energy and cost savings from replacing n models with a single merged backbone, informing green AI procurement and SLAs.
- Sector: energy/datacenter operations
- Potential product/workflow: “Merge-to-Green” calculators correlating O(1) inference cost with kWh and dollars saved; procurement policies favoring consolidation
- Assumptions/dependencies: reliable measurements of workload characteristics; agreement on reporting standards; validation under varying task loads
- Adapter-less enterprise AI platform: Offer a managed service that ingests diverse fine-tuned models from different teams and emits one audited, heterogeneity-aware merged model.
- Sector: enterprise software
- Potential product/workflow: multi-tenant model registry with merge governance; pre-merge simulation using bound-driven criteria; rollback and A/B tools
- Assumptions/dependencies: inter-team alignment on backbones and licenses; strong observability; escalation paths when heterogeneity is high
- On-device personalization for end-users: Merge vendor models with per-user fine-tunes locally using only unlabeled personal data (e.g., camera gallery, notes), producing a single personalized backbone.
- Sector: consumer/daily life
- Potential product/workflow: “Personal Merge” app; privacy-preserving local KD; periodic lightweight SAM calibration
- Assumptions/dependencies: on-device compute headroom; secure storage; user consent; safeguards to avoid overfitting or bias
- Research extensions and curricula: Formalize mergeability diagnostics, coefficient regularizers, and lighter flatness proxies that reduce SAM overhead, and integrate into ML courses and toolkits.
- Sector: academia
- Potential product/workflow: educational modules on PAC-Bayes and model merging; benchmark suites expanding beyond classification; open-source libraries implementing heterogeneity-aware merges
- Assumptions/dependencies: community adoption; reproducibility across tasks and backbones; broader empirical validation beyond NTK-local regimes
Glossary
- AdaMerging: A data-dependent model merging method that learns per-layer or per-task coefficients by optimizing a confidence heuristic. "AdaMerging learns (layer-/task-wise) merge coefficients by minimizing entropy [Yang et al., 2023]."
- Bayes optimal risk: The minimum possible classification error achievable by any classifier given the true conditional label distribution. "and the Bayes optimal risk by R^{0-1,*} := E_{(x,y)~D}[1 - max_y s(y | x)]."
- Cross-task heterogeneity: A term quantifying how much a fine-tuned model’s performance deteriorates when applied to a different task’s distribution. "This analysis introduces a "cross-task heterogeneity" term that formally captures the mismatch between diverse fine-tuned model priors and the target multi-task distributions."
- Entropy minimization: An optimization heuristic that encourages a model to produce confident (low-entropy) predictions, not necessarily aligned with correctness. "methods such as AdaMerging [Yang et al., 2023] use entropy minimization without an explicit excess-risk guarantee."
- Evaluation mixture: A weighted combination of task distributions used to evaluate multi-task performance. "Evaluation mixture. Let α ∈ Δ^{T-1} denote the weights of the evaluation mixture across tasks."
- Excess risk: The gap between a model’s error and the Bayes optimal error; used to assess how much worse a model is than the best possible. "We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk."
- Fisher Information Matrix (FIM): A matrix capturing the curvature of the loss landscape with respect to model parameters; used to weight parameter merging. "Fisher Merging estimates Fisher information from gradients on unlabeled data to make the Fisher Information Matrix (FIM) [Matena and Raffel, 2022]"
- Fisher Merging: A model merging technique that uses the Fisher Information Matrix to weight parameters during averaging. "Fisher Merging estimates Fisher information from gradients on unlabeled data to make the Fisher Information Matrix (FIM) [Matena and Raffel, 2022] and uses that as the weight for merging"
- Gaussian posterior: A Bayesian posterior distribution over model parameters modeled as a Gaussian. "for all Gaussian posteriors Q_t = N(μ_t, Σ_t)"
- Inductive bias: Prior assumptions embedded in a learning algorithm that guide generalization across tasks. "enables knowledge transfer with inductive bias and shared representations [Caruana, 1997, Baxter, 2000, Wu et al., 2023]."
- Isotropic Merging: A data-free merging approach that aims to balance parameter contributions uniformly across directions. "Isotropic Merging [Marczak et al., 2025]"
- Kullback-Leibler (KL) divergence: A measure of how one probability distribution diverges from another; used to align student and teacher distributions. "We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk."
- Label-free calibration: Adapting a merged model using unlabeled data to improve multi-task performance without incurring inference overhead. "We thus trade minimal label-free calibration for zero inference overhead."
- Loss basin: A region in parameter space around a solution with similar loss values; flatter basins correlate with better generalization. "flatness of its loss basin"
- Loss landscape: The shape of the loss function over parameter space, including minima and sharpness properties. "Flatter loss landscapes have long been associated with better generalization"
- Mixture-of-experts: An architecture that routes inputs to specialized expert subnetworks and combines their outputs. "mixture-of-experts to learn which experts to share per task [Hazimeh et al., 2021, Tang et al., 2020]."
- Mixture posterior: A randomized predictor formed by mixing task-specific posteriors according to weights. "We analyze a mixture posterior Q_merge := Σ_t β_t Q_t,"
- Model merging: The process of combining multiple task-specific fine-tuned models into a single multi-task model without joint training. "Model merging seeks to combine fine- tuned models into a single model"
- Model soups: Simple weight-averaging of multiple fine-tuned models to improve accuracy without extra inference cost. "simple weight averaging/model soups [Wortsman et al., 2022]"
- Multi-task learning (MTL): Training a single model on multiple tasks simultaneously to leverage shared representations. "Joint training for multi-task learning (MTL) aggregates data from different tasks to learn them jointly."
- Multi-teacher knowledge distillation: Compressing several teacher models into a single student by matching their output distributions. "We cast model merging as multi-teacher knowledge distillation"
- Neural Tangent Kernel (NTK): A kernel that approximates neural network behavior via linearization around a reference point. "NTK approximation as done in Jacot et al. [2020]"
- Non-vacuous bounds: Generalization bounds that provide meaningful (non-trivial) guarantees rather than loose or uninformative ones. "yielding non-vacuous bounds [Neyshabur et al., 2017, Petzka et al., 2021, Dziugaite and Roy, 2017]."
- PAC-Bayes generalization bound: A probabilistic bound on generalization error leveraging Bayesian priors/posteriors over parameters. "we establish a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting."
- Pareto-optimal trade-offs: Solutions in multi-objective optimization where improving one objective would worsen another. "seek Pareto-optimal trade-offs with convergence guarantees and controllable preferences [Lin et al., 2019, Shamsian et al., 2023]."
- Posterior concentration: The phenomenon of a posterior distribution placing most mass in a region of low loss, aiding tighter bounds. "such that when the weight posterior concentrates in a broad low-loss region, complexity terms shrink, yielding non-vacuous bounds"
- Prior: A distribution over parameters before observing data, used in PAC-Bayes analysis. "Let P = N(μ_P, Σ_P) be a data-free prior on Θ"
- Posterior: A distribution over parameters after observing data; task-dependent in model merging. "for each task t let Q_t = N(μ_t, Σ_t) be a task-dependent posterior."
- RegMean: A data-dependent merging method that regularizes averaging using feature inner-product Gram matrices. "RegMean/RegMean++ compute feature inner-product Gram matrices to regularize averaging"
- RegMean++: An enhanced version of RegMean with improved regularization for merging. "RegMean/RegMean++ compute feature inner-product Gram to regularize averaging"
- Sharpness-Aware Minimization (SAM): An optimization method that penalizes worst-case loss in a local neighborhood to find flatter minima. "Sharpness-Aware Minimization (SAM) [Foret et al., 2021] achieves flatter minima by penalizing the worst-case loss in a neighborhood"
- SAMerging: The proposed method that merges models via multi-teacher distillation and SAM to seek flat minima. "SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima."
- Task arithmetic: A merging approach that treats fine-tuned model offsets from pretrained weights as vectors that can be scaled and summed. "One line of methods in model merging is based on the notion of "task arithmetic" [Ilharco et al., 2023], which treats each fine-tuned model's offset from the pretrained weights as a task vector."
- Task vector: The parameter offset of a fine-tuned model from the pretrained weights, used as a direction in merging. "treats each fine-tuned model's offset from the pretrained weights as a task vector."
- TIES-Merging: A merging method designed to resolve interference when combining models. "TIES-Merging [Yadav et al., 2023]"
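Several of the entries above (task vector, task arithmetic, model soups) boil down to simple arithmetic on weight tensors. A minimal sketch with made-up shapes, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_pre = rng.normal(size=(4, 4))                        # pretrained weights
theta_a = theta_pre + rng.normal(scale=0.1, size=(4, 4))   # fine-tuned on task A
theta_b = theta_pre + rng.normal(scale=0.1, size=(4, 4))   # fine-tuned on task B

# Task vectors: offsets of each fine-tuned model from the pretrained point.
tau_a, tau_b = theta_a - theta_pre, theta_b - theta_pre

# Model soup: plain weight averaging of the fine-tuned models.
soup = (theta_a + theta_b) / 2

# Task arithmetic: scale and sum task vectors (lam is a merge coefficient).
lam = 0.5
merged = theta_pre + lam * (tau_a + tau_b)

# With lam = 0.5 and two models, the two constructions coincide.
assert np.allclose(soup, merged)
```

Methods like TIES-Merging and AdaMerging refine this template by pruning interfering task-vector entries or learning the coefficients, rather than fixing lam by hand.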