
EMA Model Soups: Weight Averaging for Robust Models

Updated 20 January 2026
  • Model soups are constructed by averaging weights from multiple fine-tuned models, achieving superior performance without added inference cost.
  • Uniform and greedy soups utilize convex combinations with EMA-smoothed checkpoints to stabilize training and improve generalization.
  • Empirical findings across image classification, NLP, and generative modeling demonstrate increased accuracy, robustness, and adaptive performance.

A model soup is a neural network whose weights are formed by averaging the parameters of multiple models fine-tuned from a common initialization under varying hyperparameters or distinct data subsets, yielding performance often superior to the best constituent model with unchanged inference and memory cost. The most widely studied algorithmic instantiation is the convex combination of model weights, notably uniform and greedy soups. Exponential Moving Averages (EMA) play a complementary role by providing smoothed and more stable candidate ingredients. Model soups have demonstrated efficacy across image classification, natural language processing, adversarial robustness, and generative modeling, notably with theoretical and empirical support for increased accuracy, robustness, and generalization capability (Wortsman et al., 2022, Croce et al., 2023, Biggs et al., 2024).

1. Formulation: Weight Averaging and Soup Construction

Suppose a pre-trained model $\theta_0$ is fine-tuned $k$ times under varying hyperparameters, yielding checkpoints $\theta_i$ for $i = 1, \ldots, k$. A model soup is given by a convex combination:

$$\theta_\text{soup} = \sum_{i=1}^k \alpha_i \theta_i$$

with $\alpha_i \geq 0$ and $\sum_{i=1}^k \alpha_i = 1$. At inference, one uses $f(x; \theta_\text{soup})$ as with any single model, incurring no extra computational cost per sample.

Uniform soups allocate equal weight $\alpha_i = 1/k$, suitable when all models lie in the same loss basin. Greedy soups iteratively add only those models that increase validation accuracy, guaranteeing at least the performance of the best constituent. In diffusion modeling (Biggs et al., 2024), merging checkpoints $W^{(i)}$ trained on sharded data yields

$$W_\text{soup} = \sum_{i=1}^N k_i W^{(i)}, \quad k_i \geq 0, \quad \sum_i k_i = 1$$

with the $k_i$ optimized through grid or greedy selection against a validation set, or through reverse-greedy pruning for refined composition.
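The uniform and greedy procedures above can be sketched in a few lines. This is a minimal, dependency-free illustration: model weights are flat lists of floats, and `val_acc` is a hypothetical callable scoring a weight vector on held-out data (real implementations operate on full checkpoint tensors).

```python
def average(weight_sets):
    """Uniform soup: element-wise mean of the given weight vectors."""
    n = len(weight_sets)
    return [sum(ws) / n for ws in zip(*weight_sets)]

def greedy_soup(candidates, val_acc):
    """Add checkpoints best-first, keeping each only if validation accuracy
    does not drop; the result is at least as good as the best single model."""
    ranked = sorted(candidates, key=val_acc, reverse=True)
    soup = [ranked[0]]                      # start from the best single model
    best = val_acc(average(soup))
    for theta in ranked[1:]:
        trial = average(soup + [theta])
        if val_acc(trial) >= best:          # keep ingredient only if it helps
            soup.append(theta)
            best = val_acc(trial)
    return average(soup)
```

With a toy `val_acc` that rewards proximity to some target weights, the greedy loop admits ingredients that pull the average toward the target and rejects outliers.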

2. Theoretical Insights: Loss Basin Geometry and Distributional Effects

In classification, averaging weights approximates the behavior of logit ensembles when constituent models lie in a flat, connected loss basin. The error gap can be analyzed via the second derivative of loss along the line segment of interpolated weights:

$$L_\text{soup} - L_\text{ens} \approx c_\alpha\left[ -\frac{d^2 L_\text{soup}}{d\alpha^2} + \beta^2\, \mathbb{E}_x \operatorname{Var}_{Y \sim \operatorname{softmax}(\beta f(x;\theta_\alpha))}\bigl(\Delta f_Y(x)\bigr) \right]$$

where $c_\alpha = \alpha(1-\alpha)/2$. Flatness in the loss basin (the term $-\frac{d^2 L_\text{soup}}{d\alpha^2}$ being negative) favors the soup, while low variance in the logits ensures a close match to ensembling (Wortsman et al., 2022).

For generative diffusion models, weight averaging corresponds to sampling from the geometric mean of data densities encoded by constituent checkpoints, by first-order Taylor expansion (Biggs et al., 2024):

$$q(x) \propto \left(\prod_{i=1}^N p^{(i)}(x)\right)^{1/N}$$

This yields anti-memorization properties, mitigating the risk of regenerating rare or confidential training examples exclusive to one shard. For arithmetic mean mixtures, time-dependent reweighting of scores is necessary, which is computationally intensive and not typically pursued.
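The geometric-mean property can be made concrete in one dimension. The following toy check (all values illustrative, not from the paper) uses equal-variance Gaussians $p_i \sim \mathcal{N}(\mu_i, \sigma^2)$: their normalized geometric mean is again a Gaussian of the same variance, centered at the average of the means, so no single shard's mode dominates.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Three "shard" densities with distinct means; mean(mus) = 0.5.
mus, sigma = [-1.0, 0.5, 2.0], 1.0
grid = [i * 0.001 for i in range(-4000, 4001)]          # x in [-4, 4]
geo = [math.prod(gaussian_pdf(x, mu, sigma) for mu in mus) ** (1 / len(mus))
       for x in grid]

# The mode of the geometric mean sits at the average of the shard means.
mode = grid[max(range(len(grid)), key=geo.__getitem__)]
```

For unequal variances the geometric mean is still Gaussian, with precision equal to the average of the constituent precisions; the equal-variance case is chosen here only to keep the arithmetic obvious.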

3. Empirical Findings Across Domains

Image Classification: Greedy soup combining CLIP ViT-B/32 models boosts ImageNet top-1 accuracy from 80.38% (best single) to 81.03%, with corresponding improvements in out-of-distribution robustness. ViT-G/14 greedy soup sets a new state-of-the-art at 90.94% (Wortsman et al., 2022).

Natural Language Processing: Greedy soup improves GLUE benchmark tasks (e.g., BERT-base achieves +0.7pp on RTE, T5-base +0.8pp on RTE).

Generative Modeling: Diffusion Soup on Stable Diffusion checkpoints improves Image Reward from 0.34 (union model) to 0.44 (domain soup) and from 0.37 to 0.59 (aesthetic shards), and raises TIFA from 85.5 to 86.8 (Biggs et al., 2024). Robust unlearning is demonstrated, with only minor metric loss when shards are removed.

Adversarial Robustness: Seasoned model soups allow smooth trade-offs between $\ell_p$-norm defenses, outperforming single defenses at various soup allocations. For instance, robust accuracies against different norm attacks on CIFAR-10 and ImageNet surpass those achieved by constituent or co-trained models; dataset-specific soup adaptation with 100 examples achieves near-optimal robust accuracy across distributional shifts (Croce et al., 2023).

4. Exponential Moving Averages (EMA) in Model Soups

EMA provides smoothed weight trajectories during fine-tuning, with decay factor $\beta$ (e.g., 0.999 or 0.9999999). High-decay EMA yields strong single-model accuracy, whereas low-decay EMA offers superior ingredients for greedy soups. Incorporating EMA-smoothed weights as soup ingredients introduces flatter-basin solutions, stabilizing interpolation and improving the geometric conditions underpinning the theoretical approximation.
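The EMA update itself is a one-line recurrence: after each optimizer step, the current weights are blended into a shadow copy with decay $\beta$, and the shadow becomes a candidate soup ingredient. A minimal sketch, with weights as plain float lists for illustration:

```python
def ema_update(shadow, weights, beta=0.999):
    """One EMA step: shadow <- beta * shadow + (1 - beta) * weights."""
    return [beta * s + (1 - beta) * w for s, w in zip(shadow, weights)]

# Smooth a short, noisy weight trajectory (values are illustrative).
shadow = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.2, 1.8], [0.8, 2.2]):
    shadow = ema_update(shadow, step_weights, beta=0.9)
```

A higher $\beta$ makes the shadow respond more slowly to each step, which is why high-decay EMA tracks a smoother (flatter-basin) trajectory than the raw snapshots.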

Strategies for EMA use in soups:

  • Construct soups from EMA weights rather than raw snapshots.
  • Apply EMA post-hoc to collections of soups (“soup-of-soups”).
  • Optimize mixing of raw and EMA checkpoints for enhanced diversity, using regularization to counteract over-parameterization (Wortsman et al., 2022).

Editor’s term: “EMA ingredients” refers to the deployment of EMA-smoothed checkpoints as candidate models for souping.

5. Robustness, Continual Learning, and Unlearning

Model soups support modular adaptation: addition of new threat-specialized or domain-specialized checkpoints without retraining, and immediate reweighting for desired trade-off. Removal of ingredients (unlearning) is achieved by adjusting the affine weights, preserving performance and stability in generative and classification domains (Biggs et al., 2024). Few-shot grid search on new distributions or threats allows for rapid construction of dataset-specific soups, leveraging alignment of weights in a shared pre-trained basin.
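Unlearning by coefficient adjustment can be sketched directly: removing one shard's checkpoint amounts to dropping its coefficient, renormalizing the rest, and re-averaging. The helper below is illustrative (checkpoints as flat float lists, names hypothetical), not the authors' implementation:

```python
def resoup(checkpoints, coeffs, remove_idx):
    """Drop one ingredient, renormalize the remaining coefficients to sum
    to 1, and return the re-weighted soup."""
    kept = [(w, k) for i, (w, k) in enumerate(zip(checkpoints, coeffs))
            if i != remove_idx]
    total = sum(k for _, k in kept)
    dim = len(checkpoints[0])
    return [sum((k / total) * w[j] for w, k in kept) for j in range(dim)]
```

Because only the affine weights change, no gradient computation or retraining is involved; the operation costs one pass over the stored checkpoints.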

Adversarially robust soups also incorporate extrapolation, where some coefficients may be negative or exceed unity; such combinations lie outside the convex simplex and yield improved robustness in certain regions beyond what convex mixtures achieve (Croce et al., 2023).

6. Implementation and Practical Considerations

Soups are constructed by parameter-wise averaging over PyTorch state_dicts and require only a single pass over model weights. Final soup checkpoints are used for inference identically to any single model, with no increase in compute or memory. Public implementations are provided (Wortsman et al., 2022). Practical hyperparameters include the step size for grid search (commonly $\Delta\alpha_i = 0.2$), the number of epochs for fine-tuning constituents (typically 1–10), and the selected threat budgets for robustness evaluation.
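The parameter-wise averaging over state_dicts is straightforward. In practice the entries are `torch.Tensor`s; plain float lists stand in here so the sketch stays dependency-free, and it assumes all checkpoints share identical keys and shapes:

```python
def soup_state_dicts(state_dicts, coeffs=None):
    """Key-by-key convex combination of checkpoints stored as dicts mapping
    parameter names to flat float lists (torch.Tensors in real use)."""
    n = len(state_dicts)
    coeffs = coeffs or [1.0 / n] * n                  # default: uniform soup
    return {
        key: [sum(c * sd[key][j] for c, sd in zip(coeffs, state_dicts))
              for j in range(len(state_dicts[0][key]))]
        for key in state_dicts[0]
    }
```

With tensors, the inner loop collapses to `sum(c * sd[key] for ...)` per key; the single pass over weights is what keeps soup construction cheap relative to ensembling.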

Memory overhead must be considered when storing both raw and EMA checkpoints, and selection of the $\beta$ hyperparameter remains empirical. Model soup theory assumes constituent weights lie in a “tangent” space around a shared initialization, and efficacy may diminish if constituent checkpoints are highly divergent.

7. Limitations and Open Directions

  • Training cost is linear with the number of soup constituents; no reduction in pre-soup training burden.
  • Averaging is global; context-dependent or adaptive reweighting at inference requires dynamic recomputation.
  • Theoretical guarantees on geometric mean sampling are local and may not hold far from the common basin.
  • Extension of model soup methodology to additional modalities (audio, language) and use as a regularizer remains open.

A plausible implication is that further investigation into geometric mean regularization and local loss landscape flatness may reveal new unsupervised or cross-modal benefits.


Model soups, including EMA variants and applications in diffusion models, represent a principled, computationally efficient framework for model integration, robustification, and adaptation. Their success relies on the geometry of the loss landscape coupled with aligned fine-tuning procedures, and their scope covers a spectrum from pure classification accuracy to generalization across domains, adversarial security, and even anti-memorization properties in generative modeling.
