Category-Expert Souping (SoCE): Efficient Integration
- Category-Expert Souping (SoCE) is a method that integrates specialized models using non-uniform, data-driven weight averaging to optimize performance across weakly correlated task categories.
- It selects domain experts based on per-category performance and employs a grid search over weight vectors to construct a robust combined model.
- Empirical results show that SoCE outperforms uniform model souping and standard ensembles, achieving state-of-the-art results on benchmarks such as BFCL and MGSM.
Category-Expert Souping (SoCE) is a principled approach for constructing high-performing models by leveraging the specialization of domain or category experts and combining them through non-uniform, data-driven weight averaging. The SoCE paradigm emerged as a resource- and compute-efficient alternative to traditional approaches such as uniform model souping and model ensembling, aiming to maximize overall performance and robustness in multi-category machine learning settings, especially for large language models (LLMs).
1. Motivation and Empirical Observations
Traditional model souping—uniformly averaging the weights of several models—was motivated by empirical findings that such merging can sometimes improve generalization, but often yields modest gains or even negative transfer, especially when individual models are best-in-class for only certain sub-tasks. The SoCE approach was developed in response to two key empirical findings:
- Task specialization in models: Fine-tuned models often become “experts” on specific sub-categories of a benchmark, excelling in narrow functions but underperforming elsewhere.
- Low inter-category performance correlation: On composite benchmarks, the performance of models across task categories often shows weak or negative correlations. For example, on the Berkeley Function Calling Leaderboard (BFCL), the "Multi-turn-base" and "Live Accuracy" categories have a Pearson correlation close to zero, indicating nearly orthogonal competencies. In contrast, closely related categories (e.g., multi-turn function-calling variants) have high positive correlations (up to $0.98$) (Maiti et al., 17 Nov 2025).
Recognizing and exploiting the weak correlation structure enables SoCE to assemble a model from mutually complementary experts, rather than diluting task-specific strength through uniform averaging.
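The correlation-analysis step can be sketched on synthetic data; the score matrix, category count, and threshold $\tau = 0.3$ below are illustrative assumptions, not values from the paper.

```python
# Sketch: identifying weakly correlated benchmark categories from a
# models x categories score matrix (synthetic, illustrative data).
import numpy as np

def weakly_correlated_pairs(scores, tau=0.3):
    """scores: (n_models, m_categories) per-category accuracies.
    Returns category index pairs whose Pearson correlation across
    models falls below tau (near-orthogonal competencies)."""
    corr = np.corrcoef(scores, rowvar=False)  # m x m category correlations
    m = corr.shape[0]
    return [(i, j) for i in range(m) for j in range(i + 1, m)
            if corr[i, j] < tau]

# Toy pool of 6 models over 3 categories: categories 0 and 1 track each
# other closely, while category 2 varies independently of both.
scores = np.array([
    [0.10, 0.12, 0.90],
    [0.30, 0.28, 0.20],
    [0.50, 0.52, 0.70],
    [0.60, 0.61, 0.10],
    [0.80, 0.79, 0.60],
    [0.90, 0.91, 0.30],
])
print(weakly_correlated_pairs(scores))  # → [(0, 2), (1, 2)]
```

Categories 0 and 1 correlate near $1$ and are excluded; each pairs with the independent category 2 as a candidate for expert selection.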
2. Formal Framework and Mathematical Construction
Given a benchmark with sub-categories $C = \{c_1, \dots, c_m\}$ and a pool of candidate models $\mathcal{M} = \{M_1, \dots, M_n\}$, SoCE proceeds as follows:
- Correlation analysis: Compute the Pearson correlation matrix $R \in \mathbb{R}^{m \times m}$, with elements $R_{ij}$ measuring the correlation of model performances on categories $c_i$ and $c_j$. Category pairs with $R_{ij} < \tau$ (for a chosen threshold $\tau$) are identified as "weakly correlated," and the resulting set of categories is denoted $L$.
- Expert selection: For each weakly correlated category $c \in L$, select the "expert" model $M_c^{*} = \arg\max_{M \in \mathcal{M}} \mathrm{Perf}_c(M)$.
- Non-uniform weighted averaging: Form the soup model
$$\theta_{\mathrm{soup}} = \sum_{c \in L} w_c \, \theta_{M_c^{*}},$$
with weights $w_c \geq 0$ and $\sum_{c \in L} w_c = 1$. Optimize $\mathbf{w}$ to maximize aggregate performance:
$$\mathbf{w}^{*} = \arg\max_{\mathbf{w}} \mathrm{Perf}\big(\theta_{\mathrm{soup}}(\mathbf{w})\big).$$
A discrete grid search over the simplex is used for $\mathbf{w}$ (plus the uniform case).
This non-uniform, convex combination targets enhanced global performance, as opposed to the suboptimality of uniform-weighted soups (Maiti et al., 17 Nov 2025).
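The convex combination above can be sketched directly on parameter dictionaries. This is a minimal illustration assuming matched architectures (identical parameter keys) and full-precision arrays, not the authors' implementation; checkpoints are represented as plain dicts of numpy arrays.

```python
# Minimal sketch of theta_soup = sum_c w_c * theta_{M*_c}, merged
# key-by-key with no post-merge fine-tuning.
import numpy as np

def soup_state_dict(expert_state_dicts, weights):
    """Merge expert checkpoints key-by-key with convex weights."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights must lie on the simplex"
    return {k: sum(w * sd[k] for w, sd in zip(weights, expert_state_dicts))
            for k in expert_state_dicts[0]}

expert_a = {"layer.weight": np.array([1.0, 2.0])}
expert_b = {"layer.weight": np.array([3.0, 4.0])}
merged = soup_state_dict([expert_a, expert_b], [0.75, 0.25])
print(merged["layer.weight"])  # → [1.5 2.5]
```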
3. Algorithmic Implementation and Complexity
The SoCE meta-algorithm:
- Input: Benchmark with categories $C = \{c_1, \dots, c_m\}$; models $\mathcal{M} = \{M_1, \dots, M_n\}$; correlation threshold $\tau$.
- Step 1: Compute the Pearson matrix $R$; form the set $L$ of weakly correlated categories.
- Step 2: For each $c \in L$, identify the expert model $M_c^{*} = \arg\max_{M \in \mathcal{M}} \mathrm{Perf}_c(M)$.
- Step 3: Perform grid search over weight vectors $\mathbf{w}$ on the simplex; for each $\mathbf{w}$, build $\theta_{\mathrm{soup}}(\mathbf{w})$ and evaluate performance.
- Step 4: Output $\theta_{\mathrm{soup}}(\mathbf{w}^{*})$.
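Steps 3–4 can be sketched as a discrete search over the simplex. Here `evaluate_fn` is a hypothetical stand-in for running the benchmark on the merged model, and the grid step of $1/4$ is an illustrative choice.

```python
# Discrete grid search over convex weight vectors (Steps 3-4).
# Fractions avoid floating-point drift when checking that candidate
# weights sum exactly to one.
from fractions import Fraction
from itertools import product

def enumerate_simplex(n_experts, step=Fraction(1, 4)):
    """Yield all nonnegative weight vectors on the step-grid summing to 1."""
    levels = int(1 / step)
    for combo in product(range(levels + 1), repeat=n_experts):
        if sum(combo) == levels:
            yield tuple(float(c * step) for c in combo)

def grid_search_weights(n_experts, evaluate_fn, step=Fraction(1, 4)):
    """Return (best_weights, best_score) maximizing evaluate_fn(weights)."""
    return max(((w, evaluate_fn(w))
                for w in enumerate_simplex(n_experts, step)),
               key=lambda t: t[1])

# Toy objective peaking at weights (0.75, 0.25) for two experts.
best_w, best_score = grid_search_weights(2, lambda w: -(w[0] - 0.75) ** 2)
print(best_w)  # → (0.75, 0.25)
```

In practice `evaluate_fn` would merge the experts with the candidate weights (as in the souping step) and score the merged model on the benchmark, which is the dominant cost of the search.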
Complexity:
- Correlation calculation: $O(n \, m^2)$ for $m$ categories over $n$ models.
- Expert selection: $O(n \, |L|)$.
- Grid search: on the order of $g^{|L|}$ candidate weight vectors for $g$ grid levels per weight (practical for moderate $|L|$; typically 3–5 experts).
- Each candidate weight vector requires $m$ per-category performance evaluations.
No additional fine-tuning or weight alignment is performed post-merge; full-precision weights are combined directly.
4. Empirical Validation and Comparative Performance
SoCE has been validated on diverse benchmarks and model sizes:
| Benchmark | Model Pool | Best Single | Uniform Soup | Uniform+Selection | SoCE (best) |
|---|---|---|---|---|---|
| BFCL (70B) | xLAM-2-70b, CoALM-70B, ... | 78.56% | 68.33% | 78.40% | 80.68% |
| BFCL (8B) | xLAM-2-8b, ToolACE-8B, ... | 72.37% | 69.80% | 74.01% | 76.50% |
| MGSM (7B) | MathOctopus variants | 50.9% | 47.0% | 47.8% | 51.7% |
| ∞-Bench (70B) | LLaMA-3 derived (5 models) | 27.44% | 27.44% | 27.85% | 28.00% |
On BFCL 70B, the selection step adds roughly $10$ points absolute over the uniform soup ($68.33\% \to 78.40\%$), and non-uniform weight tuning adds a further $2.28$ points, achieving a new state of the art of $80.68\%$ and surpassing the previous best (Maiti et al., 17 Nov 2025). On MGSM, SoCE surpasses the best single model by $0.8$ points ($50.9\% \to 51.7\%$), whereas uniform souping actually induces a regression.
Ablation studies confirm that the two main ingredients—(i) selection of specialists and (ii) non-uniform weighting—each provide significant and additive performance gains.
5. Extensions, Related Methods, and Generalizations
SoCE can be considered a specific instantiation of the general model souping concept, with distinguishing aspects:
- Benchmarks as category decompositions: SoCE assumes a benchmark with semantically meaningful, weakly-correlated sub-categories.
- Selection phase: Rather than merging all available models, only those maximizing per-category (anti-)specialization are chosen.
- Weight optimization: Empirically-tuned, non-uniform convex averaging achieves optimal integration of expertise.
Related approaches include:
- Soup-of-Experts (Ablin et al., 3 Feb 2025): A general method for amortizing specialist selection, where a bank of expert parameter vectors allows instantiating arbitrary specialists via a learned, input-conditioned convex combination. The coefficients for combining experts are produced by a small neural network over the desired domain mixture, yielding a parameter vector adapted to any chosen mixture at inference. Amortized pretraining provides a large speedup for deploying many customized specialist models at a fixed parameter budget.
- SoupLM (Bai et al., 2024): Proposes several model-soup strategies (Vanilla Soup—uniform averaging; Learnable Soup—module-wise learnable weights; Regularized Soup—with sparsity-promoting penalties) for integrating models specialized on different modalities or tasks, notably for combining language and vision-language LLMs. The learnable soup variant confirms that domain-specific knowledge concentrates in specific sub-modules, supporting the core SoCE rationale.
- Multi-source evidence fusion (Manikonda et al., 2017): The SoCE principle is also exemplified in a computational social science study of public and expert Twitter discourse, where merging category-level (lay) and expert signals yields insights unattainable from either view alone. This demonstrates that SoCE concepts generalize to evidence-combination pipelines outside of neural parameter interpolation.
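The Soup-of-Experts routing idea can be illustrated with a toy sketch (not the authors' code): a hypothetical learned map `routing_matrix` converts a desired domain mixture into softmax coefficients over the expert bank, so a new specialist is built by averaging rather than retraining.

```python
# Toy sketch of amortized specialist instantiation: alpha = softmax(W h)
# gives convex coefficients over an expert parameter bank, and
# theta(h) = sum_k alpha_k(h) * theta_k is the instantiated specialist.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def instantiate_specialist(expert_bank, routing_matrix, domain_mixture):
    """expert_bank: (n_experts, d) parameter vectors; routing_matrix:
    hypothetical learned (n_experts, n_domains) map; domain_mixture
    is a probability vector over domains."""
    alpha = softmax(routing_matrix @ domain_mixture)
    return alpha @ expert_bank

bank = np.array([[1.0, 1.0, 1.0],   # expert 0 parameters
                 [2.0, 2.0, 2.0]])  # expert 1 parameters
routing = np.eye(2)                 # stand-in for a learned routing map
theta = instantiate_specialist(bank, routing, np.array([1.0, 0.0]))
print(theta.shape)  # → (3,)
```

A mixture concentrated on domain 0 yields parameters pulled toward expert 0, without any per-specialist training run.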
6. Limitations, Practical Constraints, and Deployment
- Category structure: SoCE demands benchmarks with meaningful, sufficiently uncorrelated sub-categories and availability of a diverse pool of models exhibiting specialization.
- Grid search scalability: The grid search for non-uniform weights becomes combinatorially expensive as the number of experts $|L|$ grows.
- Homogeneity assumption: SoCE has, to date, been applied only to models with matched architectures and compatible parameterizations; naïve merging across different architectures or pretraining initializations remains unexplored.
- Estimate reliability: Accurate estimation of inter-category correlation and per-category performance is required, necessitating large numbers of candidate models and robust evaluation protocols.
- Inference time: The deployed model is a single set of parameters, with no added runtime cost compared to baselines.
7. Future Directions and Open Challenges
Several prospective avenues are proposed for the extension and refinement of Category-Expert Souping:
- Automatic category clustering: Systematic discovery or unsupervised clustering of weakly/anti-correlated sub-categories, avoiding prespecified benchmark splits.
- Continuous/grid-free weight optimization: Moving beyond discrete grid search to gradient-based, end-to-end learning of optimal soup weights.
- Amortized/conditional specialization: Adopting the Soup-of-Experts paradigm, where a conditional function outputs weights for expert interpolation, extending the flexibility to arbitrary, user-defined domain mixtures.
- Integration with Mixture-of-Experts and adapter models: Applying SoCE strategies within multi-task or conditional architectures, merging both full models and more granular adapters or routing subnetworks.
- Cross-architecture and modality generalization: Enabling SoCE-style methods where the candidate experts differ substantially in backbone architecture, initialization, or task (e.g., language–vision–speech integration).
A plausible implication is that continued expansion of SoCE concepts will enable the cost-effective deployment of families of highly specialized, robust, and small-footprint models, spanning both structured benchmarks and complex, messy real-world tasks.
Key References:
- "Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance" (Maiti et al., 17 Nov 2025)
- "Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging" (Ablin et al., 3 Feb 2025)
- "SoupLM: Model Integration in Large Language and Multi-Modal Models" (Bai et al., 2024)
- "Tweeting AI: Perceptions of Lay vs Expert Twitterati" (Manikonda et al., 2017)