Category-Expert Souping (SoCE): Efficient Integration
- Category-Expert Souping (SoCE) is a method that integrates specialized models using non-uniform, data-driven weight averaging to optimize performance across weakly correlated task categories.
- It selects domain experts based on per-category performance and employs a grid search over weight vectors to construct a robust combined model.
- Empirical results show that SoCE outperforms uniform model souping and standard ensembles, achieving state-of-the-art results on benchmarks such as BFCL and MGSM.
Category-Expert Souping (SoCE) is a principled approach for constructing high-performing models by leveraging the specialization of domain or category experts and combining them through non-uniform, data-driven weight averaging. The SoCE paradigm emerged as a resource- and compute-efficient alternative to traditional approaches such as uniform model souping and model ensembling, aiming to maximize overall performance and robustness in multi-category machine learning settings, especially for large language models (LLMs).
1. Motivation and Empirical Observations
Traditional model souping—uniformly averaging the weights of several models—was motivated by empirical findings that such merging can sometimes improve generalization, but often yields modest gains or even negative transfer, especially when individual models are best-in-class for only certain sub-tasks. The SoCE approach was developed in response to two key empirical findings:
- Task specialization in models: Fine-tuned models often become “experts” on specific sub-categories of a benchmark, excelling in narrow functions but underperforming elsewhere.
- Low inter-category performance correlation: On composite benchmarks, the performance of models across task categories often shows weak or negative correlations. For example, on the Berkeley Function Calling Leaderboard (BFCL), the "Multi-turn-base" and "Live Accuracy" categories have a Pearson correlation close to zero, indicating nearly orthogonal competencies. In contrast, closely related categories (e.g., multi-turn function-calling variants) have high positive correlations (up to $0.98$) (Maiti et al., 17 Nov 2025).
Recognizing and exploiting the weak correlation structure enables SoCE to assemble a model from mutually complementary experts, rather than diluting task-specific strength through uniform averaging.
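The correlation-analysis step can be sketched on synthetic data; the score matrix, category count, and threshold $\tau = 0.3$ below are illustrative assumptions, not values from the paper.

```python
# Sketch: identifying weakly correlated benchmark categories from a
# models x categories score matrix (synthetic, illustrative data).
import numpy as np

def weakly_correlated_pairs(scores, tau=0.3):
    """scores: (n_models, m_categories) per-category accuracies.
    Returns category index pairs whose Pearson correlation across
    models falls below tau (near-orthogonal competencies)."""
    corr = np.corrcoef(scores, rowvar=False)  # m x m category correlations
    m = corr.shape[0]
    return [(i, j) for i in range(m) for j in range(i + 1, m)
            if corr[i, j] < tau]

# Toy pool of 6 models over 3 categories: categories 0 and 1 track each
# other closely, while category 2 varies independently of both.
scores = np.array([
    [0.10, 0.12, 0.90],
    [0.30, 0.28, 0.20],
    [0.50, 0.52, 0.70],
    [0.60, 0.61, 0.10],
    [0.80, 0.79, 0.60],
    [0.90, 0.91, 0.30],
])
print(weakly_correlated_pairs(scores))  # → [(0, 2), (1, 2)]
```

Categories 0 and 1 correlate near $1$ and are excluded; each pairs with the independent category 2 as a candidate for expert selection.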
2. Formal Framework and Mathematical Construction
Given a benchmark with sub-categories $C = \{c_1, \dots, c_m\}$ and a pool of candidate models $\mathcal{M} = \{M_1, \dots, M_n\}$, SoCE proceeds as follows:
- Correlation analysis: Compute the Pearson correlation matrix $R \in \mathbb{R}^{m \times m}$, with elements $R_{ij}$ measuring the correlation of model performances on categories $c_i$ and $c_j$. Category pairs with $R_{ij} < \tau$ (for a chosen threshold $\tau$) are identified as "weakly correlated," and the resulting set of categories is denoted $L$.
- Expert selection: For each weakly correlated category $c \in L$, select the "expert" model $M_c^{*} = \arg\max_{M \in \mathcal{M}} \mathrm{Perf}_c(M)$.
- Non-uniform weighted averaging: Form the soup model
$$\theta_{\mathrm{soup}} = \sum_{c \in L} w_c \, \theta_{M_c^{*}},$$
with weights $w_c \geq 0$ and $\sum_{c \in L} w_c = 1$. Optimize $\mathbf{w}$ to maximize aggregate performance:
$$\mathbf{w}^{*} = \arg\max_{\mathbf{w}} \mathrm{Perf}\big(\theta_{\mathrm{soup}}(\mathbf{w})\big).$$
A discrete grid search over the simplex is used for $\mathbf{w}$ (plus the uniform case).
This non-uniform, convex combination targets enhanced global performance, as opposed to the suboptimality of uniform-weighted soups (Maiti et al., 17 Nov 2025).
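The convex combination above can be sketched directly on parameter dictionaries. This is a minimal illustration assuming matched architectures (identical parameter keys) and full-precision arrays, not the authors' implementation; checkpoints are represented as plain dicts of numpy arrays.

```python
# Minimal sketch of theta_soup = sum_c w_c * theta_{M*_c}, merged
# key-by-key with no post-merge fine-tuning.
import numpy as np

def soup_state_dict(expert_state_dicts, weights):
    """Merge expert checkpoints key-by-key with convex weights."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights must lie on the simplex"
    return {k: sum(w * sd[k] for w, sd in zip(weights, expert_state_dicts))
            for k in expert_state_dicts[0]}

expert_a = {"layer.weight": np.array([1.0, 2.0])}
expert_b = {"layer.weight": np.array([3.0, 4.0])}
merged = soup_state_dict([expert_a, expert_b], [0.75, 0.25])
print(merged["layer.weight"])  # → [1.5 2.5]
```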
3. Algorithmic Implementation and Complexity
The SoCE meta-algorithm:
- Input: Benchmark with categories $C = \{c_1, \dots, c_m\}$; models $\mathcal{M} = \{M_1, \dots, M_n\}$; correlation threshold $\tau$.
- Step 1: Compute the Pearson matrix $R$; form the set $L$ of weakly correlated categories.
- Step 2: For each $c \in L$, identify the expert model $M_c^{*} = \arg\max_{M \in \mathcal{M}} \mathrm{Perf}_c(M)$.
- Step 3: Perform grid search over weight vectors $\mathbf{w}$ on the simplex; for each $\mathbf{w}$, build $\theta_{\mathrm{soup}}(\mathbf{w})$ and evaluate performance.
- Step 4: Output $\theta_{\mathrm{soup}}(\mathbf{w}^{*})$.
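Steps 3–4 can be sketched as a discrete search over the simplex. Here `evaluate_fn` is a hypothetical stand-in for running the benchmark on the merged model, and the grid step of $1/4$ is an illustrative choice.

```python
# Discrete grid search over convex weight vectors (Steps 3-4).
# Fractions avoid floating-point drift when checking that candidate
# weights sum exactly to one.
from fractions import Fraction
from itertools import product

def enumerate_simplex(n_experts, step=Fraction(1, 4)):
    """Yield all nonnegative weight vectors on the step-grid summing to 1."""
    levels = int(1 / step)
    for combo in product(range(levels + 1), repeat=n_experts):
        if sum(combo) == levels:
            yield tuple(float(c * step) for c in combo)

def grid_search_weights(n_experts, evaluate_fn, step=Fraction(1, 4)):
    """Return (best_weights, best_score) maximizing evaluate_fn(weights)."""
    return max(((w, evaluate_fn(w))
                for w in enumerate_simplex(n_experts, step)),
               key=lambda t: t[1])

# Toy objective peaking at weights (0.75, 0.25) for two experts.
best_w, best_score = grid_search_weights(2, lambda w: -(w[0] - 0.75) ** 2)
print(best_w)  # → (0.75, 0.25)
```

In practice `evaluate_fn` would merge the experts with the candidate weights (as in the souping step) and score the merged model on the benchmark, which is the dominant cost of the search.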
Complexity:
- Correlation calculation: $O(n \, m^2)$ for $m$ categories over $n$ models.
- Expert selection: $O(n \, |L|)$.
- Grid search: on the order of $g^{|L|}$ candidate weight vectors for $g$ grid levels per weight (practical for moderate $|L|$; typically 3–5 experts).
- Each candidate weight vector requires $m$ per-category performance evaluations.
No additional fine-tuning or weight alignment is performed post-merge; full-precision weights are combined directly.
4. Empirical Validation and Comparative Performance
SoCE has been validated on diverse benchmarks and model sizes:
| Benchmark | Model Pool | Best Single | Uniform Soup | Uniform+Selection | SoCE (best) |
|---|---|---|---|---|---|
| BFCL (70B) | xLAM-2-70b, CoALM-70B, ... | 78.56% | 68.33% | 78.40% | 80.68% |
| BFCL (8B) | xLAM-2-8b, ToolACE-8B, ... | 72.37% | 69.80% | 74.01% | 76.50% |
| MGSM (7B) | MathOctopus variants | 50.9% | 47.0% | 47.8% | 51.7% |
| ∞-Bench (70B) | LLaMA-3 derived (5 models) | 27.44% | 27.44% | 27.85% | 28.00% |
On BFCL 70B, the selection step adds roughly $10$ points absolute over the uniform soup ($68.33\% \to 78.40\%$), and non-uniform weight tuning adds a further $2.28$ points, achieving a new state of the art of $80.68\%$ and surpassing the previous best (Maiti et al., 17 Nov 2025). On MGSM, SoCE surpasses the best single model by $0.8$ points ($50.9\% \to 51.7\%$), whereas uniform souping actually induces a regression.
Ablation studies confirm that the two main ingredients—(i) selection of specialists and (ii) non-uniform weighting—each provide significant and additive performance gains.
5. Extensions, Related Methods, and Generalizations
SoCE can be considered a specific instantiation of the general model souping concept, with distinguishing aspects:
- Benchmarks as category decompositions: SoCE assumes a benchmark with semantically meaningful, weakly-correlated sub-categories.
- Selection phase: Rather than merging all available models, only those maximizing per-category (anti-)specialization are chosen.
- Weight optimization: Empirically-tuned, non-uniform convex averaging achieves optimal integration of expertise.
Related approaches include:
- Soup-of-Experts (Ablin et al., 3 Feb 2025): A general method for amortizing specialist selection, where a bank of expert parameter vectors allows instantiating arbitrary specialists via a learned, input-conditioned convex combination. The coefficients for combining experts are produced by a small neural network over the desired domain mixture, yielding a parameter vector adapted to any chosen mixture at inference. Amortized pretraining provides a large speedup for deploying many customized specialist models at a fixed parameter budget.
- SoupLM (Bai et al., 2024): Proposes several model-soup strategies (Vanilla Soup—uniform averaging; Learnable Soup—module-wise learnable weights; Regularized Soup—with sparsity-promoting penalties) for integrating models specialized on different modalities or tasks, notably for combining language and vision-language LLMs. The learnable soup variant confirms that domain-specific knowledge concentrates in specific sub-modules, supporting the core SoCE rationale.
- Multi-source evidence fusion (Manikonda et al., 2017): The SoCE principle is also exemplified in a computational social science study of public and expert Twitter discourse, where merging category-level (lay) and expert signals yields insights unattainable from either view alone. This demonstrates that SoCE concepts generalize to evidence-combination pipelines outside of neural parameter interpolation.
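The Soup-of-Experts routing idea can be illustrated with a toy sketch (not the authors' code): a hypothetical learned map `routing_matrix` converts a desired domain mixture into softmax coefficients over the expert bank, so a new specialist is built by averaging rather than retraining.

```python
# Toy sketch of amortized specialist instantiation: alpha = softmax(W h)
# gives convex coefficients over an expert parameter bank, and
# theta(h) = sum_k alpha_k(h) * theta_k is the instantiated specialist.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def instantiate_specialist(expert_bank, routing_matrix, domain_mixture):
    """expert_bank: (n_experts, d) parameter vectors; routing_matrix:
    hypothetical learned (n_experts, n_domains) map; domain_mixture
    is a probability vector over domains."""
    alpha = softmax(routing_matrix @ domain_mixture)
    return alpha @ expert_bank

bank = np.array([[1.0, 1.0, 1.0],   # expert 0 parameters
                 [2.0, 2.0, 2.0]])  # expert 1 parameters
routing = np.eye(2)                 # stand-in for a learned routing map
theta = instantiate_specialist(bank, routing, np.array([1.0, 0.0]))
print(theta.shape)  # → (3,)
```

A mixture concentrated on domain 0 yields parameters pulled toward expert 0, without any per-specialist training run.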
6. Limitations, Practical Constraints, and Deployment
- Category structure: SoCE demands benchmarks with meaningful, sufficiently uncorrelated sub-categories and availability of a diverse pool of models exhibiting specialization.
- Grid search scalability: The grid search for non-uniform weights becomes combinatorially expensive as the number of experts $|L|$ grows.
- Homogeneity assumption: SoCE has, to date, been applied only to models with matched architectures and compatible parameterizations; naïve merging across different architectures or pretraining initializations remains unexplored.
- Estimate reliability: Accurate estimation of inter-category correlation and per-category performance is required, necessitating large numbers of candidate models and robust evaluation protocols.
- Inference time: The deployed model is a single set of parameters, with no added runtime cost compared to baselines.
7. Future Directions and Open Challenges
Several prospective avenues are proposed for the extension and refinement of Category-Expert Souping:
- Automatic category clustering: Systematic discovery or unsupervised clustering of weakly/anti-correlated sub-categories, avoiding prespecified benchmark splits.
- Continuous/grid-free weight optimization: Moving beyond discrete grid search to gradient-based, end-to-end learning of optimal soup weights.
- Amortized/conditional specialization: Adopting the Soup-of-Experts paradigm, where a conditional function outputs weights for expert interpolation, extending the flexibility to arbitrary, user-defined domain mixtures.
- Integration with Mixture-of-Experts and adapter models: Applying SoCE strategies within multi-task or conditional architectures, merging both full models and more granular adapters or routing subnetworks.
- Cross-architecture and modality generalization: Enabling SoCE-style methods where the candidate experts differ substantially in backbone architecture, initialization, or task (e.g., language–vision–speech integration).
A plausible implication is that continued expansion of SoCE concepts will enable the cost-effective deployment of families of highly specialized, robust, and small-footprint models, spanning both structured benchmarks and complex, messy real-world tasks.
Key References:
- "Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance" (Maiti et al., 17 Nov 2025)
- "Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging" (Ablin et al., 3 Feb 2025)
- "SoupLM: Model Integration in Large Language and Multi-Modal Models" (Bai et al., 2024)
- "Tweeting AI: Perceptions of Lay vs Expert Twitterati" (Manikonda et al., 2017)