
Test-Time Ensembling Strategy

Updated 31 January 2026
  • Test-time ensembling is a technique that aggregates outputs from multiple model states or input transformations at inference, enhancing performance and calibration.
  • It leverages weight-space, temporal, and input-space aggregation methods to achieve improved accuracy, robustness to distribution shifts, and resistance to adversarial attacks.
  • Adaptive weighting schemes and ensemble diversity optimize the trade-off between computational cost and predictive gains, supporting continual learning and uncertainty estimation.

A test-time ensembling strategy refers to any protocol that aggregates predictions, representations, or scores from multiple models, model states, or input transformations at inference time, rather than during training. Such strategies systematically leverage redundancy and diversity to drive gains in accuracy, uncertainty estimation, robustness to distributional shift, stability in continual learning, calibration, and resilience against adversarial perturbations. Contemporary approaches cover weight-space ensembling, input transformation aggregation, latent-space perturbation, efficient snapshotting, model portfolio selection, and adaptive fusion in reasoning systems. This entry synthesizes key classes and technical formalizations from recent arXiv literature, illuminating the modern landscape of test-time ensembling.

1. Mathematical Principles and General Frameworks

The prototypical test-time ensemble forms its prediction as an aggregation over K constituent sources, typically distinct models, parameterizations, or input variants. For classification,

\hat{p}_{\rm ens}(y \mid x) = \frac{1}{K} \sum_{k=1}^K p(y \mid x, \omega_k)

where ω_k indexes either model weights (as in deep ensembles (Ashukha et al., 2020), snapshot ensembles (Proscura et al., 2022), FGE/PFGE (Guo et al., 2022)), configuration, or input transformation (augmentation, translation, generative view). More generally, weighted aggregation is used:

\hat{p}_{\rm ens}(y \mid x) = \sum_{k=1}^K w_k \, p(y \mid x, \omega_k)

with ∑_{k=1}^K w_k = 1.
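The weighted aggregation above is straightforward to implement; a minimal NumPy sketch (uniform weights by default):

```python
import numpy as np

def ensemble_predict(member_probs, weights=None):
    """Aggregate per-member class probabilities at test time.

    member_probs: array of shape (K, C), one probability vector per
    ensemble member; weights: optional length-K convex weights
    (default: uniform average, the 1/K case above).
    """
    member_probs = np.asarray(member_probs, dtype=float)
    K = member_probs.shape[0]
    if weights is None:
        weights = np.full(K, 1.0 / K)   # p_ens = (1/K) sum_k p_k
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "weights must sum to 1"
    return weights @ member_probs       # sum_k w_k * p(y | x, w_k)

# Three members disagree on a 2-class problem; uniform averaging
# smooths their predictions.
probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
p_ens = ensemble_predict(probs)         # -> [0.6, 0.4]
```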

For representation ensembling (as in MeTTA (Ashukha et al., 2021)):

\hat{z}(x) = \frac{1}{M} \sum_{m=1}^M f(t_m(x); w)

for test-time augmentations {t_m}. Ensemble aggregation may also occur in score space (QA spans (Ferritto et al., 2019)), logit space, or at the level of intermediate reasoning trajectories.

Augmentation and diversity are crucial; several works formally establish that ensemble improvement rates hinge on disagreement among members, measured by the disagreement-error ratio R = D / E_avg, which governs whether ensembling is effective (Theisen et al., 2023).

2. Weight-Space and Temporal Ensembling

Weight ensembling combines two or more parameter vectors at test time via convex combination or EMA. In ROID (Marsden et al., 2023),

\theta_{t+1} = \alpha\, \theta_t + (1-\alpha)\, \theta_0

where θ_0 is the fixed source model and θ_t is adapted incrementally. This protects against catastrophic forgetting by retaining continual access to the original model basin.
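Weight interpolation of this kind reduces to a per-parameter convex combination; a minimal sketch over named parameter dictionaries (stand-ins for framework state dicts):

```python
def interpolate_weights(theta_t, theta_0, alpha):
    """theta_next = alpha * theta_t + (1 - alpha) * theta_0,
    applied parameter-by-parameter (dicts of name -> value)."""
    return {name: alpha * theta_t[name] + (1 - alpha) * theta_0[name]
            for name in theta_0}

theta_0 = {"w": 1.0, "b": 0.0}   # frozen source model
theta_t = {"w": 3.0, "b": 2.0}   # incrementally adapted model
theta_next = interpolate_weights(theta_t, theta_0, alpha=0.5)
# -> {"w": 2.0, "b": 1.0}
```

In practice the same loop runs over a deep network's tensors; keeping α < 1 anchors the adapted model to the source basin.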

For LLMs, WiSE-FT (Dang et al., 14 Apr 2025) interpolates early and late fine-tuned weights:

\theta_{\rm WiSE} = \delta\, w_0 + (1-\delta)\, w_t

yielding improved Best@K and Pass@K for chain-of-thought scaling, with simultaneous reduction in bias and variance in the generation distribution.

EMA temporal ensembling (Soutif-Cormerais et al., 2023) for continual learning applies

\theta_t^{\rm EMA} = \lambda\, \theta_{t-1}^{\rm EMA} + (1-\lambda)\, \theta_t

where 0 ≤ λ < 1 is the decay factor, leading to drastic gains in accuracy and stability with negligible compute cost. PFGE (Guo et al., 2022) further shows that parsimonious ensembles of weight-averaged models can match or surpass larger ensembles with a fraction of memory and inference cost.
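The EMA recursion can be sketched with the same dictionary convention; after t steps on a constant target the EMA approaches it geometrically (1 − λ^t of the gap is closed):

```python
def ema_update(theta_ema, theta, lam):
    """theta_ema <- lam * theta_ema + (1 - lam) * theta, per parameter."""
    return {k: lam * theta_ema[k] + (1 - lam) * theta[k] for k in theta}

theta_ema = {"w": 0.0}
for w in [1.0, 1.0, 1.0]:                 # three adaptation steps
    theta_ema = ema_update(theta_ema, {"w": w}, lam=0.9)
# after 3 steps toward w = 1: w_ema = 1 - 0.9**3 = 0.271
```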

3. Input/Representation-Space Aggregation

Test-time augmentation and input-space ensembling aggregate predictions over multiple transformed or translated versions of the input. Standard TTA (Shanmugam et al., 2020, Pérez et al., 2021) applies deterministic label-preserving transforms T_1, …, T_M and averages logits or probabilities.

Advanced approaches address the limitations of uniform averaging. Adaptive learned aggregators (AugTTA, ClassTTA) (Shanmugam et al., 2020) replace simple means with transformation-class-dependent weights,

g_\Theta(Z) = \sum_{m=1}^M \Theta_{m,c} \cdot z_{m,c}

optimizing weights to minimize cross-entropy over a held-out validation set, thus mitigating systematic prediction flips under TTA.
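The aggregator itself is a weighted elementwise sum over the (augmentation, class) grid; a minimal sketch with a fixed Θ (in ClassTTA, Θ is fit on held-out data rather than hand-set as here):

```python
import numpy as np

def class_tta_aggregate(Z, Theta):
    """Learned-weight TTA aggregation.

    Z: (M, C) logits, one row per augmented view;
    Theta: (M, C) per-augmentation, per-class weights.
    Output g[c] = sum_m Theta[m, c] * Z[m, c].
    """
    return (np.asarray(Theta) * np.asarray(Z)).sum(axis=0)

Z = np.array([[2.0, 0.0],     # identity view
              [0.0, 2.0]])    # flipped view (trusted less)
Theta = np.array([[0.8, 0.8],
                  [0.2, 0.2]])
g = class_tta_aggregate(Z, Theta)   # -> [1.6, 0.4]
```

Down-weighting unreliable transforms in Θ is what suppresses the systematic prediction flips that uniform averaging suffers.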

Mean Embeddings with TTA (MeTTA) (Ashukha et al., 2021) averages pre-softmax embeddings over augmented views:

\bar{z}(x) = \frac{1}{M} \sum_{m=1}^M f(t_m(x); w)

allowing representation-level invariance and up to +2.1% gain in ImageNet linear evaluation, with immediate integration into self-supervised pipelines.

Image-to-image translation ensembling (Scalbert et al., 2022) projects a test image into the style of each source domain using a generator (StarGANv2), then ensembles classifier outputs from each translated variant, optionally weighting by discriminator scores to select high-fidelity projections.

Deep generative view ensembling (Chai et al., 2021) uses latent-space perturbations (e.g., style mixing) with a pretrained GAN, forming soft aggregates over classifier outputs from each synthesized view, with gains subject to the quality of inversion and reconstruction.

Diffusion-based pseudo-label ensembling (Raman et al., 2023) applies a pretrained DDPM to denoise corrupted inputs, then constructs unsupervised pseudo-labels by averaging predictions over these denoised views, stabilizing self-training adaptation on target domains.
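The pseudo-labeling loop reduces to averaging classifier outputs over several stochastic denoisings; a toy sketch in which `denoise` and `classifier` are hypothetical stand-ins for a DDPM sampler and a source classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudo_label(x_corrupt, denoise, classifier, n_views=8):
    """Average class probabilities over several denoised views of a
    corrupted input, returning a hard pseudo-label plus the soft
    average used as a self-training target."""
    probs = [classifier(denoise(x_corrupt)) for _ in range(n_views)]
    avg = np.mean(probs, axis=0)
    return int(np.argmax(avg)), avg

# Toy stand-ins: "denoising" jitters the input slightly; the
# "classifier" thresholds the input mean.
denoise = lambda x: x + rng.normal(0.0, 0.01, size=x.shape)
classifier = lambda x: (np.array([1.0, 0.0]) if x.mean() < 0.5
                        else np.array([0.0, 1.0]))
label, soft = pseudo_label(np.full(4, 0.9), denoise, classifier)
# label -> 1
```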

4. Ensemble Selection and Weighting Schemes

Optimal test-time performance often requires selective weighting, not uniform aggregation. For snapshot ensembles (Proscura et al., 2022), weights are set via monotonically increasing functions of training likelihood or inverse loss:

  • Direct likelihood weighting: w_i ∝ L_i
  • Inverse-loss weighting: w_i ∝ 1/l_i
  • Temperature-scaled: w_i = exp(L_i / τ)

Aggregated predictions are:

\hat{y}(x) = \sum_{k=1}^N w_k f(x; \theta_k)
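The temperature-scaled variant is a softmax over snapshot likelihoods; a minimal sketch:

```python
import numpy as np

def snapshot_weights(likelihoods, tau=1.0):
    """Temperature-scaled snapshot weights w_i ∝ exp(L_i / tau),
    normalized to sum to 1. Small tau sharpens toward the best
    snapshot; large tau approaches uniform averaging."""
    L = np.asarray(likelihoods, dtype=float)
    w = np.exp(L / tau)
    return w / w.sum()

# Snapshot 3 has twice the likelihood ratio of the others:
w = snapshot_weights([0.0, 0.0, np.log(2.0)])
# exp -> [1, 1, 2], normalized -> [0.25, 0.25, 0.5]
```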

Model portfolios (Chroma (Kayaalp et al., 7 Oct 2025)) in time-series forecasting select either a single best specialist or meta-learn greedy weights over specialists for bias reduction, outperforming naive uniform ensembling and achieving competitive performance with dramatically reduced parameter counts and inference cost.

Entropy-minimized adapter ensembles (EMEA (Wang et al., 2021)) optimize per-sentence adapter weights via entropy minimization over output probabilities, improving token-level F1 by up to 2.6 points over uniform and continual learning baselines.
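The core idea, choosing mixture weights that make the fused prediction maximally confident, can be sketched as follows; this is a simplified stand-in (softmax-parameterized weights, numerical gradient) rather than the EMEA implementation, which backpropagates through the model:

```python
import numpy as np

def softmax(v):
    z = np.exp(v - v.max())
    return z / z.sum()

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def emea_weights(adapter_probs, steps=200, lr=0.5):
    """Per-sample adapter weights via entropy minimization: weights
    are a softmax over logits a, descended on the entropy of the
    mixed prediction using central finite differences."""
    P = np.asarray(adapter_probs, dtype=float)   # (K, C) adapter outputs
    a = np.zeros(P.shape[0])
    obj = lambda v: entropy(softmax(v) @ P)
    for _ in range(steps):
        grad = np.zeros_like(a)
        for k in range(a.size):
            e = np.zeros_like(a)
            e[k] = 1e-5
            grad[k] = (obj(a + e) - obj(a - e)) / 2e-5
        a -= lr * grad
    return softmax(a)

# Adapter 0 is confident, adapter 1 is not; entropy minimization
# shifts nearly all weight onto the confident adapter.
w = emea_weights([[0.95, 0.05], [0.5, 0.5]])
```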

QA ensemble strategies (Ferritto et al., 2019) aggregate candidate spans using Max, Exponential Sum, Reciprocal Rank Sum, and Noisy-Or; normalization via logistic regression on top-1 scores further calibrates cross-system confidence before ensemble averaging.
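Of these combiners, Noisy-Or is the least obvious; it treats each system's score as an independent probability that the span is correct:

```python
import numpy as np

def noisy_or(scores):
    """Noisy-Or combination of per-system confidences for the same
    candidate span: 1 - prod_i (1 - s_i). Any single confident
    system drives the combined score toward 1."""
    s = np.asarray(scores, dtype=float)
    return 1.0 - np.prod(1.0 - s)

# Two systems each give the span confidence 0.5:
c = noisy_or([0.5, 0.5])   # -> 0.75
```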

5. Adaptive and Diversity-Aware Fusion for LLM Reasoning

Test-time ensembling is pivotal for robust multi-path reasoning in LLMs. AdaFuse (Cui et al., 9 Jan 2026) for ensemble decoding dynamically fuses word-level continuations from multiple models, using a start-of-word confidence criterion to trigger either greedy expansion or diversity-aware scaling (branching exploration). Fused spans are scored across models by normalized NLL, achieving an average relative improvement of 6.88% across QA, reasoning, and translation tasks.

Mitigating reasoning strategy-selection bias (TTS-Uniform (Wu et al., 22 Sep 2025)) ensures that the inference budget is uniformly allocated across solution strategies, identified by coarse/fine-grained prompting or reasoning tree extraction. High-entropy strategies are filtered prior to majority vote, significantly boosting Pass@K and majority-vote accuracy.

6. Theoretical Analysis, Disagreement, and Calibration

Recent theory establishes rigorous criteria for the effectiveness of test-time ensembling. The ensemble improvement rate I is linearly bounded by the disagreement-error ratio R:

R \geq I \geq \frac{2(K-1)}{K} R - \frac{3K-4}{K}

If R < 1 (disagreement below average error), ensemble gains are negligible and a single model suffices. For R > 1 (greater diversity), majority voting and adaptive aggregation yield substantial relative reductions in error (Theisen et al., 2023). These principles guide subset selection for multi-model pools and inform trade-offs between calibration, variance, and inference cost.
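The ratio R is cheap to estimate from hard predictions on a labeled sample; a minimal sketch:

```python
import numpy as np

def disagreement_error_ratio(preds, labels):
    """R = D / E_avg: mean pairwise disagreement between members
    divided by mean member error. preds: (K, N) hard predictions;
    labels: (N,) ground truth."""
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    K = preds.shape[0]
    E_avg = float((preds != labels).mean())
    pairs = [float((preds[i] != preds[j]).mean())
             for i in range(K) for j in range(i + 1, K)]
    D = float(np.mean(pairs))
    return D / E_avg

# Two members that err on disjoint examples: maximal diversity
# for their error level, so R = 2 > 1 and ensembling pays off.
preds = [[0, 0, 1, 0],     # wrong on example 2
         [0, 0, 0, 1]]     # wrong on example 3
labels = [0, 0, 0, 0]
R = disagreement_error_ratio(preds, labels)   # -> 2.0
```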

Multi-CLS BERT (Chang et al., 2022) demonstrates "mini-ensemble" behavior within a single transformer via multiple specialized CLS tokens, recovering the majority of a five-model ensemble’s benefits at a fraction of cost.

Uncertainty estimation for ensemble methods is best quantified on calibrated log-likelihood (CLL) or deep ensemble equivalent (DEE) score, not naive metrics such as ECE or Brier, which are both confounded by calibration bias and unaligned scoring (Ashukha et al., 2020). Performance plateaus in DEE signal optimal ensemble size for a given inference budget.

7. Robustness, Continual Learning, and Practical Considerations

Test-time transformation ensembling (TTE) (Pérez et al., 2021) strengthens empirical and certified adversarial robustness without retraining by aggregating over geometry-preserving transforms, e.g., flips and crops. In continual learning, EMA temporal ensembling and weight-space blend strategies (ROID (Marsden et al., 2023), PFGE (Guo et al., 2022)) protect against catastrophic forgetting and stability gaps.

Efficiency is a recurrent theme: PFGE, snapshot ensembling, Multi-CLS BERT, and Chroma portfolios attain ensemble-level generalization with drastically reduced memory and compute.

Best practices include:

  • Tuning ensemble size to maximize CLL and DEE
  • Adapting weights, selection, or fusion granularity per task and operational constraint
  • Using calibrating validation splits for aggregation weights
  • Avoiding ensemble construction in low-disagreement regimes

Test-time ensembling as a field spans weight-space methods, input/augmentation-space averaging, adaptive selection and fusion, and principled LLM reasoning scaling. It is theoretically and empirically justified as a foundational mechanism for robustness, uncertainty quantification, bias-variance tradeoff, and state-of-the-art generalization under resource constraints.
