Test-Time Adaptive Batched Ensemble Drafting (TABED)
- TABED is a technique that adaptively ensembles outputs from multiple models, dynamically allocating compute based on uncertainty and task diversity.
- It employs word-granular adaptive batch drafting and cross-model fusion to synchronize candidate drafts and trigger early stopping when needed.
- Empirical evaluations demonstrate that TABED improves accuracy and efficiency over static ensemble methods in both language and vision-language model applications.
Test-Time Adaptive Batched Ensemble Drafting (TABED) is a class of test-time inference procedures for LLMs (and, more generally, multi-modal generative models) that dynamically allocate computational resources, adaptively ensemble model outputs, and synchronize candidate draft generations in response to model uncertainty or task diversity. Developed to improve reliability, efficiency, and robustness over fixed-budget or static ensemble approaches, TABED explicitly structures the use of model ensembles, batching, and adaptive stopping criteria during inference. This approach has been demonstrated in general LLM applications, in speculative decoding for large vision-language models (LVLMs), and in the efficient approximation of best-of-N and best-of-∞ selection regimes for LLMs (Cui et al., 9 Jan 2026, Lee et al., 28 Jan 2026, Komiyama et al., 25 Sep 2025).
1. Foundations and Motivating Limitations of Static Ensembles
Traditional ensemble methods for LLM inference aggregate outputs from multiple models, typically via majority voting, averaging of log-likelihoods, or mixture-of-experts selection. However, static approaches—fixed granularity fusion, unvarying ensemble weights, or non-adaptive prompting—are limited by their inability to address instantaneous uncertainty, local context sensitivity, or scenario-specific variations. Moreover, fixed-size best-of-N generations impose unnecessarily high computational cost in easy cases and underperform in hard cases due to lack of adaptive sample allocation. These limitations motivate TABED’s core principles: dynamic candidate batching, uncertainty-driven ensembling, adaptive prompt/model selection, and early stopping.
2. Formal Workflow and Algorithmic Structure
2.1 Word-Granular Adaptive Batch Drafting
In approaches such as AdaFuse, each decoding round at test time operates at word granularity. Beginning from the current prefix, each model in the ensemble emits a candidate “word” (the span up to the next whitespace) (Cui et al., 9 Jan 2026). Decoding proceeds as follows:
- Low uncertainty: all models greedily commit up to M consecutive words, where M is the per-round word budget.
- High uncertainty: upon detecting low model confidence (via a margin or entropy criterion at the start of each word), each model explores a batch of diverse candidate words in parallel. Batches are formed by selecting the top-B candidates from the predicted next-token distribution, then greedily extending each to a full word.
2.2 Dynamic Trigger via Model Uncertainty
Batch drafting is triggered by a test-time uncertainty criterion:
- For model m_k with next-token distribution p_{m_k}(· | context) at step t, compute the margin Δ_t = p_(1) − p_(2), where p_(1) and p_(2) are the two largest next-token probabilities.
- If Δ_t < τ_Δ (for a threshold τ_Δ), or equivalently if the entropy exceeds a threshold, branching (batch drafting) is initiated.
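As a concrete illustration, the margin and entropy criteria above can be computed directly from a next-token distribution. This is a minimal sketch with hypothetical function names and a toy threshold, not the paper's implementation:

```python
import math

def should_branch(probs, margin_threshold=0.2):
    """Decide whether to trigger batch drafting.

    probs: next-token probability distribution (list of floats summing to 1).
    Branches when the top-1/top-2 margin falls below the threshold,
    i.e. the model is uncertain about the next word.
    """
    top = sorted(probs, reverse=True)
    margin = top[0] - (top[1] if len(top) > 1 else 0.0)
    return margin < margin_threshold

def entropy(probs):
    """Shannon entropy in nats; an equivalent uncertainty criterion."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Confident distribution: large margin, no branching.
print(should_branch([0.9, 0.05, 0.05]))   # → False
# Flat distribution: small margin, branch into batch drafting.
print(should_branch([0.34, 0.33, 0.33]))  # → True
```

Either criterion can serve as the trigger; the margin is cheaper, while entropy accounts for the full distribution tail.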
2.3 Cross-model Fusion
Upon batch formation, a pool S of all candidate word spans from all models is assembled. For each candidate span s, compute its length-normalized negative log-likelihood (NLL) under each model,

NLL_{m_k}(s) = −(1/|s|) ∑_{t=1}^{|s|} log p_{m_k}(s_t | y ∥ s_{<t}),

and aggregate across models by averaging:

F(s) = (1/K) ∑_{k=1}^{K} NLL_{m_k}(s).

The candidate s* = argmin_s F(s) is committed to the output.
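The fusion score can be sketched in a few lines. The per-token probabilities below are hand-specified stand-ins for real model outputs (hypothetical numbers, for illustration only):

```python
import math

def avg_nll(candidate_token_probs):
    """Length-normalized NLL of one candidate span under one model.

    candidate_token_probs: per-token probabilities the model assigns
    to the span's tokens, e.g. [0.5, 0.25].
    """
    return -sum(math.log(p) for p in candidate_token_probs) / len(candidate_token_probs)

def fuse(candidates, per_model_probs):
    """Pick the candidate with the lowest model-averaged NLL.

    per_model_probs[k][c] holds model k's token probabilities for candidate c.
    """
    K = len(per_model_probs)
    scores = {}
    for c in candidates:
        scores[c] = sum(avg_nll(per_model_probs[k][c]) for k in range(K)) / K
    return min(scores, key=scores.get), scores

best, scores = fuse(
    ["cat", "dog"],
    [  # two models score each candidate's tokens
        {"cat": [0.6], "dog": [0.3]},
        {"cat": [0.5], "dog": [0.4]},
    ],
)
print(best)  # → cat
```

Length normalization keeps longer spans from being penalized merely for containing more tokens.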
2.4 Pseudocode Summary
A representative pseudocode for AdaFuse's implementation:

    Input: prompt P, models {m_k}_{k=1}^K, margin-threshold τ_Δ,
           max-words-per-round M, branching-factor B
    y ← P
    while not end-of-sequence(y):
        S ← ∅
        for k in 1…K:
            s_k ← []; c ← 0; branched ← false
            while c < M:
                Δ ← compute_margin(p_{m_k}(· | y ∥ s_k))
                if Δ ≥ τ_Δ:                               # confident: commit greedy word
                    s_k.append(GenWordGreedy(m_k, y ∥ s_k)); c ← c + 1
                else:                                     # uncertain: branch into B candidates
                    {w^{(b)}}_{b=1}^B ← GenWordBatch(m_k, y ∥ s_k, B)
                    S ← S ∪ {s_k ∥ w^{(b)} : b = 1…B}     # one pooled span per branched word
                    branched ← true; break
            if not branched:
                S ← S ∪ {s_k}
        s* ← argmin_{s∈S} (1/K) ∑_{k=1}^K NLL_{m_k}(s)    # cross-model fusion
        y ← y ∥ s*
    return y
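To make the round structure concrete, here is a runnable toy version of one drafting round. The mock "models" are hand-specified next-word tables (hypothetical numbers chosen so the two models disagree after "the"); everything else follows the greedy-commit / branch / fuse pattern above:

```python
import math

# Hand-specified next-word distributions standing in for two real LMs.
TABLE_A = {(): {"the": 0.8, "a": 0.2}, ("the",): {"cat": 0.5, "dog": 0.45, "rat": 0.05}}
TABLE_B = {(): {"the": 0.7, "a": 0.3}, ("the",): {"dog": 0.6, "cat": 0.4}}
model_a = lambda prefix: TABLE_A.get(prefix, {"<eos>": 1.0})
model_b = lambda prefix: TABLE_B.get(prefix, {"<eos>": 1.0})

def gen_round(models, y, M=3, B=2, tau=0.3):
    """One drafting round: greedy commits while the margin stays >= tau,
    a B-way branch once it drops, then model-averaged NLL fusion."""
    pool = []
    for m in models:
        span, branched = [], False
        for _ in range(M):
            dist = m(tuple(y + span))
            ranked = sorted(dist, key=dist.get, reverse=True)
            margin = dist[ranked[0]] - (dist[ranked[1]] if len(ranked) > 1 else 0.0)
            if margin >= tau:
                span.append(ranked[0])                       # low uncertainty: commit greedily
            else:
                pool.extend(span + [w] for w in ranked[:B])  # branch into B candidate spans
                branched = True
                break
        if not branched and span:
            pool.append(span)
    def fused_nll(s):                                        # cross-model fusion score
        per_model = []
        for m in models:
            nll = -sum(math.log(m(tuple(y + s[:i])).get(w, 1e-9))
                       for i, w in enumerate(s))
            per_model.append(nll / len(s))
        return sum(per_model) / len(per_model)
    return min(pool, key=fused_nll)

print(gen_round([model_a, model_b], []))  # → ['the', 'dog']
```

Both models confidently commit "the", then branch on the cat/dog split; fusion prefers "dog" because its averaged NLL across the two models is lower.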
3. Test-Time Weighting, Adaptation, and Scenario Diversity
TABED generalizes to settings where ensembling must adapt to prompt variants and input modalities, as exemplified in LVLMs. For speculative decoding with various draft strategies (multimodal, text-only, caption-based), multiple candidate logits are generated in a single batch (Lee et al., 28 Jan 2026). The ensembled predictive distribution at token t is

p_ens(x_t | x_{<t}) = ∑_j w_j p_j(x_t | x_{<t}),

where the weights w_j (one per draft strategy j) are adaptively re-estimated online by minimizing the KL divergence to the verified target model's "soft labels" over a moving window of recent decoding steps.
TABED operates in a plug-and-play manner—requiring no retraining or fine-tuning—by utilizing past accepted tokens at each step to readjust ensemble weights using a divergence-based criterion. Batch inference across prompt variants is enabled through parameter sharing, yielding negligible computational overhead when the draft model is much smaller than the main model.
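A minimal sketch of the online re-weighting step follows. The softmax over negative mean KL is a simplified stand-in for the paper's KL-minimizing update, and all names and numbers are hypothetical:

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p ‖ q) for two discrete distributions of the same length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def reweight(target_hist, draft_hists, temperature=1.0):
    """Re-estimate ensemble weights from a window of recent steps.

    target_hist: target-model soft-label distributions (one per step).
    draft_hists[j]: the j-th draft strategy's distributions over the same steps.
    A softmax over negative mean KL is used here as a simplified heuristic,
    not the paper's exact update rule.
    """
    mean_kls = [
        sum(kl(t, d) for t, d in zip(target_hist, hists)) / len(target_hist)
        for hists in draft_hists
    ]
    exps = [math.exp(-k / temperature) for k in mean_kls]
    z = sum(exps)
    return [e / z for e in exps]

# Draft 0 tracks the target closely; draft 1 does not, so draft 0 gains weight.
target = [[0.7, 0.3], [0.6, 0.4]]
drafts = [[[0.68, 0.32], [0.62, 0.38]],   # close to target
          [[0.2, 0.8], [0.1, 0.9]]]       # far from target
w = reweight(target, drafts)
print(w[0] > w[1])  # → True
```

Because only accepted tokens and already-computed logits are consumed, an update of this shape adds negligible overhead on top of speculative decoding.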
4. Theoretical Properties and Optimality Connections
4.1 Best-of-N and Best-of-∞ Interpretation
TABED also provides a framework for approximating "best-of-N" and "best-of-∞" performance in LLMs (Komiyama et al., 25 Sep 2025). For an LLM with output answer distribution p(a | q) on a problem q:
- The majority-vote accuracy of best-of-N draws is the probability that the gold answer a* receives a plurality among N i.i.d. samples from p(· | q), where p* = p(a* | q) denotes the probability of the gold answer a*.
- In the limit N → ∞, majority vote almost surely returns argmax_a p(a | q) (law of large numbers), so the maximum achievable accuracy is the fraction of problems on which the gold answer is the mode of p(· | q).
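The best-of-N majority-vote accuracy can be estimated by simulation. This is a toy Monte-Carlo sketch over a hand-picked answer distribution (hypothetical numbers):

```python
import random
from collections import Counter

def majority_vote_accuracy(answer_probs, gold, n_samples, trials=20000, seed=0):
    """Monte-Carlo estimate of best-of-N majority-vote accuracy.

    answer_probs: dict answer → probability under the model.
    gold: the correct answer. Ties are broken uniformly at random.
    """
    rng = random.Random(seed)
    answers = list(answer_probs)
    weights = [answer_probs[a] for a in answers]
    wins = 0
    for _ in range(trials):
        draws = rng.choices(answers, weights=weights, k=n_samples)
        counts = Counter(draws)
        top = max(counts.values())
        leaders = [a for a, c in counts.items() if c == top]
        if rng.choice(leaders) == gold:
            wins += 1
    return wins / trials

# With the gold answer at p* = 0.4, larger than every competitor, accuracy
# grows with N toward 1 (the best-of-∞ limit).
p = {"A": 0.4, "B": 0.3, "C": 0.3}
acc1 = majority_vote_accuracy(p, "A", 1)
acc101 = majority_vote_accuracy(p, "A", 101)
print(acc1 < acc101)  # → True
```

Conversely, when the gold answer is not the mode of p(· | q), accuracy decays toward 0 as N grows, which is exactly the bound the limit statement describes.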
4.2 Adaptive Stopping via Bayesian Evidence
TABED recasts sample allocation as an adaptive process: samples are generated sequentially until the Bayes-factor evidence (BF) exceeds a threshold, signifying high probability that the current empirical majority reflects the true model majority. This allows reallocation of compute, focusing more samples where uncertainty is high and stopping early when the output distribution is concentrated.
A Dirichlet-process prior on the possible answers to each problem leads to stopping rules that are asymptotically consistent, almost surely returning the best-of-∞ answer as the stopping parameters grow (Komiyama et al., 25 Sep 2025).
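A stopping rule of this shape can be sketched with a symmetric Dirichlet prior: sample until the posterior probability that the current empirical leader is the true mode clears a threshold. This is a simplified stand-in for the paper's Bayes-factor rule, with all parameters hypothetical:

```python
import random

def adaptive_stop(sample_answer, max_samples=200, threshold=0.95,
                  posterior_draws=300, alpha=1.0, seed=0):
    """Draw samples until the posterior says the empirical leader is the
    true mode with high probability (symmetric Dirichlet(alpha) prior).

    sample_answer: a zero-arg callable returning one sampled answer.
    Returns (leader, number_of_samples_used).
    """
    rng = random.Random(seed)
    counts = {}
    for n in range(1, max_samples + 1):
        a = sample_answer()
        counts[a] = counts.get(a, 0) + 1
        if n < 5:
            continue  # too few samples to judge
        answers = list(counts)
        leader = max(answers, key=counts.get)
        wins = 0
        for _ in range(posterior_draws):
            # Dirichlet posterior draw via Gammas (argmax needs no normalizing).
            g = {a2: rng.gammavariate(alpha + counts[a2], 1.0) for a2 in answers}
            if max(g, key=g.get) == leader:
                wins += 1
        if wins / posterior_draws >= threshold:
            return leader, n
    return max(counts, key=counts.get), max_samples

# A concentrated answer distribution stops early, saving compute for hard cases.
model = random.Random(1)
leader, used = adaptive_stop(lambda: model.choices("AB", weights=[0.9, 0.1])[0])
print(leader, used < 200)  # → A True
```

The key behavior is the budget asymmetry: concentrated distributions terminate after a handful of samples, while flat distributions keep drawing until the evidence accumulates or the budget is exhausted.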
4.3 Weighted Model Ensembles
TABED further extends to optimal test-time weighted ensembling across multiple models. The optimal weight vector w* is found by maximizing the number of correctly majority-voted answers (as a mixed-integer linear program), exploiting the polyhedral structure of the simplex constraints and the answer indicator functions.
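The objective can be made concrete with a cheap stand-in for the MILP: a grid search over the weight simplex for three models (the votes and problems below are hypothetical toy data, not from the paper):

```python
from itertools import product

def weighted_majority_correct(weights, votes, gold):
    """Check whether weighted voting over models picks the gold answer.

    votes[k] is model k's answer for this problem; each model contributes
    its weight to its answer, and the heaviest answer wins.
    """
    tally = {}
    for w, a in zip(weights, votes):
        tally[a] = tally.get(a, 0.0) + w
    return max(tally, key=tally.get) == gold

def best_weights(problems, step=0.1):
    """Grid search over the 3-model simplex (a cheap stand-in for the MILP;
    resolution is controlled by `step`).

    problems: list of (votes, gold) pairs.
    """
    best, best_score = None, -1
    steps = int(round(1 / step))
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        w = (i * step, j * step, 1 - (i + j) * step)
        score = sum(weighted_majority_correct(w, v, g) for v, g in problems)
        if score > best_score:
            best, best_score = w, score
    return best, best_score

# Model 2 is right wherever models 0 and 1 disagree; the search upweights it.
problems = [(("A", "B", "A"), "A"), (("B", "B", "A"), "A"), (("C", "A", "A"), "A")]
w, score = best_weights(problems)
print(score)  # → 3
```

The MILP solves the same maximization exactly and scales to many models, where grid search becomes infeasible; this sketch only illustrates the objective being optimized.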
5. Empirical Results and Performance Characteristics
5.1 LLM Inference
For LLM generative tasks (open-domain question answering, arithmetic reasoning, machine translation), AdaFuse’s TABED yields an average relative improvement of 6.88% over the best non-adaptive ensemble baseline (SweetSpan). Detailed absolute gains include:
| Benchmark | AdaFuse TABED | Baseline | Relative Gain (%) |
|---|---|---|---|
| NaturalQuestions | 42.85% | 38.95% | +10.01 |
| SQuAD | 90.15% | 86.58% | +4.12 |
| TriviaQA | 82.17% | 81.38% | +0.97 |
| GSM8K | 79.15% | 67.63% | +17.03 |
| Flores En→De (spBLEU) | 39.83 | 37.56 | +6.04 |
| Flores De→En (spBLEU) | 45.25 | 42.85 | +5.60 |
Runtime is comparable to light-weight token-level methods (e.g., UniTE), and substantially faster than beam- or span-based ensembling (Cui et al., 9 Jan 2026).
5.2 LVLMs and Speculative Decoding
In LVLMs, TABED yields an average 4–5% gain in block efficiency (accepted draft tokens per verification step) and a wall-time speedup of 1.74× over standard autoregressive decoding. Adaptive selection across multimodal, text-only, and alternate prompt drafts ensures appropriate sample allocation per scenario. Integrating token-tree verification and additional drafting candidates (caption, pooled multimodal) further increases efficiency and robustness (Lee et al., 28 Jan 2026).
5.3 Approximation to Best-of-∞
Empirical results on heavy reasoning benchmarks (AIME2024/5, GPQA-Diamond, MATH500) indicate that TABED with adaptive stopping matches the accuracy of high-N best-of-N selection while requiring 2–5× fewer samples. Weighted model ensembles via MILP further lift asymptotic accuracy over the best single LLM baseline, with improvements from 90% to 93.3% reported on AIME2025 (Komiyama et al., 25 Sep 2025).
6. Extensions and Implementational Variants
TABED’s framework extends to:
- Parameter-free adaptation to input scenario changes (e.g., multi-turn LVLM dialogs).
- Plug-and-play interface with existing speculative decoding pipelines (requiring ~200 lines of code).
- Integration with advanced verification strategies (e.g., token-tree speculative verification), increasing block efficiency (τ rises from ≈2.3 to as high as 3.39).
- Diverse prompt augmentations (caption, pooled, etc.) evaluated adaptively within the batched ensemble (Lee et al., 28 Jan 2026).
In all settings, the cost of adaptive batched search and ensemble scoring remains modest, dominated by the cost of verifying candidates in the main model.
7. Significance, Interpretative Considerations, and Limitations
TABED constitutes a unified, principled approach to efficient test-time compute allocation for LLMs and LVLMs. Its design ensures that expensive ensembling is invoked only when needed—under uncertainty, scenario shift, or detected multi-modality—resulting in substantial robustness improvements with negligible or moderate extra inference cost. The theoretical foundations connect adaptive inference to asymptotic majoritarian selection, optimal mixture-of-experts allocation, and Bayesian sample efficiency.
A plausible implication is that TABED can serve as a design template for scalable inference systems requiring both reliability and adaptability, especially in deployment scenarios with highly heterogeneous input distributions. A possible limitation is that the effectiveness of the uncertainty criteria and batch granularity thresholds depends on the calibration properties of the candidate models’ predictive distributions.
Overall, TABED operationalizes adaptive ensemble drafting and test-time model selection, achieving near-optimal tradeoffs between accuracy, robustness, and runtime by leveraging uncertainty-aware, scenario-adaptive, and batch-synchronized inference mechanisms (Cui et al., 9 Jan 2026, Lee et al., 28 Jan 2026, Komiyama et al., 25 Sep 2025).