Asymptotically Optimal Sequential Testing with Heterogeneous LLMs

Published 1 Apr 2026 in cs.DS, cs.IT, and math.ST | (2604.01086v2)

Abstract: We study a Bayesian binary sequential hypothesis testing problem with multiple LLMs. Each LLM $j$ has per-query cost $c_j>0$, random waiting time with mean $\mu_j>0$ and sub-Gaussian tails, and asymmetric accuracies: the probability of returning the correct label depends on the true hypothesis $\theta\in\{A,B\}$ and need not be the same under $A$ and $B$. This asymmetry induces two distinct information rates $(I_{j,A}, I_{j,B})$ per LLM, one under each hypothesis. The decision-maker chooses LLMs sequentially, observes their noisy binary answers, and stops when the posterior probability of one hypothesis exceeds $1-\alpha$. The objective is to minimize the sum of expected query cost and expected waiting cost, $\mathbb{E}[C_\pi] + \mathbb{E}[g(W_\pi)]$, where $C_\pi$ is the total query cost, $W_\pi$ is the total waiting time, and $g$ is a polynomial function (e.g., $g(x)=x^\rho$ with $\rho\ge 1$). We prove that as the error tolerance $\alpha\to 0$, the optimal policy is asymptotically equivalent to one that uses at most two LLMs. In this case, a single-LLM policy is not generically optimal: optimality now requires exploiting a two-dimensional tradeoff between information under $A$ and information under $B$. Any admissible policy induces an expected information-allocation vector in $\mathbb{R}_+^2$, and we show that the optimal allocation lies at an extreme point of the associated convex set when $\alpha$ is relatively small, and hence uses at most two LLMs. We construct belief-dependent policies that first mix between two LLMs when the posterior is ambiguous, and then switch to a single "specialist" LLM when the posterior is sufficiently close to one of the hypotheses. These policies match the universal lower bound up to a $(1+o(1))$ factor as $\alpha\to 0$.


Authors (11)

Summary

The paper introduces a Bayesian sequential testing framework that optimally orchestrates heterogeneous LLMs for binary hypothesis evaluation.
It develops a convex optimization model showing that at most two LLMs are needed to balance cost, latency, and accuracy in the high-confidence limit.
The method integrates cost, latency, and error asymmetries to deliver scalable, real-time orchestration applicable to systems like content moderation and fraud detection.


Problem Formulation and Motivation

The paper develops a Bayesian binary hypothesis testing framework for sequentially orchestrating multiple heterogeneous LLMs as information sources. Each LLM is parameterized by a distinct per-query cost, a sub-Gaussian response time (latency), and asymmetric accuracies for the two hypotheses, leading to the core quantities $(I_{j,A}, I_{j,B})$, the expected information gain under each hypothesis for model $j$. This setup directly models practical settings such as test-time orchestration of diverse LLM APIs, layered content moderation, and fraud detection pipelines, where queries can be escalated among tools with varied reliability, cost, and latency profiles.

Classical literature on sequential testing (e.g., Wald’s SPRT) or adaptive experiment design (e.g., Chernoff’s or Naghshvar’s formulations) either assumes a homogeneous information source or does not address the full operational regime encountered in modern AI systems: source-specific monetary costs, random and nontrivial latency, and significantly asymmetric error profiles. The authors motivate the need to transcend one-shot model selection—replacing it with online sequential orchestration—by highlighting that real-world systems (e.g., OpenAI’s GPT-n routers, payment-fraud layers, agentic LLM pipelines) measure efficiency by balancing correctness, cost, and latency.

Figure 1: Left: Output speed distribution for leading LLMs; Right: Intelligence Index for LLMs on standard benchmarks, showing pronounced heterogeneity relevant for orchestration policies.

Sequential Policy Structure and Theoretical Contributions

At each step the agent selects an LLM $j_t$, observes its noisy binary response, and updates the cumulative log-likelihood ratio (LLR); it stops once the posterior probability of either hypothesis exceeds $1-\alpha$. The overall risk combines cumulative monetary expenditure with a convex penalty on total waiting time (e.g., polynomial in total latency), modeling settings with strict delay constraints.

Information Rates and Log-Likelihood Dynamics

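Denote priors on $\theta\in\{A,B\}$ by $(\xi_A,\xi_B)$, and LLM $j$'s accuracies by $(\gamma_{j,A},\gamma_{j,B})$, where $\gamma_{j,\theta}$ is the probability that LLM $j$ returns the correct label under hypothesis $\theta$.
For selection history $j_{1:t}$ and outcomes $Y_{1:t}$, the LLR of $A$ against $B$ is $L_t=\sum_{s=1}^{t}\ell_{j_s}(Y_s)$, with per-model increments $\ell_j(A)=\log\frac{\gamma_{j,A}}{1-\gamma_{j,B}}$ and $\ell_j(B)=\log\frac{1-\gamma_{j,A}}{\gamma_{j,B}}$ whose distribution varies by ground truth, as LLM diagnostics are asymmetric.
The drift toward the true hypothesis for model $j$ is $I_{j,A}=\gamma_{j,A}\log\frac{\gamma_{j,A}}{1-\gamma_{j,B}}+(1-\gamma_{j,A})\log\frac{1-\gamma_{j,A}}{\gamma_{j,B}}$ under $A$ and $I_{j,B}=\gamma_{j,B}\log\frac{\gamma_{j,B}}{1-\gamma_{j,A}}+(1-\gamma_{j,B})\log\frac{1-\gamma_{j,B}}{\gamma_{j,A}}$ under $B$. These provide a two-dimensional information efficiency profile, crucial for sequential allocation.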
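As a minimal sketch of these quantities, the snippet below computes the two LLR increments and the two information rates for one model from its accuracies. It assumes the standard binary-answer likelihood model (an answer is either "A" or "B"), and the function names `llr_increments` and `information_rates` are illustrative, not taken from the paper.

```python
import math


def llr_increments(gamma_a, gamma_b):
    """Return the LLR increments (A against B) for answers 'A' and 'B'.

    gamma_a = P(answer 'A' | truth A), gamma_b = P(answer 'B' | truth B).
    Both must lie strictly between 0 and 1.
    """
    inc_if_a = math.log(gamma_a / (1.0 - gamma_b))
    inc_if_b = math.log((1.0 - gamma_a) / gamma_b)
    return inc_if_a, inc_if_b


def information_rates(gamma_a, gamma_b):
    """Return (I_A, I_B): expected LLR movement toward the true hypothesis."""
    inc_if_a, inc_if_b = llr_increments(gamma_a, gamma_b)
    # Under A the model answers 'A' with probability gamma_a.
    rate_a = gamma_a * inc_if_a + (1.0 - gamma_a) * inc_if_b
    # Under B the LLR drifts downward; negate to report a positive rate.
    rate_b = -((1.0 - gamma_b) * inc_if_a + gamma_b * inc_if_b)
    return rate_a, rate_b


if __name__ == "__main__":
    print(information_rates(0.8, 0.8))  # symmetric model: equal rates
    print(information_rates(0.9, 0.6))  # asymmetric model: rates differ
```

Note that a model with the higher accuracy under $A$ does not necessarily have the higher information rate under $A$: with accuracies $(0.9, 0.6)$ the rate under $B$ is the larger of the two, because a "B" answer from such a model is rare under $A$ and therefore very informative.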

Universal Lower Bound via Convex Information Allocation

The paper’s first technical result is a universal lower bound on attainable risk for any admissible policy: asymptotic risk as $\alpha\to 0$ is governed by a deterministic convex allocation problem over the available LLMs, subject to information budget constraints for both hypotheses. The minimal expected cost/penalty achievable is the value of a convex program whose decision variables are the expected numbers of queries to each LLM $j$ under each hypothesis and whose objective encodes total expected cost and delay.

The core insight: this convex program’s structure admits an optimal solution using at most two nonzero allocations, corresponding to at most two distinct LLMs.
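To see why two information constraints lead to at most two active sources, consider a deliberately simplified stand-in for the paper's program: a single non-adaptive allocation with a linear per-query weight. This is an assumption made for illustration, not the program from the paper. A linear program with two inequality constraints attains its optimum at a vertex with at most two nonzero variables, so it suffices to enumerate single models and pairs. The function name `best_sparse_allocation` and all numbers are hypothetical.

```python
from itertools import combinations


def best_sparse_allocation(weights, info_a, info_b, budget_a, budget_b):
    """Minimize sum_j weights[j] * n[j] subject to
    sum_j n[j] * info_a[j] >= budget_a,
    sum_j n[j] * info_b[j] >= budget_b, and n >= 0,
    by enumerating the vertices of the feasible region.
    All weights and information rates are assumed strictly positive.
    """
    best_cost, best_alloc = float("inf"), {}
    k = len(weights)
    # Single-model vertices: one model covers both budgets on its own.
    for j in range(k):
        n = max(budget_a / info_a[j], budget_b / info_b[j])
        if weights[j] * n < best_cost:
            best_cost, best_alloc = weights[j] * n, {j: n}
    # Two-model vertices: both information constraints hold with equality.
    for i, j in combinations(range(k), 2):
        det = info_a[i] * info_b[j] - info_a[j] * info_b[i]
        if abs(det) < 1e-12:
            continue  # proportional rate vectors, no unique intersection
        n_i = (budget_a * info_b[j] - budget_b * info_a[j]) / det
        n_j = (info_a[i] * budget_b - info_b[i] * budget_a) / det
        if n_i < 0 or n_j < 0:
            continue  # infeasible vertex
        cost = weights[i] * n_i + weights[j] * n_j
        if cost < best_cost - 1e-12:
            best_cost, best_alloc = cost, {i: n_i, j: n_j}
    return best_cost, best_alloc


if __name__ == "__main__":
    # Hypothetical pool: model 0 is strong under A, model 1 under B,
    # model 2 is a balanced generalist; all have unit weight.
    cost, alloc = best_sparse_allocation(
        weights=[1.0, 1.0, 1.0],
        info_a=[1.0, 0.2, 0.5],
        info_b=[0.2, 1.0, 0.5],
        budget_a=10.0,
        budget_b=10.0,
    )
    print(cost, alloc)
```

In this toy instance the best single model costs 20, while mixing the two complementary models costs 50/3, which mirrors the claim that a single-LLM policy is not generically optimal.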

Policy Class and Asymptotically Optimal Structure

The principal positive result identifies a sign-based policy: the agent tracks the cumulative LLR and selects the “$A$-specialist” when belief favors $A$ (LLR $>0$) and the “$B$-specialist” when it favors $B$. Upon entering the high-confidence region (LLR crossing a threshold), the policy effectively acts as a single-source SPRT with the optimal specialist, minimizing additional cost. Switching between two sources is necessary in generic settings due to information asymmetry and cost heterogeneity. The policy’s asymptotic risk matches the universal lower bound up to a $(1+o(1))$ factor as $\alpha\to 0$ under a polynomial latency penalty such as $g(x)=x^\rho$, and the remainder term is described as minimax optimal among all admissible policies.
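The sign-based rule can be sketched as a short simulation. This is a hedged illustration rather than the authors' implementation: it tracks posterior log-odds (prior included), treats latency as deterministic at its mean, breaks the tie at zero log-odds in favor of the B-specialist, and takes the two specialist indices as given. The function name `run_sign_policy` and the dictionary keys are hypothetical.

```python
import math
import random


def run_sign_policy(models, spec_a, spec_b, truth, alpha, prior_a=0.5, rng=None):
    """Simulate one run; return (decision, queries, cost, wait, log_odds).

    models: list of dicts with keys gamma_a, gamma_b, cost, mean_wait.
    spec_a / spec_b: indices of the models queried when belief favors A / B.
    """
    if rng is None:
        rng = random.Random(0)
    log_odds = math.log(prior_a / (1.0 - prior_a))  # posterior log-odds, A vs B
    threshold = math.log((1.0 - alpha) / alpha)     # posterior >= 1 - alpha
    cost = wait = 0.0
    queries = 0
    while abs(log_odds) < threshold:
        # Sign rule: query the specialist for the currently favored hypothesis.
        m = models[spec_a if log_odds > 0 else spec_b]
        p_correct = m["gamma_a"] if truth == "A" else m["gamma_b"]
        correct = rng.random() < p_correct
        answer = truth if correct else ("B" if truth == "A" else "A")
        if answer == "A":
            log_odds += math.log(m["gamma_a"] / (1.0 - m["gamma_b"]))
        else:
            log_odds += math.log((1.0 - m["gamma_a"]) / m["gamma_b"])
        cost += m["cost"]
        wait += m["mean_wait"]  # assumption: latency fixed at its mean
        queries += 1
    decision = "A" if log_odds > 0 else "B"
    return decision, queries, cost, wait, log_odds


if __name__ == "__main__":
    pool = [
        {"gamma_a": 0.9, "gamma_b": 0.6, "cost": 1.0, "mean_wait": 1.0},
        {"gamma_a": 0.6, "gamma_b": 0.9, "cost": 1.0, "mean_wait": 1.0},
    ]
    # With equal costs, model 1 has the larger rate under A and model 0 under B.
    print(run_sign_policy(pool, spec_a=1, spec_b=0, truth="A", alpha=0.01))
```

A waiting penalty $g$ would then be applied to the returned total wait, for example `wait ** rho`.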

Detailed Mathematical and Algorithmic Insights

Key Mathematical Developments

LLR Threshold Structure: Stopping is characterized by the cumulative LLR of $A$ against $B$ (computed from the observed answers) rising above $\log\frac{1-\alpha}{\alpha}-\log\frac{\xi_A}{\xi_B}$ or falling below $-\log\frac{1-\alpha}{\alpha}-\log\frac{\xi_A}{\xi_B}$, so the thresholds depend on the prior odds and the error tolerance.
Information Budget Constraints: All admissible policies must collect at least about $\log(1/\alpha)$ expected information under $A$, and likewise under $B$, up to lower-order corrections that account for overshoot.
Sparse Allocation Property: Via KKT analysis of the convex program, the authors prove that for any cost-risk surface defined by a monotone convex $g$, optimal allocations place nonzero weight on at most one LLM for each hypothesis; under non-degeneracy (no exact resource ties), pure specialist policies are strictly optimal.
Martingale and Concentration Analysis: Fine-grained concentration, including Freedman-type inequalities on LLR martingale parts, is used to control the random overshoot and additional cost induced by switching or uncertain stopping times.
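The threshold structure in the first item follows directly from Bayes' rule, and can be checked numerically. The sketch below assumes the LLR is the likelihood-only sum (prior excluded), so the prior odds shift both thresholds; the helper names `llr_thresholds` and `posterior_a` are illustrative.

```python
import math


def llr_thresholds(alpha, prior_a):
    """Return (upper, lower) thresholds on the likelihood-only LLR of A vs B.

    Crossing `upper` means P(A | data) >= 1 - alpha;
    crossing `lower` means P(B | data) >= 1 - alpha.
    """
    confidence = math.log((1.0 - alpha) / alpha)
    prior_log_odds = math.log(prior_a / (1.0 - prior_a))
    return confidence - prior_log_odds, -confidence - prior_log_odds


def posterior_a(llr, prior_a):
    """Posterior probability of A given the likelihood-only LLR."""
    odds = prior_a / (1.0 - prior_a) * math.exp(llr)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    upper, lower = llr_thresholds(alpha=0.01, prior_a=0.9)
    print(upper, posterior_a(upper, 0.9))
    print(lower, posterior_a(lower, 0.9))
```

A prior that already favors $A$ lowers the evidence needed to declare $A$ and raises the evidence needed to declare $B$ by the same amount.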

Consequences for LLM System Design

Only Two Sources Needed: For an asymptotically optimal system, at most two LLM APIs need to be orchestrated, regardless of the size of the available pool. The optimal specialists may differ for the $A$ and $B$ directions, capturing cost and information asymmetry.
Practical Decision Guidance: The search for optimal orchestration reduces to identifying, for each possible true hypothesis, which LLM offers the best cost-adjusted information rate. Interleaving is only needed in the “ambiguous region” near the prior.
Scalability: The structural result demonstrates that system complexity need not scale with the pool of available LLMs—avoiding the need for complex routing trees or dynamic ensemble selection.
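The decision guidance above can be sketched as a simple ranking. The scoring rule here is an assumption for illustration: it uses a linear waiting penalty with weight `wait_weight`, so the effective per-query price is `cost + wait_weight * mean_wait`, and ranks models by information rate per unit of that price. The function name `pick_specialists` and the pool values are hypothetical.

```python
import math


def information_rates(gamma_a, gamma_b):
    """Return (I_A, I_B) for a binary-answer model with given accuracies."""
    inc_if_a = math.log(gamma_a / (1.0 - gamma_b))
    inc_if_b = math.log((1.0 - gamma_a) / gamma_b)
    rate_a = gamma_a * inc_if_a + (1.0 - gamma_a) * inc_if_b
    rate_b = -((1.0 - gamma_b) * inc_if_a + gamma_b * inc_if_b)
    return rate_a, rate_b


def pick_specialists(models, wait_weight):
    """Return {'A': index, 'B': index} of the best cost-adjusted model."""
    best = {}
    for hyp_index, hyp in enumerate(("A", "B")):
        def score(m):
            rate = information_rates(m["gamma_a"], m["gamma_b"])[hyp_index]
            # Assumed linear price of one query: money plus weighted latency.
            return rate / (m["cost"] + wait_weight * m["mean_wait"])
        best[hyp] = max(range(len(models)), key=lambda j: score(models[j]))
    return best


if __name__ == "__main__":
    pool = [
        {"gamma_a": 0.9, "gamma_b": 0.6, "cost": 1.0, "mean_wait": 1.0},
        {"gamma_a": 0.6, "gamma_b": 0.9, "cost": 1.0, "mean_wait": 1.0},
        {"gamma_a": 0.95, "gamma_b": 0.95, "cost": 5.0, "mean_wait": 4.0},
    ]
    print(pick_specialists(pool, wait_weight=0.5))
```

In this hypothetical pool the most accurate model is never selected, because its information rate does not compensate for its price.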

Numerical and Operational Implications

The cost-risk scaling is controlled by the specified error tolerance $\alpha$. If the waiting time penalty $g$ is linear, the expected cost grows on the order of $\log(1/\alpha)$ in the high-confidence regime, matching the information-theoretic optimum, even in the presence of nontrivial random (sub-Gaussian) latencies. For higher exponents, e.g., quadratic costs, the dominant asymptotic term becomes of order $\log^2(1/\alpha)$. The analysis covers the regime relevant for real-time decision platforms, such as content moderation systems, automated safety/fraud escalation, and multi-agent reasoning orchestrations.
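A back-of-the-envelope version of this scaling, for a single model, is sketched below. It is a leading-order heuristic under stated assumptions, not a result quoted from the paper: the number of queries is approximated by $\log(1/\alpha)$ divided by the information rate, latency is replaced by its mean, and the penalty is $g(x)=x^\rho$. The function name `leading_order_cost` is hypothetical.

```python
import math


def leading_order_cost(alpha, info_rate, cost, mean_wait, rho):
    """Heuristic leading-order estimate for one model.

    Returns (expected_queries, query_cost, wait_penalty), where
    expected_queries ~ log(1/alpha) / info_rate and the waiting penalty
    is (mean_wait * expected_queries) ** rho.
    """
    expected_queries = math.log(1.0 / alpha) / info_rate
    query_cost = cost * expected_queries
    wait_penalty = (mean_wait * expected_queries) ** rho
    return expected_queries, query_cost, wait_penalty


if __name__ == "__main__":
    for alpha in (1e-3, 1e-6):
        print(alpha, leading_order_cost(alpha, info_rate=0.5, cost=1.0,
                                        mean_wait=2.0, rho=2))
```

Squaring the tolerance (from $10^{-3}$ to $10^{-6}$) doubles $\log(1/\alpha)$, so under this heuristic the query cost doubles while a quadratic waiting penalty quadruples.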

Broader Landscape, Contrasts, and Extensions

Contrast with Prior LLM Test-Time Compute Work: Earlier studies emphasize empirical strategies (sampling, debate, verification, mixture-of-experts) but lack structural guarantees for multi-LLM sequential allocation under operational constraints [snell2024scaling, wu2024inference]. The present paper offers both a sharp theoretical framework and actionable scheduling guidance.
Relation to Model Cascades and Routing: Previous literature on fixed model cascades or routing (e.g., FrugalGPT, RouteLLM) focuses on static or input-dependent one-shot assignments. The current sequential framework generalizes the setting to allow for interleaved querying with posterior-adaptive source selection and derives strong asymptotic optimality.
Connections to Operations Research for LLM Serving: The work complements parallel efforts in task-level and system-level scheduling for LLM inference (e.g., [ao2025optimizing, jaillet2025online]), by addressing the information-acquisition subproblem faced at test time by a single job or query instance.

Future Directions

The framework opens multiple avenues for generalization: extension to multi-hypothesis or structured output spaces, modeling fully continuous accuracy/cost/latency tradeoffs, and incorporating model-roster learning (where the statistical profile is not fully known a priori). Analyzing finite-sample nonasymptotics, adaptive or bandit settings with online source characteristic estimation, and robustification to adversarial (non-IID) response structures would advance the theoretical interface between sequential experiment design and practical LLM system operations.

Conclusion

This work establishes a comprehensive, information-theoretically sharp blueprint for sequential orchestration of heterogeneous LLMs in binary hypothesis testing. By dissecting the two-dimensional information-cost tradeoff and demonstrating that at most two LLM specialists are required for high-confidence optimality, the paper delivers both novel theoretical principles and powerful practical guidance for multi-model AI system design. The integration of martingale analysis, convex program allocations, and explicit operational constraints advances the state of the art in LLM deployment science.


References:

See (2604.01086) for full details. For related works: [snell2024scaling], [wu2024inference], [ao2025optimizing], and [huang2026optimal].
