Papers
Topics
Authors
Recent
Search
2000 character limit reached

Asymptotically Optimal Sequential Testing with Heterogeneous LLMs

Published 1 Apr 2026 in cs.DS, cs.IT, and math.ST | (2604.01086v2)

Abstract: We study a Bayesian binary sequential hypothesis testing problem with multiple LLMs. Each LLM $j$ has per-query cost $c_j>0$, random waiting time with mean $μj>0$ and sub-Gaussian tails, and \emph{asymmetric} accuracies: the probability of returning the correct label depends on the true hypothesis $θ\in{A,B}$ and needs not be the same under $A$ and $B$. This asymmetry induces two distinct information rates $(I{j,A}, I_{j,B})$ per LLM, one under each hypothesis. The decision-maker chooses LLMs sequentially, observes their noisy binary answers, and stops when the posterior probability of one hypothesis exceeds $1-α$. The objective is to minimize the sum of expected query cost and expected waiting cost, $\mathbb{E}[C_π] + \mathbb{E}[g(W_π)]$, where $C_π$ is the total query cost, $W_π$ is the total waiting time and $g$ is a polynomial function (e.g., $g(x)=xρ$ with $ρ\ge 1$). We prove that as the error tolerance $α\to0$, the optimal policy is asymptotically equivalent to one that uses at most two LLMs. In this case, a single-LLM policy is \emph{not} generically optimal: optimality now requires exploiting a two-dimensional tradeoff between information under $A$ and information under $B$. Any admissible policy induces an expected information-allocation vector in $\mathbb{R}_+2$, and we show that the optimal allocation lies at an extreme point of the associated convex set when $α$ is relatively small, and hence uses at most two LLMs. We construct belief-dependent policies that first mix between two LLMs when the posterior is ambiguous, and then switch to a single "specialist" LLM when the posterior is sufficiently close to one of the hypotheses. These policies match the universal lower bound up to a $(1+o(1))$ factor as $α\rightarrow 0$.

Summary

  • The paper introduces a Bayesian sequential testing framework that optimally orchestrates heterogeneous LLMs for binary hypothesis evaluation.
  • It develops a convex optimization model showing that only two specialized LLMs are sufficient to balance cost, latency, and accuracy under high-confidence settings.
  • The method integrates cost, latency, and error asymmetries to deliver scalable, real-time orchestration applicable to systems like content moderation and fraud detection.

Asymptotically Optimal Sequential Testing with Heterogeneous LLMs

Problem Formulation and Motivation

The paper develops a Bayesian binary hypothesis testing framework for sequentially orchestrating multiple heterogeneous LLMs as information sources. Each LLM is parameterized by distinct per-query cost, sub-Gaussian response time (latency), and asymmetric accuracies for the two hypotheses, leading to the core quantities: (Ij,A,Ij,B)(I_{j,A}, I_{j,B}), the expected information gain under each hypothesis for model jj. This setup directly models practical settings such as test-time orchestration of diverse LLM APIs, layered content moderation, and fraud detection pipelines, where queries can be escalated among tools with varied reliability, cost, and latency profiles.

Classical literature on sequential testing (e.g., Wald’s SPRT) or adaptive experiment design (e.g., Chernoff’s or Naghshvar’s formulations) either assumes a homogeneous information source or does not address the full operational regime encountered in modern AI systems: source-specific monetary costs, random and nontrivial latency, and significantly asymmetric error profiles. The authors motivate the need to transcend one-shot model selection—replacing it with online sequential orchestration—by highlighting that real-world systems (e.g., OpenAI’s GPT-n routers, payment-fraud layers, agentic LLM pipelines) measure efficiency by balancing correctness, cost, and latency. Figure 1

Figure 1

Figure 1: Left: Output speed distribution for leading LLMs; Right: Intelligence Index for LLMs on standard benchmarks, showing pronounced heterogeneity relevant for orchestration policies.

Sequential Policy Structure and Theoretical Contributions

The agent’s policy consists of: at each time, selecting an LLM jtj_t, observing its noisy binary response, and accumulating the log-likelihood ratio (LLR); stopping occurs once the posterior probability of either hypothesis exceeds 1α1-\alpha. The overall risk/cost includes both cumulative monetary expenditure and a convex-in-wait penalty (e.g., polynomial in total latency), modeling settings with strict delay constraints.

Information Rates and Log-Likelihood Dynamics

  • Denote priors on θ{A,B}\theta\in\{A,B\} by (ξA,ξB)(\xi_A,\xi_B), and LLM jj's accuracies by (γj,A,γj,B)(\gamma_{j,A},\gamma_{j,B}).
  • For selection history j1:tj_{1:t} and outcomes Y1:tY_{1:t}, the LLR is jj0, with per-model increments jj1 varying by ground truth, as LLM diagnostics are asymmetric.
  • The drift under each hypothesis for model jj2 is jj3 and jj4. These provide a two-dimensional information efficiency profile, crucial for sequential allocation.

Universal Lower Bound via Convex Information Allocation

The paper’s first technical result is a universal lower bound on attainable risk for any admissible policy: asymptotic risk as jj5 is governed by solving a deterministic convex allocation over the jj6 available LLMs to satisfy information budget constraints for both hypotheses. The minimal expected cost/penalty achievable is the value of the following program:

jj7

where jj8 are the expected queries to LLM jj9 under each hypothesis, and jtj_t0 encodes total expected cost and delay. The core insight: this convex program’s structure admits an optimal solution using at most two nonzero allocations—corresponding to at most two distinct LLMs.

Policy Class and Asymptotically Optimal Structure

The principal positive result identifies a sign-based policy jtj_t1: the agent tracks the cumulative LLR and selects the “jtj_t2-specialist” when belief favors jtj_t3 (LLR jtj_t4) and the “jtj_t5-specialist” when it favors jtj_t6. Upon entering the high-confidence region (LLR crossing a threshold), the policy effectively acts as a single-source SPRT with the optimal specialist, minimizing additional cost. Switching between two sources is necessary in generic settings due to information asymmetry and cost heterogeneity. The policy’s asymptotic risk matches the universal lower bound up to jtj_t7 for a latency cost penalty of order jtj_t8, with this remainder being minimax optimal among all admissible policies.

Detailed Mathematical and Algorithmic Insights

Key Mathematical Developments

  • LLR Threshold Structure: Stopping is characterized by LLR exceeding jtj_t9 or 1α1-\alpha0, with thresholds depending on prior odds and the error tolerance.
  • Information Budget Constraints: All admissible policies must collect at least 1α1-\alpha1 (resp. 1α1-\alpha2) expected information under 1α1-\alpha3 (resp. 1α1-\alpha4), up to 1α1-\alpha5 corrections factoring in overshoot.
  • Sparse Allocation Property: Via KKT analysis of the convex program, the authors prove that for any cost-risk surface defined by monotone convex 1α1-\alpha6, optimal allocations have at most one nonzero 1α1-\alpha7 and one 1α1-\alpha8 for each hypothesis; if there is non-degeneracy (no exact resource ties), pure specialist policies are strictly optimal.
  • Martingale and Concentration Analysis: Fine-grained concentration, including Freedman-type inequalities on LLR martingale parts, is used to control the random overshoot and additional cost induced by switching or uncertain stopping times.

Consequences for LLM System Design

  • Only Two Sources Needed: For an optimal system, only two LLM APIs need to be orchestrated, regardless of the size of the available pool. The optimal specialists may differ for 1α1-\alpha9 and θ{A,B}\theta\in\{A,B\}0 directions, capturing cost and information asymmetry.
  • Practical Decision Guidance: The search for optimal orchestration reduces to identifying, for each possible true hypothesis, which LLM offers the best cost-adjusted information rate. Interleaving is only needed in the “ambiguous region” near the prior.
  • Scalability: The structural result demonstrates that system complexity need not scale with the pool of available LLMs—avoiding the need for complex routing trees or dynamic ensemble selection.

Numerical and Operational Implications

The cost-risk scaling is controlled by the specified confidence θ{A,B}\theta\in\{A,B\}1. If the waiting time penalty θ{A,B}\theta\in\{A,B\}2 is polynomial (e.g., linear), the expected cost grows as θ{A,B}\theta\in\{A,B\}3 in the high-confidence regime—matching the information-theoretic optimum—even in the presence of nontrivial, heavy-tailed latencies. For higher exponents, e.g., quadratic costs, the dominant asymptotic term becomes θ{A,B}\theta\in\{A,B\}4. The analysis covers the regime relevant for real-time decision platforms, such as content moderation systems, automated safety/fraud escalation, and multi-agent reasoning orchestrations.

Broader Landscape, Contrasts, and Extensions

  • Contrast with Prior LLM Test-Time Compute Work: Earlier studies emphasize empirical strategies (sampling, debate, verification, mixture-of-experts) but lack structural guarantees for multi-LLM sequential allocation under operational constraints [snell2024scaling, wu2024inference]. The present paper offers both a sharp theoretical framework and actionable scheduling guidance.
  • Relation to Model Cascades and Routing: Previous literature on fixed model cascades or routing (e.g., FrugalGPT, RouteLLM) focuses on static or input-dependent one-shot assignments. The current sequential framework generalizes the setting to allow for interleaved querying with posterior-adaptive source selection and derives strong asymptotic optimality.
  • Connections to Operations Research for LLM Serving: The work complements parallel efforts in task-level and system-level scheduling for LLM inference (e.g., [ao2025optimizing, jaillet2025online]), by addressing the information-acquisition subproblem faced at test time by a single job or query instance.

Future Directions

The framework opens multiple avenues for generalization: extension to multi-hypothesis or structured output spaces, modeling fully continuous accuracy/cost/latency tradeoffs, and incorporating model-roster learning (where the statistical profile is not fully known a priori). Analyzing finite-sample nonasymptotics, adaptive or bandit settings with online source characteristic estimation, and robustification to adversarial (non-IID) response structures would advance the theoretical interface between sequential experiment design and practical LLM system operations.

Conclusion

This work establishes a comprehensive, information-theoretically sharp blueprint for sequential orchestration of heterogeneous LLMs in binary hypothesis testing. By dissecting the two-dimensional information-cost tradeoff and demonstrating that at most two LLM specialists are required for high-confidence optimality, the paper delivers both novel theoretical principles and powerful practical guidance for multi-model AI system design. The integration of martingale analysis, convex program allocations, and explicit operational constraints advances the state of the art in LLM deployment science. Figure 1

Figure 1

Figure 1: Left: Distributions of LLM output speed. Right: Distribution of Intelligence Index for major LLMs, highlighting test-time heterogeneity that motivates sequential orchestration.

References:

See (2604.01086) for full details. For related works: [snell2024scaling], [wu2024inference], [ao2025optimizing], and [huang2026optimal].

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We found no open problems mentioned in this paper.

Collections

Sign up for free to add this paper to one or more collections.