Query-Aware Budget-Tier Routing

Updated 8 February 2026
  • Query-aware budget-tier routing is an advanced strategy that dynamically assigns LLM queries to different models based on real-time cost and quality constraints.
  • It employs methods like contextual bandits and reinforcement learning to optimize model selection and balance performance with budget limits.
  • Empirical results demonstrate significant benefits including up to 89.8% cost savings and enhanced accuracy-cost tradeoffs in multi-model systems.

Query-aware budget-tier routing is an emerging methodology for deploying LLMs and their variants under token, dollar, or infrastructure cost constraints by dynamically selecting the most appropriate model—or computation tier—for each incoming query. This paradigm exploits the heterogeneity in LLM capabilities and costs, allocating queries to model configurations that optimize the quality-cost tradeoff according to real-time context and user-specific budgets. The following sections synthesize current research on formal models, algorithmic frameworks, architectural realizations, and empirical characteristics of this approach across LLM inference, agent memory, and multi-model orchestration settings.

1. Formal Problem Statement and Cost-Constrained Optimization

At its core, query-aware budget-tier routing concerns the allocation of queries $Q = \{q_1, \dots, q_T\}$ to a pool of models $M = \{m_1, \dots, m_K\}$ under a global or per-query budget $B$ over token/dollar consumption. The objective is to maximize expected cumulative reward, typically measured via ground-truth or LLM-based quality scores $r(q, m) \in [0,1]$, subject to a strict cost constraint:

$$\max_\pi \mathbb{E}_\pi \left[ \sum_{t=1}^T r(q_t, m_t) \right] \quad \text{subject to} \quad \sum_{t=1}^T c(q_t, m_t) \le B$$

$c(q_t, m_t)$ expresses the expected cost of routing query $q_t$ to model $m_t$, allowing for token, latency, or dollar-based granularities (Panda et al., 28 Aug 2025). Related variants include per-query cost tiers, budgeted subpopulations, and online knapsack constraints (Panda et al., 28 Aug 2025, Piskala et al., 23 Feb 2025, Xue et al., 2 Feb 2026, Zhang et al., 5 Feb 2026, Ding et al., 2024). Many extensions incorporate multi-objective tradeoffs, e.g., balancing cost, latency, accuracy, and non-functional criteria via user-adjustable weights (Piskala et al., 23 Feb 2025).
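To make the constrained objective concrete, the following is a minimal sketch of a greedy online rule: each query goes to the affordable model with the highest estimated reward, and its cost is deducted from the remaining budget. The function name `route_under_budget` and the lookup tables `reward` and `cost` are hypothetical stand-ins for the estimators $r$ and $c$. The greedy rule is for illustration only; it is not the method of any cited paper and is generally not optimal for the knapsack problem.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (query id, model id)


def route_under_budget(
    queries: List[str],
    models: List[str],
    reward: Dict[Pair, float],  # assumed estimate of r(q, m) in [0, 1]
    cost: Dict[Pair, float],    # assumed estimate of c(q, m)
    budget: float,              # global budget B
) -> Tuple[List[Optional[str]], float, float]:
    """Greedily assign each query to the best model that still fits in B."""
    remaining = budget
    total_reward = 0.0
    assignment: List[Optional[str]] = []
    for q in queries:
        # Keep only models whose cost does not exceed the remaining budget.
        affordable = [m for m in models if cost[(q, m)] <= remaining]
        if not affordable:
            assignment.append(None)  # query is left unserved
            continue
        best = max(affordable, key=lambda m: reward[(q, m)])
        assignment.append(best)
        remaining -= cost[(q, best)]
        total_reward += reward[(q, best)]
    return assignment, total_reward, budget - remaining
```

Because the rule is myopic, an early expensive choice can starve later queries, which is one motivation for the bandit and RL formulations discussed in the next section.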

2. Algorithmic Frameworks for Query-Aware Routing

2.1 Contextual Bandit and RL Formulations

Modern approaches recast LLM routing as an online contextual bandit problem. Here, each query is featurized as a context $x_q$ (embedding, task representation), while each model or tier is an arm. The PILOT (Preference-Informed LinUCB) algorithm extends LinUCB by initializing arm parameters from offline human preference triplets and refining online via bandit feedback; the UCB scoring formula integrates both affinity and uncertainty estimates (Panda et al., 28 Aug 2025). Online cost control is imposed via knapsack-oriented eligibility filtering per query/bin.
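The bandit view can be illustrated with a minimal sketch of standard disjoint LinUCB combined with a simple cost-based eligibility filter. This is not the PILOT implementation: the preference-based initialization is omitted (each arm starts from an uninformed identity prior), and the class `LinUCBArm`, the function `select_arm`, and the per-arm fixed `cost` are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence


class LinUCBArm:
    """One model/tier as a LinUCB arm with a fixed (assumed) cost."""

    def __init__(self, dim: int, cost: float):
        self.cost = cost
        # Inverse design matrix, initialized to the identity (ridge term 1).
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim  # reward-weighted sum of contexts

    def _mat_vec(self, v: Sequence[float]) -> List[float]:
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.A_inv]

    def ucb(self, x: Sequence[float], alpha: float) -> float:
        theta = self._mat_vec(self.b)  # ridge estimate theta = A^-1 b
        mean = sum(t * xi for t, xi in zip(theta, x))
        a_x = self._mat_vec(x)
        width = math.sqrt(max(sum(a * xi for a, xi in zip(a_x, x)), 0.0))
        return mean + alpha * width  # estimated reward plus exploration bonus

    def update(self, x: Sequence[float], reward: float) -> None:
        # Sherman-Morrison rank-one update of A^-1 after adding x x^T to A.
        a_x = self._mat_vec(x)
        denom = 1.0 + sum(a * xi for a, xi in zip(a_x, x))
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.A_inv[i][j] -= a_x[i] * a_x[j] / denom
        for i in range(n):
            self.b[i] += reward * x[i]


def select_arm(
    arms: List[LinUCBArm], x: Sequence[float], remaining_budget: float, alpha: float = 1.0
) -> Optional[int]:
    """Pick the highest-UCB arm among those that fit the remaining budget."""
    eligible = [i for i, arm in enumerate(arms) if arm.cost <= remaining_budget]
    if not eligible:
        return None
    return max(eligible, key=lambda i: arms[i].ucb(x, alpha))
```

In use, the router would call `select_arm` per query, observe a quality score for the chosen model, and pass it to `update`; the eligibility filter is where a knapsack-style policy would plug in.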

Alternatively, reinforcement learning with policy gradients is used to train neural routers that allocate budget tiers across modular pipelines (as in BudgetMem), trading task performance versus resource cost in a PPO framework with cost-aligned joint rewards (Zhang et al., 5 Feb 2026). In RL-based systems such as PROTEUS, SLA-aware routing is enforced through Lagrangian dual updates, where a learned dual variable modulates the routing policy to meet per-query or batch-level accuracy constraints while minimizing realized cost (Bhatti et al., 27 Jan 2026).
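The Lagrangian mechanism can be sketched with a generic dual-ascent loop: a multiplier $\lambda \ge 0$ prices accuracy against cost, rises while observed accuracy falls short of a floor $\tau$, and decays toward zero once the floor is exceeded. This is a textbook-style illustration under assumed per-model accuracy and cost estimates, not the PROTEUS policy network; the names `pick_model` and `dual_update` are hypothetical.

```python
from typing import Dict


def pick_model(pred_acc: Dict[str, float], cost: Dict[str, float], lam: float) -> str:
    """Choose the model minimizing the Lagrangian term cost - lam * accuracy."""
    return min(pred_acc, key=lambda m: cost[m] - lam * pred_acc[m])


def dual_update(lam: float, tau: float, observed_acc: float, step: float) -> float:
    """Projected dual ascent: raise lam when accuracy is below the floor tau."""
    return max(0.0, lam + step * (tau - observed_acc))
```

With `lam = 0` the rule is purely cost-minimizing; as `lam` grows, more accurate (and more expensive) models become preferable, which is how the multiplier steers the policy toward the accuracy constraint.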

2.2 Multi-Dimensional Tiering and Reasoning

Budget-tier routing generalizes beyond discrete model pools to operate over cross-products of model choice and operational parameters (e.g., token budget, reasoning style, architecture scale). R2-Router prompts for joint selection of model $M$ and output budget $b$, with score function:

$$S(x, M, b) = (1-\lambda)\, Q(x, M, b) - \lambda\, C(M, b)$$

where $Q$ and $C$ are query- and budget-tier-aware quality and cost predictors, and $\lambda$ controls cost sensitivity (Xue et al., 2 Feb 2026). This facilitates reasoning over a continuum of inference configurations, in contrast to fixed model-point routing.
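As a minimal sketch of how such a score could drive selection, the function below takes the argmax of $S(x, M, b)$ over a small grid of (model, budget) pairs. The predictors `toy_quality` and `toy_cost`, their table values, and the helper `best_config` are hypothetical placeholders; in a real router $Q$ and $C$ would be learned and put on comparable scales.

```python
from typing import Callable, Iterable, Tuple


def best_config(
    x: str,
    models: Iterable[str],
    budgets: Iterable[int],
    quality: Callable[[str, str, int], float],  # Q(x, M, b)
    cost: Callable[[str, int], float],          # C(M, b)
    lam: float,                                 # cost sensitivity in [0, 1]
) -> Tuple[str, int]:
    """Return the (model, budget) pair maximizing (1 - lam) * Q - lam * C."""
    candidates = [(m, b) for m in models for b in budgets]
    return max(
        candidates,
        key=lambda mb: (1 - lam) * quality(x, mb[0], mb[1]) - lam * cost(mb[0], mb[1]),
    )


# Toy predictors with made-up numbers, for illustration only.
_TOY_QUALITY = {("small", 100): 0.5, ("small", 500): 0.6,
                ("large", 100): 0.7, ("large", 500): 0.9}
_TOY_PRICE_PER_1K = {"small": 0.5, "large": 4.0}


def toy_quality(x: str, model: str, budget: int) -> float:
    return _TOY_QUALITY[(model, budget)]


def toy_cost(model: str, budget: int) -> float:
    return _TOY_PRICE_PER_1K[model] * budget / 1000.0
```

Sweeping `lam` from 0 to 1 moves the selection from the highest-quality configuration toward the cheapest one, tracing out a cost-quality curve for the query.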

BEST-Route further extends the space to (model, sample count) pairs: for each query, it predicts the probability that a given model and best-of-$k$ sample size will meet a quality bar, leading to a cost-efficient filter-and-minimize selection (Ding et al., 28 Jun 2025).
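The filter-and-minimize idea can be sketched as follows: keep the (model, $k$) pairs whose predicted success probability clears a threshold, then return the cheapest of them. The sketch assumes that best-of-$k$ cost is $k$ times a per-sample cost and that, when nothing clears the threshold, the most promising pair is used as a fallback; both assumptions, and the name `choose_model_and_k`, are illustrative rather than taken from BEST-Route.

```python
from typing import Dict, Tuple

Config = Tuple[str, int]  # (model id, number of samples k)


def choose_model_and_k(
    success_prob: Dict[Config, float],  # assumed predicted P(meets quality bar)
    cost_per_sample: Dict[str, float],  # assumed per-sample cost of each model
    threshold: float,
) -> Config:
    """Filter configs by predicted success, then minimize estimated cost."""
    feasible = [cfg for cfg, p in success_prob.items() if p >= threshold]
    if not feasible:
        # Fallback (assumption): no config clears the bar, take the likeliest.
        return max(success_prob, key=lambda cfg: success_prob[cfg])
    # Estimated cost of best-of-k is k times the model's per-sample cost.
    return min(feasible, key=lambda cfg: cost_per_sample[cfg[0]] * cfg[1])
```

The practical effect is that several samples from a cheap model can displace a single call to an expensive one whenever the predictor considers them sufficient.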

3. Realizations across System Architectures

3.1 Modular Agent Memory and Sharded Retrieval

In agentic LLM settings, BudgetMem structures runtime memory extraction as a pipeline of modules, each capable of being executed at multiple budget tiers (Low, Mid, High). Routing is performed at the module level, with a shared lightweight policy network choosing the cost-quality tradeoff axis (implementation tier, reasoning tier, or capacity tier per module) per query (Zhang et al., 5 Feb 2026). ShardMemo adopts a masked mixture-of-experts routing mechanism, supporting hard eligibility gating and budgeted probe caps in sharded memory architectures (Zhao et al., 29 Jan 2026).
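Hard eligibility gating with a probe cap can be pictured as a masked top-k selection: ineligible shards are excluded outright, and at most a fixed number of the remaining shards are probed in order of gate score. The sketch below is a generic illustration of that pattern, not the ShardMemo mechanism itself; `select_shards` and its inputs are hypothetical.

```python
from typing import List, Sequence


def select_shards(
    scores: Sequence[float],   # assumed router gate score per shard
    eligible: Sequence[bool],  # hard eligibility mask per shard
    probe_cap: int,            # maximum number of shards to probe
) -> List[int]:
    """Return indices of up to probe_cap eligible shards, best score first."""
    # Hard gating: ineligible shards never enter the candidate list.
    candidates = [i for i, ok in enumerate(eligible) if ok]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    # Budgeted probing: truncate to the cap.
    return candidates[:max(probe_cap, 0)]
```

Lowering `probe_cap` bounds retrieval cost per query regardless of how many shards the gate scores highly.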

3.2 Multi-Objective and Preference-Driven Model Routing

Systems such as OptiRoute leverage multi-objective optimization and hybrid routing (e.g., kNN search with hierarchical filtering), mapping user-defined weights for cost, latency, and accuracy—plus budget caps—into rapid, explainable decision policies (Piskala et al., 23 Feb 2025). Performance, cost, and latency curves are monitored at distinct budget tiers to provide users with explicit control over tradeoff selection.

3.3 SLA Control and Operator-Friendly Interfaces

PROTEUS demonstrates an SLA-oriented architecture, enabling operators to specify an accuracy floor $\tau$ at runtime. Its policy network is conditioned jointly on the query, the desired accuracy, and a learned Lagrange multiplier. Empirically, this yields 100% floor compliance for SLA constraints, and runtime cost savings up to 89.8% relative to static policies (Bhatti et al., 27 Jan 2026).

4. Mechanisms for Budget and Tier Control

Several approaches implement budget control as a soft or hard eligibility constraint over model or module selection per query:

| Approach | Control Mechanism | Tiering Axis | Reference |
|---|---|---|---|
| PILOT | Online knapsack (ZCL), arms | Model | (Panda et al., 28 Aug 2025) |
| BudgetMem | PPO RL with cost reward | Module tier (imp/reason/cap) | (Zhang et al., 5 Feb 2026) |
| R2-Router | Explicit token-budget control | Model × Output Len | (Xue et al., 2 Feb 2026) |
| ShardMemo | Probe cap, cost-bias gating | Shard selection | (Zhao et al., 29 Jan 2026) |
| OptiRoute | User-specified cost constraint | Model | (Piskala et al., 23 Feb 2025) |
| PROTEUS | SLA floor via Lagrangian dual | Model | (Bhatti et al., 27 Jan 2026) |

Budget tiers may be represented as discrete sets (e.g., $\mathcal{B} = \{10, 100, 500, \dots\}$ tokens or Low/Mid/High module configurations) or as continuous controls (e.g., confidence thresholds, penalty weights). The adaptive control of these tiers is essential to navigating the convex Pareto frontiers observed in empirical cost/quality curves.

5. Empirical Findings and Failure Modes

Experiments across LoCoMo, HotpotQA, LongMemEval, and large MLaaS traces consistently demonstrate that query-aware budget-tier routing substantially improves both cost and accuracy metrics. For example:

  • BudgetMem variants outperform all prior memory-augmented agents in both F1 and Judge scores across datasets and yield superior accuracy-cost frontiers under tight and loose budgets (Zhang et al., 5 Feb 2026).
  • OptiRoute achieves up to 60% cost savings with only a 6.6% accuracy drop at low budgets, with convex tradeoff curves (Piskala et al., 23 Feb 2025).
  • R2-Router realizes state-of-the-art AUDC and QNC, performing at 4-5× lower cost than reactive baselines (Xue et al., 2 Feb 2026).
  • PROTEUS policies satisfy stringent user accuracy floors dynamically while delivering major cost reductions in production-scale benchmarks (Bhatti et al., 27 Jan 2026).

A notable challenge is "routing collapse": as user budget increases, scalar-prediction routers tend to default toward the strongest model, neglecting cheaper models even when they would suffice. EquiRouter directly addresses this via a ranking-based loss aligned with decision structure, reducing cost by 17% at GPT-4-level performance versus prior state-of-the-art routers (Lai et al., 3 Feb 2026).

6. Practical Guidelines and Design Recommendations

7. Extensions, Open Challenges, and Future Directions

Current systems predominantly rely on manually defined modules, discrete cost tiers, and offline human-labeled or LLM-judge supervision. Future work may emphasize:

  • Automatic module/tier discovery via meta-learning or differentiable architecture search (Zhang et al., 5 Feb 2026).
  • Finer-grained and continuous budget control to algorithmically approach Pareto frontiers in smoother detail.
  • Joint optimization across retrieval, routing, and inference for tighter envelope on performance-cost tradeoffs.
  • Extension to out-of-distribution queries, dynamic model pools, and federated/multi-tenant settings for broader applicability (Li et al., 8 Jun 2025, Wu et al., 2 Sep 2025).
  • Transparent user and operator interfaces that expose interpretable cost-quality tradeoffs and enable on-the-fly SLA specification (Piskala et al., 23 Feb 2025, Bhatti et al., 27 Jan 2026).

The overarching trend coalesces around query-aware, fine-grained, and explicitly budget-controlled routing systems grounded in principled online optimization, reinforcement learning, and modular system architecture. These advances jointly enable cost-effective and adaptive deployment of heterogeneous LLMs at scale.
