Query-Aware Budget-Tier Routing
- Query-aware budget-tier routing is an advanced strategy that dynamically assigns LLM queries to different models based on real-time cost and quality constraints.
- It employs methods like contextual bandits and reinforcement learning to optimize model selection and balance performance with budget limits.
- Empirical results demonstrate significant benefits including up to 89.8% cost savings and enhanced accuracy-cost tradeoffs in multi-model systems.
Query-aware budget-tier routing is an emerging methodology for deploying LLMs and their variants under token, dollar, or infrastructure cost constraints by dynamically selecting the most appropriate model—or computation tier—for each incoming query. This paradigm exploits the heterogeneity in LLM capabilities and costs, allocating queries to model configurations that optimize the quality-cost tradeoff according to real-time context and user-specific budgets. The following sections synthesize current research on formal models, algorithmic frameworks, architectural realizations, and empirical characteristics of this approach across LLM inference, agent memory, and multi-model orchestration settings.
1. Formal Problem Statement and Cost-Constrained Optimization
At its core, query-aware budget-tier routing concerns the allocation of queries to a pool of models under a global or per-query budget over token/dollar consumption. The objective is to maximize expected cumulative reward—typically measured via ground-truth or LLM-based quality scores—subject to a strict cost constraint:

$$\max_{\pi} \; \mathbb{E}\left[\sum_{t} r\big(q_t, \pi(q_t)\big)\right] \quad \text{s.t.} \quad \sum_{t} c\big(q_t, \pi(q_t)\big) \le B,$$

where $c(q, m)$ expresses the expected cost of routing query $q$ to model $m$, allowing for token, latency, or dollar-based granularities (Panda et al., 28 Aug 2025). Related variants include per-query cost tiers, budgeted subpopulations, and online knapsack constraints (Panda et al., 28 Aug 2025, Piskala et al., 23 Feb 2025, Xue et al., 2 Feb 2026, Zhang et al., 5 Feb 2026, Ding et al., 2024). Many extensions incorporate multi-objective tradeoffs, e.g., balancing cost, latency, accuracy, and non-functional criteria via user-adjustable weights (Piskala et al., 23 Feb 2025).
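A minimal toy illustration of the hard-constraint variant: models are filtered by a per-query cap and the remaining global budget, and the highest-quality eligible option is chosen. All model names, quality scores, and costs below are hypothetical.

```python
# Hypothetical quality/cost table; real systems learn these per query.
MODELS = {
    "small":  {"quality": 0.62, "cost": 0.2},
    "medium": {"quality": 0.78, "cost": 1.0},
    "large":  {"quality": 0.90, "cost": 5.0},
}

def route(remaining_budget: float, per_query_cap: float) -> str:
    """Pick the highest-quality model whose expected cost fits both the
    per-query cap and the remaining global budget (hard eligibility filter)."""
    eligible = [
        name for name, m in MODELS.items()
        if m["cost"] <= per_query_cap and m["cost"] <= remaining_budget
    ]
    if not eligible:
        # Fallback: every query must still be answered, so take the
        # cheapest model (this may slightly overshoot the global budget).
        return "small"
    return max(eligible, key=lambda n: MODELS[n]["quality"])

budget = 10.0
choices = []
for _ in range(4):
    m = route(budget, per_query_cap=5.0)
    choices.append(m)
    budget -= MODELS[m]["cost"]
# Routing degrades from "large" to "small" as the budget is spent.
```

Sweeping the per-query cap and global budget in such a scheme is what traces out the cost-quality frontier discussed later.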
2. Algorithmic Frameworks for Query-Aware Routing
2.1 Contextual Bandit and RL Formulations
Modern approaches recast LLM routing as an online contextual bandit problem. Here, each query is featurized as a context (embedding, task representation), while each model or tier is an arm. The PILOT (Preference-Informed LinUCB) algorithm extends LinUCB by initializing arm parameters from offline human preference triplets and refining online via bandit feedback; the UCB scoring formula integrates both affinity and uncertainty estimates (Panda et al., 28 Aug 2025). Online cost control is imposed via knapsack-oriented eligibility filtering per query/bin.
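The LinUCB backbone of such routers can be sketched as follows; the warm-starting from offline preference triplets used by PILOT is stubbed out as plain identity/zero initialization here, so this is a generic contextual-bandit sketch rather than the paper's implementation.

```python
import numpy as np

class LinUCBRouter:
    """LinUCB routing sketch: one linear model per arm (LLM);
    UCB score = affinity estimate + exploration bonus."""

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # d x d design matrix per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # reward-weighted context sum

    def select(self, x: np.ndarray, eligible: list) -> int:
        """Score only budget-eligible arms (knapsack-style filtering)."""
        scores = {}
        for a in eligible:
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                      # affinity estimate
            scores[a] = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
        return max(scores, key=scores.get)

    def update(self, a: int, x: np.ndarray, reward: float) -> None:
        """Rank-one update from observed bandit feedback."""
        self.A[a] += np.outer(x, x)
        self.b[a] += reward * x
```

In use, each query embedding is passed to `select` with the arms that survive the budget filter, and the observed quality score is fed back via `update`.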
Alternatively, reinforcement learning with policy gradients is used to train neural routers that allocate budget tiers across modular pipelines (as in BudgetMem), trading task performance versus resource cost in a PPO framework with cost-aligned joint rewards (Zhang et al., 5 Feb 2026). In RL-based systems such as PROTEUS, SLA-aware routing is enforced through Lagrangian dual updates, where a learned dual variable modulates the routing policy to meet per-query or batch-level accuracy constraints while minimizing realized cost (Bhatti et al., 27 Jan 2026).
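The Lagrangian-dual idea can be sketched with a scalar dual variable that prices the accuracy constraint; the models, floor, and update rule below are illustrative stand-ins, not the PROTEUS implementation.

```python
def dual_route(models, lam):
    """Minimize cost - lam * accuracy: lam prices the accuracy floor."""
    return min(models, key=lambda m: m["cost"] - lam * m["acc"])

# Hypothetical two-model pool.
models = [
    {"name": "cheap",  "acc": 0.70, "cost": 1.0},
    {"name": "strong", "acc": 0.92, "cost": 3.0},
]

lam, floor, lr = 0.5, 0.85, 1.0
history = []
for _ in range(100):
    m = dual_route(models, lam)
    history.append(m["name"])
    # Dual ascent: raise lam while realized accuracy is below the floor
    # (pushing the policy toward stronger models), relax it otherwise.
    lam = max(0.0, lam + lr * (floor - m["acc"]))
```

The dual variable drifts upward while the cheap model violates the floor, eventually tipping routing to the strong model, then oscillates near the constraint boundary, which is the qualitative behavior a learned dual in an SLA-aware router is meant to capture.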
2.2 Multi-Dimensional Tiering and Reasoning
Budget-tier routing generalizes beyond discrete model pools to operate over cross-products of model choice and operational parameters (e.g., token budget, reasoning style, architecture scale). R2-Router performs joint selection of a model $m$ and output budget $b$ with score function:

$$s(m, b \mid q) = \hat{Q}(q, m, b) - \lambda\, \hat{C}(q, m, b),$$

where $\hat{Q}$ and $\hat{C}$ are query- and budget-tier-aware quality and cost predictors, and $\lambda$ controls cost sensitivity (Xue et al., 2 Feb 2026). This facilitates reasoning over a continuum of inference configurations, in contrast to fixed model-point routing.
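A toy instantiation of this quality-minus-weighted-cost scoring over the cross-product of model and output token budget, with lookup tables standing in for the learned predictors (all numbers hypothetical):

```python
# Hypothetical predicted quality Q and cost C per (model, token-budget) pair.
Q = {("small", 256): 0.55, ("small", 1024): 0.60,
     ("large", 256): 0.80, ("large", 1024): 0.88}
C = {("small", 256): 0.1, ("small", 1024): 0.3,
     ("large", 256): 1.0, ("large", 1024): 3.5}

def select(cost_weight: float):
    """Pick the (model, budget) pair maximizing quality - weight * cost."""
    return max(Q, key=lambda p: Q[p] - cost_weight * C[p])
```

Sweeping the cost weight traces a quality-cost frontier: a weight of zero picks the strongest, longest configuration, while a high weight collapses selection to the cheapest pair.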
BEST-Route further extends the space to (model, sample count) pairs: for each query, it predicts the probability that a given model with a best-of-$n$ sample size will meet a quality bar, leading to a cost-efficient filter-and-minimize selection (Ding et al., 28 Jun 2025).
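The filter-and-minimize step can be sketched as below. The sketch assumes i.i.d. samples so that the best-of-$n$ success probability has a closed form; the per-sample probabilities and costs are invented, and BEST-Route's actual predictor is learned rather than analytic.

```python
def p_meet_bar(p_single: float, n: int) -> float:
    """P(at least one of n i.i.d. samples clears the quality bar)."""
    return 1.0 - (1.0 - p_single) ** n

def best_route(models, target: float):
    """Filter (model, n) pairs predicted to meet the target probability,
    then minimize expected cost (cost per sample times n)."""
    candidates = []
    for m in models:
        for n in (1, 2, 4, 8):
            if p_meet_bar(m["p_single"], n) >= target:
                candidates.append((m["cost"] * n, m["name"], n))
    return min(candidates) if candidates else None

# Hypothetical pool: a weak cheap model and a strong expensive one.
models = [
    {"name": "small", "p_single": 0.5, "cost": 1.0},
    {"name": "large", "p_single": 0.9, "cost": 10.0},
]
```

At a moderate quality target, several cheap samples from the small model undercut a single call to the large one; only very strict targets force the expensive model.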
3. Realizations across System Architectures
3.1 Modular Agent Memory and Sharded Retrieval
In agentic LLM settings, BudgetMem structures runtime memory extraction as a pipeline of modules, each capable of being executed at multiple budget tiers (Low, Mid, High). Routing is performed at the module level, with a shared lightweight policy network choosing the cost-quality tradeoff axis (implementation tier, reasoning tier, or capacity tier per module) per query (Zhang et al., 5 Feb 2026). ShardMemo adopts a masked mixture-of-experts routing mechanism, supporting hard eligibility gating and budgeted probe caps in sharded memory architectures (Zhao et al., 29 Jan 2026).
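Module-level tier routing of this kind can be sketched greedily; the module names, tier costs, and per-tier scores below are invented for illustration, and BudgetMem's actual policy is a learned network rather than a lookup.

```python
# Hypothetical per-tier cost units.
TIER_COST = {"low": 1, "mid": 2, "high": 4}

def route_modules(modules, tier_scores, budget):
    """Greedy tier assignment: per module, pick the highest-scoring tier
    still affordable under the remaining budget (fall back to 'low')."""
    assignment = {}
    for mod in modules:
        affordable = [t for t in TIER_COST if TIER_COST[t] <= budget] or ["low"]
        best = max(affordable, key=lambda t: tier_scores[(mod, t)])
        assignment[mod] = best
        budget -= TIER_COST[best]
    return assignment

# Hypothetical predicted utility of each (module, tier) pair for one query.
tier_scores = {
    ("extract", "low"): 0.3, ("extract", "mid"): 0.6, ("extract", "high"): 0.9,
    ("summarize", "low"): 0.4, ("summarize", "mid"): 0.5, ("summarize", "high"): 0.55,
}
assignment = route_modules(["extract", "summarize"], tier_scores, budget=5)
```

The point of the sketch is the tradeoff axis: spending the budget on a high tier for one module forces cheaper tiers elsewhere, which is exactly what a learned per-query policy must arbitrate.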
3.2 Multi-Objective and Preference-Driven Model Routing
Systems such as OptiRoute leverage multi-objective optimization and hybrid routing (e.g., kNN search with hierarchical filtering), mapping user-defined weights for cost, latency, and accuracy—plus budget caps—into rapid, explainable decision policies (Piskala et al., 23 Feb 2025). Performance, cost, and latency curves are monitored at distinct budget tiers to provide users with explicit control over tradeoff selection.
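A stripped-down version of weighted multi-objective selection with a hard budget cap is sketched below; the linear scoring rule, weights, and model table are illustrative and not OptiRoute's actual policy (which adds kNN search and hierarchical filtering).

```python
def score(m, w):
    """Higher is better: reward accuracy, penalize cost and latency."""
    return w["acc"] * m["acc"] - w["cost"] * m["cost"] - w["lat"] * m["lat"]

def pick(models, w, budget_cap):
    """Hard budget cap first, then weighted multi-objective argmax."""
    feasible = [m for m in models if m["cost"] <= budget_cap]
    return max(feasible, key=lambda m: score(m, w))["name"]

# Hypothetical model pool with accuracy, cost, and latency attributes.
models = [
    {"name": "fast-cheap", "acc": 0.70, "cost": 0.5, "lat": 0.2},
    {"name": "balanced",   "acc": 0.82, "cost": 2.0, "lat": 0.8},
    {"name": "frontier",   "acc": 0.93, "cost": 8.0, "lat": 2.5},
]
```

Tightening the cap walks the selection down the frontier even when the user's weights favor accuracy, which is the explicit tradeoff control the section describes.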
3.3 SLA Control and Operator-Friendly Interfaces
PROTEUS demonstrates an SLA-oriented architecture, enabling operators to specify an accuracy floor at runtime. Its policy network is conditioned jointly on the query, the desired accuracy, and a learned Lagrange multiplier. Empirically, this yields 100% floor compliance for SLA constraints and substantial runtime cost savings relative to static policies (Bhatti et al., 27 Jan 2026).
4. Mechanisms for Budget and Tier Control
Several approaches implement budget control as a soft or hard eligibility constraint over model or module selection per query:
| Approach | Control Mechanism | Tiering Axis | Reference |
|---|---|---|---|
| PILOT | Online knapsack (ZCL), arms | Model | (Panda et al., 28 Aug 2025) |
| BudgetMem | PPO RL with cost reward | Module tier (imp/reason/cap) | (Zhang et al., 5 Feb 2026) |
| R2-Router | Explicit token-budget control | Model × Output Len | (Xue et al., 2 Feb 2026) |
| ShardMemo | Probe cap, cost-bias gating | Shard selection | (Zhao et al., 29 Jan 2026) |
| OptiRoute | User-specified cost constraint | Model | (Piskala et al., 23 Feb 2025) |
| PROTEUS | SLA floor via Lagrangian dual | Model | (Bhatti et al., 27 Jan 2026) |
Budget tiers may be represented as discrete sets—e.g., fixed token budgets or Low/Mid/High module configurations—or as continuous controls (e.g., confidence thresholds, penalty weights). Adaptive control of these tiers is essential for navigating the convex Pareto frontiers observed in empirical cost/quality curves.
5. Empirical Findings and Failure Modes
Experiments across LoCoMo, HotpotQA, LongMemEval, and large MLaaS traces consistently demonstrate that query-aware budget-tier routing substantially improves both cost and accuracy metrics. For example:
- BudgetMem variants outperform all prior memory-augmented agents in both F1 and Judge scores across datasets and yield superior accuracy-cost frontiers under tight and loose budgets (Zhang et al., 5 Feb 2026).
- OptiRoute achieves up to 60% cost savings with only a 6.6% accuracy drop at low budgets, with convex tradeoff curves (Piskala et al., 23 Feb 2025).
- R2-Router realizes state-of-the-art AUDC and QNC, performing at 4-5× lower cost than reactive baselines (Xue et al., 2 Feb 2026).
- PROTEUS policies satisfy stringent user accuracy floors dynamically while delivering major cost reductions in production-scale benchmarks (Bhatti et al., 27 Jan 2026).
A notable challenge is "routing collapse": as user budget increases, scalar-prediction routers tend to default toward the strongest model, neglecting cheaper models even when they would suffice. EquiRouter directly addresses this via a ranking-based loss aligned with decision structure, reducing cost by 17% at GPT-4-level performance versus prior state-of-the-art routers (Lai et al., 3 Feb 2026).
6. Practical Guidelines and Design Recommendations
- Supervised initialization and online bandit learning facilitate both accurate affinity estimation and continual adaptation to query distribution drift (Panda et al., 28 Aug 2025).
- Multi-axis tiering strategies (implementation, reasoning, capacity) allow finer control and efficiency across diverse budget and quality regimes (Zhang et al., 5 Feb 2026).
- Pairwise ranking losses, rather than scalar pointwise predictions, better align router outputs with the discrete select-within-budget decision, in particular mitigating degenerate routing collapse (Lai et al., 3 Feb 2026).
- Explicit cost modeling and tight integration of known operational parameters (token pricing, latency metrics) are critical for meaningful frontier tracing and tight budget adherence (Xue et al., 2 Feb 2026, Piskala et al., 23 Feb 2025, Panda et al., 28 Aug 2025).
- Calibration of tradeoff knobs (thresholds, dual parameters, tier choice) must be validated against real-world accuracy targets and cost constraints to guarantee SLA compliance in deployment (Bhatti et al., 27 Jan 2026, Ding et al., 28 Jun 2025, Piskala et al., 23 Feb 2025).
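The ranking-based alternative to scalar prediction can be sketched as a pairwise logistic loss over per-query model outcomes; this is a generic sketch of the idea behind ranking-trained routers such as EquiRouter, not that system's exact objective.

```python
import numpy as np

def pairwise_ranking_loss(scores: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean logistic loss over all ordered pairs where model i truly
    outperformed model j on this query: penalize the router when the
    worse model is scored above the better one."""
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if outcomes[i] > outcomes[j]:
                loss += np.log1p(np.exp(scores[j] - scores[i]))
                pairs += 1
    return loss / max(pairs, 1)
```

A router whose scores respect the true outcome ordering incurs lower loss than one that inverts it, which is the property that keeps selection meaningful across budget levels instead of collapsing to the strongest model.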
7. Extensions, Open Challenges, and Future Directions
Current systems predominantly rely on manually defined modules, discrete cost tiers, and offline human-labeled or LLM-judge supervision. Future work may emphasize:
- Automatic module/tier discovery via meta-learning or differentiable architecture search (Zhang et al., 5 Feb 2026).
- Finer-grained and continuous budget control, allowing routing policies to trace Pareto frontiers more smoothly.
- Joint optimization across retrieval, routing, and inference for a tighter envelope on performance-cost tradeoffs.
- Extension to out-of-distribution queries, dynamic model pools, and federated/multi-tenant settings for broader applicability (Li et al., 8 Jun 2025, Wu et al., 2 Sep 2025).
- Transparent user and operator interfaces that expose interpretable cost-quality tradeoffs and enable on-the-fly SLA specification (Piskala et al., 23 Feb 2025, Bhatti et al., 27 Jan 2026).
The overarching trend coalesces around query-aware, fine-grained, and explicitly budget-controlled routing systems grounded in principled online optimization, reinforcement learning, and modular system architecture. These advances jointly enable cost-effective and adaptive deployment of heterogeneous LLMs at scale.