Query-Aware Budget-Tier Routing
- Query-aware budget-tier routing is an advanced strategy that dynamically assigns LLM queries to different models based on real-time cost and quality constraints.
- It employs methods like contextual bandits and reinforcement learning to optimize model selection and balance performance with budget limits.
- Empirical results demonstrate significant benefits including up to 89.8% cost savings and enhanced accuracy-cost tradeoffs in multi-model systems.
Query-aware budget-tier routing is an emerging methodology for deploying LLMs and their variants under token, dollar, or infrastructure cost constraints by dynamically selecting the most appropriate model—or computation tier—for each incoming query. This paradigm exploits the heterogeneity in LLM capabilities and costs, allocating queries to model configurations that optimize the quality-cost tradeoff according to real-time context and user-specific budgets. The following sections synthesize current research on formal models, algorithmic frameworks, architectural realizations, and empirical characteristics of this approach across LLM inference, agent memory, and multi-model orchestration settings.
1. Formal Problem Statement and Cost-Constrained Optimization
At its core, query-aware budget-tier routing concerns the allocation of queries to a pool of models under a global or per-query budget over token/dollar consumption. The objective is to maximize expected cumulative reward—typically measured via ground-truth or LLM-based quality scores—subject to a strict cost constraint:

$$\max_{\pi} \; \mathbb{E}\left[\sum_{t} r\big(q_t, \pi(q_t)\big)\right] \quad \text{s.t.} \quad \sum_{t} c\big(q_t, \pi(q_t)\big) \le B,$$

where $c(q, m)$ expresses the expected cost of routing query $q$ to model $m$, allowing for token, latency, or dollar-based granularities (Panda et al., 28 Aug 2025). Related variants include per-query cost tiers, budgeted subpopulations, and online knapsack constraints (Panda et al., 28 Aug 2025, Piskala et al., 23 Feb 2025, Xue et al., 2 Feb 2026, Zhang et al., 5 Feb 2026, Ding et al., 2024). Many extensions incorporate multi-objective tradeoffs, e.g., balancing cost, latency, accuracy, and non-functional criteria via user-adjustable weights (Piskala et al., 23 Feb 2025).
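A minimal toy illustration of the hard-constraint variant: models are filtered by a per-query cap and the remaining global budget, and the highest-quality eligible option is chosen. All model names, quality scores, and costs below are hypothetical.

```python
# Hypothetical quality/cost table; real systems learn these per query.
MODELS = {
    "small":  {"quality": 0.62, "cost": 0.2},
    "medium": {"quality": 0.78, "cost": 1.0},
    "large":  {"quality": 0.90, "cost": 5.0},
}

def route(remaining_budget: float, per_query_cap: float) -> str:
    """Pick the highest-quality model whose expected cost fits both the
    per-query cap and the remaining global budget (hard eligibility filter)."""
    eligible = [
        name for name, m in MODELS.items()
        if m["cost"] <= per_query_cap and m["cost"] <= remaining_budget
    ]
    if not eligible:
        # Fallback: every query must still be answered, so take the
        # cheapest model (this may slightly overshoot the global budget).
        return "small"
    return max(eligible, key=lambda n: MODELS[n]["quality"])

budget = 10.0
choices = []
for _ in range(4):
    m = route(budget, per_query_cap=5.0)
    choices.append(m)
    budget -= MODELS[m]["cost"]
# Routing degrades from "large" to "small" as the budget is spent.
```

Sweeping the per-query cap and global budget in such a scheme is what traces out the cost-quality frontier discussed later.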
2. Algorithmic Frameworks for Query-Aware Routing
2.1 Contextual Bandit and RL Formulations
Modern approaches recast LLM routing as an online contextual bandit problem. Here, each query is featurized as a context (embedding, task representation), while each model or tier is an arm. The PILOT (Preference-Informed LinUCB) algorithm extends LinUCB by initializing arm parameters from offline human preference triplets and refining online via bandit feedback; the UCB scoring formula integrates both affinity and uncertainty estimates (Panda et al., 28 Aug 2025). Online cost control is imposed via knapsack-oriented eligibility filtering per query/bin.
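The LinUCB backbone of such routers can be sketched as follows; the warm-starting from offline preference triplets used by PILOT is stubbed out as plain identity/zero initialization here, so this is a generic contextual-bandit sketch rather than the paper's implementation.

```python
import numpy as np

class LinUCBRouter:
    """LinUCB routing sketch: one linear model per arm (LLM);
    UCB score = affinity estimate + exploration bonus."""

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # d x d design matrix per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # reward-weighted context sum

    def select(self, x: np.ndarray, eligible: list) -> int:
        """Score only budget-eligible arms (knapsack-style filtering)."""
        scores = {}
        for a in eligible:
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                      # affinity estimate
            scores[a] = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
        return max(scores, key=scores.get)

    def update(self, a: int, x: np.ndarray, reward: float) -> None:
        """Rank-one update from observed bandit feedback."""
        self.A[a] += np.outer(x, x)
        self.b[a] += reward * x
```

In use, each query embedding is passed to `select` with the arms that survive the budget filter, and the observed quality score is fed back via `update`.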
Alternatively, reinforcement learning with policy gradients is used to train neural routers that allocate budget tiers across modular pipelines (as in BudgetMem), trading task performance versus resource cost in a PPO framework with cost-aligned joint rewards (Zhang et al., 5 Feb 2026). In RL-based systems such as PROTEUS, SLA-aware routing is enforced through Lagrangian dual updates, where a learned dual variable modulates the routing policy to meet per-query or batch-level accuracy constraints while minimizing realized cost (Bhatti et al., 27 Jan 2026).
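The Lagrangian-dual idea can be sketched with a scalar dual variable that prices the accuracy constraint; the models, floor, and update rule below are illustrative stand-ins, not the PROTEUS implementation.

```python
def dual_route(models, lam):
    """Minimize cost - lam * accuracy: lam prices the accuracy floor."""
    return min(models, key=lambda m: m["cost"] - lam * m["acc"])

# Hypothetical two-model pool.
models = [
    {"name": "cheap",  "acc": 0.70, "cost": 1.0},
    {"name": "strong", "acc": 0.92, "cost": 3.0},
]

lam, floor, lr = 0.5, 0.85, 1.0
history = []
for _ in range(100):
    m = dual_route(models, lam)
    history.append(m["name"])
    # Dual ascent: raise lam while realized accuracy is below the floor
    # (pushing the policy toward stronger models), relax it otherwise.
    lam = max(0.0, lam + lr * (floor - m["acc"]))
```

The dual variable drifts upward while the cheap model violates the floor, eventually tipping routing to the strong model, then oscillates near the constraint boundary, which is the qualitative behavior a learned dual in an SLA-aware router is meant to capture.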
2.2 Multi-Dimensional Tiering and Reasoning
Budget-tier routing generalizes beyond discrete model pools to operate over cross-products of model choice and operational parameters (e.g., token budget, reasoning style, architecture scale). R2-Router performs joint selection of a model $m$ and output budget $b$ with score function:

$$s(m, b \mid q) = \hat{Q}(q, m, b) - \lambda\, \hat{C}(q, m, b),$$

where $\hat{Q}$ and $\hat{C}$ are query- and budget-tier-aware quality and cost predictors, and $\lambda$ controls cost sensitivity (Xue et al., 2 Feb 2026). This facilitates reasoning over a continuum of inference configurations, in contrast to fixed model-point routing.
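A toy instantiation of this quality-minus-weighted-cost scoring over the cross-product of model and output token budget, with lookup tables standing in for the learned predictors (all numbers hypothetical):

```python
# Hypothetical predicted quality Q and cost C per (model, token-budget) pair.
Q = {("small", 256): 0.55, ("small", 1024): 0.60,
     ("large", 256): 0.80, ("large", 1024): 0.88}
C = {("small", 256): 0.1, ("small", 1024): 0.3,
     ("large", 256): 1.0, ("large", 1024): 3.5}

def select(cost_weight: float):
    """Pick the (model, budget) pair maximizing quality - weight * cost."""
    return max(Q, key=lambda p: Q[p] - cost_weight * C[p])
```

Sweeping the cost weight traces a quality-cost frontier: a weight of zero picks the strongest, longest configuration, while a high weight collapses selection to the cheapest pair.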
BEST-Route further extends the space to (model, sample count) pairs: for each query, it predicts the probability that a given model with a best-of-$n$ sample size will meet a quality bar, leading to a cost-efficient filter-and-minimize selection (Ding et al., 28 Jun 2025).
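The filter-and-minimize step can be sketched as below. The sketch assumes i.i.d. samples so that the best-of-$n$ success probability has a closed form; the per-sample probabilities and costs are invented, and BEST-Route's actual predictor is learned rather than analytic.

```python
def p_meet_bar(p_single: float, n: int) -> float:
    """P(at least one of n i.i.d. samples clears the quality bar)."""
    return 1.0 - (1.0 - p_single) ** n

def best_route(models, target: float):
    """Filter (model, n) pairs predicted to meet the target probability,
    then minimize expected cost (cost per sample times n)."""
    candidates = []
    for m in models:
        for n in (1, 2, 4, 8):
            if p_meet_bar(m["p_single"], n) >= target:
                candidates.append((m["cost"] * n, m["name"], n))
    return min(candidates) if candidates else None

# Hypothetical pool: a weak cheap model and a strong expensive one.
models = [
    {"name": "small", "p_single": 0.5, "cost": 1.0},
    {"name": "large", "p_single": 0.9, "cost": 10.0},
]
```

At a moderate quality target, several cheap samples from the small model undercut a single call to the large one; only very strict targets force the expensive model.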
3. Realizations across System Architectures
3.1 Modular Agent Memory and Sharded Retrieval
In agentic LLM settings, BudgetMem structures runtime memory extraction as a pipeline of modules, each capable of being executed at multiple budget tiers (Low, Mid, High). Routing is performed at the module level, with a shared lightweight policy network choosing the cost-quality tradeoff axis (implementation tier, reasoning tier, or capacity tier per module) per query (Zhang et al., 5 Feb 2026). ShardMemo adopts a masked mixture-of-experts routing mechanism, supporting hard eligibility gating and budgeted probe caps in sharded memory architectures (Zhao et al., 29 Jan 2026).
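Module-level tier routing of this kind can be sketched greedily; the module names, tier costs, and per-tier scores below are invented for illustration, and BudgetMem's actual policy is a learned network rather than a lookup.

```python
# Hypothetical per-tier cost units.
TIER_COST = {"low": 1, "mid": 2, "high": 4}

def route_modules(modules, tier_scores, budget):
    """Greedy tier assignment: per module, pick the highest-scoring tier
    still affordable under the remaining budget (fall back to 'low')."""
    assignment = {}
    for mod in modules:
        affordable = [t for t in TIER_COST if TIER_COST[t] <= budget] or ["low"]
        best = max(affordable, key=lambda t: tier_scores[(mod, t)])
        assignment[mod] = best
        budget -= TIER_COST[best]
    return assignment

# Hypothetical predicted utility of each (module, tier) pair for one query.
tier_scores = {
    ("extract", "low"): 0.3, ("extract", "mid"): 0.6, ("extract", "high"): 0.9,
    ("summarize", "low"): 0.4, ("summarize", "mid"): 0.5, ("summarize", "high"): 0.55,
}
assignment = route_modules(["extract", "summarize"], tier_scores, budget=5)
```

The point of the sketch is the tradeoff axis: spending the budget on a high tier for one module forces cheaper tiers elsewhere, which is exactly what a learned per-query policy must arbitrate.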
3.2 Multi-Objective and Preference-Driven Model Routing
Systems such as OptiRoute leverage multi-objective optimization and hybrid routing (e.g., kNN search with hierarchical filtering), mapping user-defined weights for cost, latency, and accuracy—plus budget caps—into rapid, explainable decision policies (Piskala et al., 23 Feb 2025). Performance, cost, and latency curves are monitored at distinct budget tiers to provide users with explicit control over tradeoff selection.
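A stripped-down version of weighted multi-objective selection with a hard budget cap is sketched below; the linear scoring rule, weights, and model table are illustrative and not OptiRoute's actual policy (which adds kNN search and hierarchical filtering).

```python
def score(m, w):
    """Higher is better: reward accuracy, penalize cost and latency."""
    return w["acc"] * m["acc"] - w["cost"] * m["cost"] - w["lat"] * m["lat"]

def pick(models, w, budget_cap):
    """Hard budget cap first, then weighted multi-objective argmax."""
    feasible = [m for m in models if m["cost"] <= budget_cap]
    return max(feasible, key=lambda m: score(m, w))["name"]

# Hypothetical model pool with accuracy, cost, and latency attributes.
models = [
    {"name": "fast-cheap", "acc": 0.70, "cost": 0.5, "lat": 0.2},
    {"name": "balanced",   "acc": 0.82, "cost": 2.0, "lat": 0.8},
    {"name": "frontier",   "acc": 0.93, "cost": 8.0, "lat": 2.5},
]
```

Tightening the cap walks the selection down the frontier even when the user's weights favor accuracy, which is the explicit tradeoff control the section describes.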
3.3 SLA Control and Operator-Friendly Interfaces
PROTEUS demonstrates an SLA-oriented architecture, enabling operators to specify an accuracy floor at runtime. Its policy network is conditioned jointly on the query, the desired accuracy, and a learned Lagrange multiplier. Empirically, this yields 100% floor compliance for SLA constraints and substantial runtime cost savings relative to static policies (Bhatti et al., 27 Jan 2026).
4. Mechanisms for Budget and Tier Control
Several approaches implement budget control as a soft or hard eligibility constraint over model or module selection per query:
| Approach | Control Mechanism | Tiering Axis | Reference |
|---|---|---|---|
| PILOT | Online knapsack (ZCL), arms | Model | (Panda et al., 28 Aug 2025) |
| BudgetMem | PPO RL with cost reward | Module tier (imp/reason/cap) | (Zhang et al., 5 Feb 2026) |
| R2-Router | Explicit token-budget control | Model × Output Len | (Xue et al., 2 Feb 2026) |
| ShardMemo | Probe cap, cost-bias gating | Shard selection | (Zhao et al., 29 Jan 2026) |
| OptiRoute | User-specified cost constraint | Model | (Piskala et al., 23 Feb 2025) |
| PROTEUS | SLA floor via Lagrangian dual | Model | (Bhatti et al., 27 Jan 2026) |
Budget tiers may be represented as discrete sets—e.g., fixed token budgets or Low/Mid/High module configurations—or as continuous controls (e.g., confidence thresholds, penalty weights). Adaptive control of these tiers is essential for navigating the convex Pareto frontiers observed in empirical cost/quality curves.
5. Empirical Findings and Failure Modes
Experiments across LoCoMo, HotpotQA, LongMemEval, and large MLaaS traces consistently demonstrate that query-aware budget-tier routing substantially improves both cost and accuracy metrics. For example:
- BudgetMem variants outperform all prior memory-augmented agents in both F1 and Judge scores across datasets and yield superior accuracy-cost frontiers under tight and loose budgets (Zhang et al., 5 Feb 2026).
- OptiRoute achieves up to 60% cost savings with only a 6.6% accuracy drop at low budgets, with convex tradeoff curves (Piskala et al., 23 Feb 2025).
- R2-Router realizes state-of-the-art AUDC and QNC, performing at 4-5× lower cost than reactive baselines (Xue et al., 2 Feb 2026).
- PROTEUS policies satisfy stringent user accuracy floors dynamically while delivering major cost reductions in production-scale benchmarks (Bhatti et al., 27 Jan 2026).
A notable challenge is "routing collapse": as user budget increases, scalar-prediction routers tend to default toward the strongest model, neglecting cheaper models even when they would suffice. EquiRouter directly addresses this via a ranking-based loss aligned with decision structure, reducing cost by 17% at GPT-4-level performance versus prior state-of-the-art routers (Lai et al., 3 Feb 2026).
6. Practical Guidelines and Design Recommendations
- Supervised initialization and online bandit learning facilitate both accurate affinity estimation and continual adaptation to query distribution drift (Panda et al., 28 Aug 2025).
- Multi-axis tiering strategies (implementation, reasoning, capacity) allow finer control and efficiency across diverse budget and quality regimes (Zhang et al., 5 Feb 2026).
- Pairwise ranking losses, rather than scalar pointwise predictions, better align router outputs with the discrete select-within-budget decision, in particular mitigating degenerate routing collapse (Lai et al., 3 Feb 2026).
- Explicit cost modeling and tight integration of known operational parameters (token pricing, latency metrics) are critical for meaningful frontier tracing and tight budget adherence (Xue et al., 2 Feb 2026, Piskala et al., 23 Feb 2025, Panda et al., 28 Aug 2025).
- Calibration of tradeoff knobs (thresholds, dual parameters, tier choice) must be validated against real-world accuracy targets and cost constraints to guarantee SLA compliance in deployment (Bhatti et al., 27 Jan 2026, Ding et al., 28 Jun 2025, Piskala et al., 23 Feb 2025).
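The ranking-based alternative to scalar prediction can be sketched as a pairwise logistic loss over per-query model outcomes; this is a generic sketch of the idea behind ranking-trained routers such as EquiRouter, not that system's exact objective.

```python
import numpy as np

def pairwise_ranking_loss(scores: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean logistic loss over all ordered pairs where model i truly
    outperformed model j on this query: penalize the router when the
    worse model is scored above the better one."""
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if outcomes[i] > outcomes[j]:
                loss += np.log1p(np.exp(scores[j] - scores[i]))
                pairs += 1
    return loss / max(pairs, 1)
```

A router whose scores respect the true outcome ordering incurs lower loss than one that inverts it, which is the property that keeps selection meaningful across budget levels instead of collapsing to the strongest model.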
7. Extensions, Open Challenges, and Future Directions
Current systems predominantly rely on manually defined modules, discrete cost tiers, and offline human-labeled or LLM-judge supervision. Future work may emphasize:
- Automatic module/tier discovery via meta-learning or differentiable architecture search (Zhang et al., 5 Feb 2026).
- Finer-grained and continuous budget control, allowing routing policies to trace Pareto frontiers more smoothly.
- Joint optimization across retrieval, routing, and inference for a tighter envelope on performance-cost tradeoffs.
- Extension to out-of-distribution queries, dynamic model pools, and federated/multi-tenant settings for broader applicability (Li et al., 8 Jun 2025, Wu et al., 2 Sep 2025).
- Transparent user and operator interfaces that expose interpretable cost-quality tradeoffs and enable on-the-fly SLA specification (Piskala et al., 23 Feb 2025, Bhatti et al., 27 Jan 2026).
The overarching trend coalesces around query-aware, fine-grained, and explicitly budget-controlled routing systems grounded in principled online optimization, reinforcement learning, and modular system architecture. These advances jointly enable cost-effective and adaptive deployment of heterogeneous LLMs at scale.