
GreenServ: Dynamic, Energy-Efficient LLM Routing

Updated 31 January 2026
  • GreenServ is a dynamic, context-aware routing framework that integrates query context generation, semantic clustering, and text complexity assessment for multi-model LLM inference.
  • It employs a LinUCB-based multi-armed bandit algorithm to balance inference accuracy and energy consumption by adapting routing decisions based on observed metrics.
  • Experimental evaluations demonstrate significant improvements, including a 22% accuracy boost and a 31% reduction in energy usage, validating its adaptive performance.

GreenServ is a dynamic, context-aware routing framework developed for energy-efficient and accurate inference in multi-model LLM systems. It addresses the inefficiencies of static, one-model-fits-all workflows by adapting inference routing to contextual query features and observed model-specific metrics, thereby optimizing the trade-off between accuracy and energy consumption (Ziller et al., 24 Jan 2026).

1. System Architecture and Workflow

GreenServ consists of three primary components: the Query Context Generator, Router Agent Trainer, and Online Deployment mechanism.

Query Context Generator decomposes each incoming query q_t into three representations:

  • Task Classifier extracts the “instruction” q_{\text{instr},t}, embeds it as e_{\text{instr},t}, and classifies the task type via logistic regression:

p(l \mid e_{\text{instr},t}) = \sigma(W\,e_{\text{instr},t} + b), \quad l_t \in \{1,\dots,N_{\mathrm{tasks}}\}

  • Semantic Clusterer embeds the full query as e_{\mathrm{full},t}, assigns it to a semantic cluster c_t by cosine similarity with the cluster centroids \mu_c, and updates the centroids online:

c_t = \arg\max_{c} \frac{e_{\mathrm{full},t}\cdot\mu_c}{\|e_{\mathrm{full},t}\|\,\|\mu_c\|}; \quad \mu_{c_t} \leftarrow \mu_{c_t} + \frac{1}{N_{c_t}+1}\bigl(e_{\mathrm{full},t}-\mu_{c_t}\bigr)

  • Text Complexity Assessor calculates the Flesch Reading Ease score and bins text into complexity categories:

p(q_t) = 206.835 - 1.015\,\frac{\text{Words}_t}{\text{Sentences}_t} - 84.6\,\frac{\text{Syllables}_t}{\text{Words}_t}

The context features are one-hot encoded into a fixed-length vector:

x_t = [\mathrm{onehot}(l_t),\,\mathrm{onehot}(c_t),\,\mathrm{onehot}(p_t),\,1] \in \mathbb{R}^d,

where d = N_{\mathrm{tasks}} + K + N_{\mathrm{bins}} + 1.
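As a rough illustration, the three context signals can be assembled into x_t as follows. This is a minimal sketch: the regex-based syllable counter and the bin edges are simplifications introduced here, not the paper's implementation.

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    # Crude syllable heuristic: vowel groups per word, at least one each (an approximation).
    n_syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

def one_hot(index: int, size: int) -> list:
    v = [0.0] * size
    v[index] = 1.0
    return v

def context_vector(task_id, cluster_id, flesch, n_tasks, n_clusters, bin_edges):
    """Concatenate onehot(l_t), onehot(c_t), onehot(p_t) and a bias term,
    giving d = N_tasks + K + N_bins + 1 features."""
    # Bin the Flesch score into a complexity category p_t (N_bins = len(bin_edges) + 1).
    p_t = sum(flesch >= edge for edge in bin_edges)
    return (one_hot(task_id, n_tasks) + one_hot(cluster_id, n_clusters)
            + one_hot(p_t, len(bin_edges) + 1) + [1.0])
```

With N_tasks = 5, K = 4 clusters, and two bin edges (three bins), the vector has d = 5 + 4 + 3 + 1 = 13 entries.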

Router Agent Trainer applies latency filtering to produce a feasible model set per query:

M_t^* = \{\,m : L_m(q_t) \leq L_{\max,t}\,\}

It then uses a contextual multi-armed bandit (LinUCB) to select the model maximizing estimated reward.
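In code, the feasibility filter is a one-liner; the per-model latency estimates (a plain dict here, standing in for whatever predictor the system actually uses, with hypothetical model names) are compared against the query's budget:

```python
def feasible_models(latency_est: dict, l_max: float) -> list:
    """Return M*_t: the models whose estimated latency meets the per-query budget L_max,t."""
    return [m for m in latency_est if latency_est[m] <= l_max]
```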

Online Deployment executes the routing for each query via the following steps:

  1. Generate the context vector x_t
  2. Filter models by latency
  3. Select model m_t using LinUCB
  4. Execute inference; record accuracy and energy usage
  5. Compute the reward and update the bandit parameters

2. Contextual Multi-Armed Bandit Formulation

GreenServ leverages the LinUCB algorithm to route queries effectively:

  • Arms: Each candidate LLM m \in M
  • Reward Prediction: For each model,

\hat{r}_m(x_t) = \theta_m^\top x_t, \quad \theta_m = A_m^{-1} b_m

  • Action Selection: Model m_t is chosen by maximizing the estimated reward plus an exploration bonus:

m_t = \arg\max_{m\in M_t^*} \left[\theta_m^\top x_t + \alpha \sqrt{x_t^\top A_m^{-1} x_t}\right]

  • Reward Signal: After inference,

r_t(m, q_t) = (1-\lambda)\,\mathrm{Acc}_m(q_t) - \lambda\,C_m(q_t), \quad \lambda \in [0,1]

  • Regret Minimization:

\mathrm{Regret}(T) = \sum_{t=1}^T \bigl[r_t(m_t^*, q_t) - r_t(m_t, q_t)\bigr]

Here, \mathrm{Acc}_m(q_t) is the per-query accuracy in [0,1], C_m(q_t) is the cumulative energy in Wh, and m_t^* denotes the reward-optimal model for q_t.

Online learning is driven by partial feedback: only the reward of the chosen arm (model) is observed and used to update its statistics. Zero-calibration allows new models to be integrated rapidly as fresh arms.

3. Metrics: Accuracy, Energy, and Latency

GreenServ explicitly measures both inference accuracy and direct hardware energy consumption.

  • Accuracy (\mathrm{Acc}_m(q_t)): For each query, the normalized exact-match, BLEU, or ROUGE score, as appropriate for the task.
  • Energy Consumption (C_m(q_t), in Wh): GPU power integrated over the inference duration, measured directly via the NVIDIA Management Library (NVML) or the Zeus profiling tool at millisecond granularity.
  • Latency (L_m(q_t)): Time from model launch to first token; queuing effects are excluded.
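The energy figure is the time integral of sampled power. A minimal sketch of that integration step, assuming a trace of (timestamp, milliwatt) samples such as those pynvml's nvmlDeviceGetPowerUsage reports:

```python
def energy_wh(samples) -> float:
    """Integrate (timestamp_s, power_mw) samples into energy in Wh.

    Trapezoidal rule: joules = integral of watts over seconds, and 1 Wh = 3600 J.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) / 1000.0 * (t1 - t0)  # mW -> W, times seconds
    return joules / 3600.0
```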

The reward function scalarizes these objectives, allowing explicit control over the accuracy-energy trade-off via the parameter \lambda.
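A direct transcription of that scalarization, with an explicit energy-normalization constant added here as an assumption (raw Wh and [0,1] accuracy live on different scales):

```python
def reward(acc: float, energy_wh: float, lam: float = 0.4, energy_scale: float = 1.0) -> float:
    """r = (1 - lam) * Acc - lam * C.  lam = 0 optimizes accuracy only; lam = 1, energy only."""
    return (1.0 - lam) * acc - lam * (energy_wh / energy_scale)
```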

4. Experimental Evaluation and Results

GreenServ was benchmarked using five tasks (500 samples each): MMLU (QA), HellaSwag (commonsense), Winogrande (WSC), GSM8K (math reasoning), CNN/DailyMail (summarization). The model pool includes 16 open-access LLMs: Qwen2.5-{0.5,1.5,3,7,14}B, Mistral-7B, Gemma3-{1,4,12,27}B, Llama3-{1,3,8}B, Phi-4-mini-4B, Phi-4-14B, Yi-34B.

Baselines include random routing, static single-model selection, non-contextual and contextual ε-Greedy, and contextual Thompson Sampling.

Key findings (at \lambda = 0.4, 50 runs):

  • GreenServ (LinUCB) vs. Random: +22% accuracy (≈0.65 vs. 0.53), –31% energy consumption (≈165 Wh vs. 240 Wh).
  • Compared to the smallest model: +400% accuracy at +400% energy.
  • Compared to highest-accuracy static: –75% energy for only –10% accuracy.
  • Pareto front: GreenServ and contextual bandits are consistently at or above the static Pareto frontier.
  • Regret after 2500 queries: LinUCB ≈412, Contextual TS ≈400, non-contextual ε-Greedy ≈466.
  • RouterBench external validation (36K queries, 9 tasks): GreenServ (LinUCB) peak 75.7%, avg 71.7% accuracy, AIQ=0.607.

5. Policy Update and Online Adaptation

Online adaptation is achieved through Algorithm 1 (LinUCB-based router):

Initialize A_m ← I_d, b_m ← 0 (for all models m)
for t = 1 to T do
  x_t ← GenerateContext(q_t)
  M*_t ← {m | L_m(q_t) ≤ L_max}
  for each m in M*_t do
    θ_m ← A_m⁻¹ b_m
    bonus_m ← α · sqrt(x_tᵀ A_m⁻¹ x_t)
    score_m ← θ_mᵀ x_t + bonus_m
  end
  m_t ← argmax_{m ∈ M*_t} score_m
  response ← Inference(m_t, q_t)
  measure Acc, Energy from response
  r_t ← (1 − λ)·Acc − λ·Energy
  A_{m_t} ← A_{m_t} + x_t x_tᵀ
  b_{m_t} ← b_{m_t} + r_t x_t
end
The policy explores and exploits via the bonus term, with partial feedback and zero-calibration. New models are integrated seamlessly as new arms with appropriate uncertainty quantification.
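A compact Python rendering of Algorithm 1 follows. It is a sketch under the obvious assumptions: model names are placeholder strings, and accuracy/energy measurements arrive from outside; only the bandit arithmetic mirrors the pseudocode.

```python
import numpy as np

class LinUCBRouter:
    def __init__(self, models, d, alpha=1.0, lam=0.4):
        self.alpha, self.lam = alpha, lam
        self.A = {m: np.eye(d) for m in models}    # A_m <- I_d
        self.b = {m: np.zeros(d) for m in models}  # b_m <- 0

    def add_model(self, m, d):
        """Zero-calibration: a new arm starts at the prior, i.e. maximal uncertainty."""
        self.A[m], self.b[m] = np.eye(d), np.zeros(d)

    def select(self, x, feasible):
        """Pick the argmax of estimated reward plus the UCB exploration bonus."""
        def score(m):
            A_inv = np.linalg.inv(self.A[m])
            theta = A_inv @ self.b[m]              # theta_m = A_m^{-1} b_m
            return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
        return max(feasible, key=score)

    def update(self, m, x, acc, energy):
        """Partial feedback: only the chosen arm's statistics are updated."""
        r = (1.0 - self.lam) * acc - self.lam * energy
        self.A[m] += np.outer(x, x)
        self.b[m] += r * x
        return r
```

For production use one would maintain A_m⁻¹ incrementally (e.g. via Sherman–Morrison) rather than inverting per query; the inversion here keeps the sketch close to the pseudocode.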

6. Strengths, Limitations, and Areas for Future Study

Strengths:

  • GreenServ utilizes multi-feature context (task, semantic cluster, text complexity) for fine-grained model-query matching.
  • The LinUCB online learning framework is robust to nonstationarity and allows immediate integration of new models without costly calibration.
  • Direct, hardware-level measurement of energy consumption enhances metric fidelity.
  • Routing incurs low computational overhead (≈7 ms per query) relative to inference time (40–200 ms).

Limitations:

  • Assumes stationarity in reward; slow adaptation to concept drift may occur. Decay or recalibration methods could mitigate this.
  • Evaluation focuses on tasks with objective ground truth; extension to open-ended generation will require alternative feedback mechanisms.
  • Hardware profiling is necessary for generalizing to GPUs or TPUs beyond those tested.
  • Feature engineering sensitivity: performance depends on choice and granularity of clusters, bins, and embedding models.
  • Real-world deployment aspects such as concurrency, batching, queuing, and memory management require further examination.

This suggests that contextual, bandit-based routing mechanisms such as GreenServ enable near-Pareto-optimal trade-offs in multi-model LLM inference scenarios, achieving strong accuracy and energy efficiency without substantial overhead or calibration requirements (Ziller et al., 24 Jan 2026).
