
GreenServ: Dynamic, Energy-Efficient LLM Routing

Updated 31 January 2026
  • GreenServ is a dynamic, context-aware routing framework that integrates query context generation, semantic clustering, and text complexity assessment for multi-model LLM inference.
  • It employs a LinUCB-based multi-armed bandit algorithm to balance inference accuracy and energy consumption by adapting routing decisions based on observed metrics.
  • Experimental evaluations demonstrate significant improvements, including a 22% accuracy boost and a 31% reduction in energy usage, validating its adaptive performance.

GreenServ is a dynamic, context-aware routing framework developed for energy-efficient and accurate inference in multi-model LLM systems. It addresses the inefficiencies of static, one-model-fits-all workflows by adapting inference routing to contextual query features and observed model-specific metrics, thereby optimizing the trade-off between accuracy and energy consumption (Ziller et al., 24 Jan 2026).

1. System Architecture and Workflow

GreenServ consists of three primary components: the Query Context Generator, Router Agent Trainer, and Online Deployment mechanism.

Query Context Generator decomposes each incoming query q_t into three representations:

  • Task Classifier extracts the “instruction” q_{\text{instr},t}, embeds it as e_{\text{instr},t}, and classifies the task type via logistic regression:

p(l \mid e_{\text{instr},t}) = \sigma(W\,e_{\text{instr},t} + b), \quad l_t \in \{1,\dots,N_{\mathrm{tasks}}\}

  • Semantic Clusterer embeds the full query as e_{\mathrm{full},t}, assigns it to a semantic cluster c_t by cosine similarity with the cluster centroids \mu_c, and updates the centroids online:

c_t = \arg\max_{c} \frac{e_{\mathrm{full},t}\cdot\mu_c}{\|e_{\mathrm{full},t}\|\,\|\mu_c\|}; \quad \mu_{c_t} \leftarrow \mu_{c_t} + \frac{1}{N_{c_t}+1}\bigl(e_{\mathrm{full},t}-\mu_{c_t}\bigr)

  • Text Complexity Assessor calculates the Flesch Reading Ease score and bins text into complexity categories:

p(q_t) = 206.835 - 1.015\,\frac{\text{Words}_t}{\text{Sentences}_t} - 84.6\,\frac{\text{Syllables}_t}{\text{Words}_t}

The context features are one-hot encoded into a fixed-length vector:

x_t = [\mathrm{onehot}(l_t),\,\mathrm{onehot}(c_t),\,\mathrm{onehot}(p_t),\,1] \in \mathbb{R}^d,

where d = N_{\mathrm{tasks}} + K + N_{\mathrm{bins}} + 1.
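As a rough illustration, the three context signals can be assembled into x_t as follows. This is a minimal sketch: the regex-based syllable counter and the bin edges are simplifications introduced here, not the paper's implementation.

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    # Crude syllable heuristic: vowel groups per word, at least one each (an approximation).
    n_syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

def one_hot(index: int, size: int) -> list:
    v = [0.0] * size
    v[index] = 1.0
    return v

def context_vector(task_id, cluster_id, flesch, n_tasks, n_clusters, bin_edges):
    """Concatenate onehot(l_t), onehot(c_t), onehot(p_t) and a bias term,
    giving d = N_tasks + K + N_bins + 1 features."""
    # Bin the Flesch score into a complexity category p_t (N_bins = len(bin_edges) + 1).
    p_t = sum(flesch >= edge for edge in bin_edges)
    return (one_hot(task_id, n_tasks) + one_hot(cluster_id, n_clusters)
            + one_hot(p_t, len(bin_edges) + 1) + [1.0])
```

With N_tasks = 5, K = 4 clusters, and two bin edges (three bins), the vector has d = 5 + 4 + 3 + 1 = 13 entries.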

Router Agent Trainer applies latency filtering to produce a feasible model set per query:

M_t^* = \{\,m : L_m(q_t) \leq L_{\max,t}\,\}

It then uses a contextual multi-armed bandit (LinUCB) to select the model maximizing estimated reward.
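In code, the feasibility filter is a one-liner; the per-model latency estimates (a plain dict here, standing in for whatever predictor the system actually uses, with hypothetical model names) are compared against the query's budget:

```python
def feasible_models(latency_est: dict, l_max: float) -> list:
    """Return M*_t: the models whose estimated latency meets the per-query budget L_max,t."""
    return [m for m in latency_est if latency_est[m] <= l_max]
```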

Online Deployment executes the routing for each query via the following steps:

  1. Generate the context vector x_t
  2. Filter models by latency
  3. Select model m_t using LinUCB
  4. Execute inference; record accuracy and energy usage
  5. Compute the reward and update the bandit parameters

2. Contextual Multi-Armed Bandit Formulation

GreenServ leverages the LinUCB algorithm to route queries effectively:

  • Arms: Each candidate LLM m \in M
  • Reward Prediction: For each model,

\hat{r}_m(x_t) = \theta_m^\top x_t, \quad \theta_m = A_m^{-1} b_m

  • Action Selection: Model m_t is chosen by maximizing the estimated reward plus an exploration bonus:

m_t = \arg\max_{m\in M_t^*} \left[\theta_m^\top x_t + \alpha \sqrt{x_t^\top A_m^{-1} x_t}\right]

  • Reward Signal: After inference,

r_t(m, q_t) = (1-\lambda)\,\mathrm{Acc}_m(q_t) - \lambda\,C_m(q_t), \quad \lambda \in [0,1]

  • Regret Minimization:

\mathrm{Regret}(T) = \sum_{t=1}^T \bigl[r_t(m_t^*, q_t) - r_t(m_t, q_t)\bigr]

Here, \mathrm{Acc}_m(q_t) is the per-query accuracy in [0,1], C_m(q_t) is the cumulative energy in Wh, and m_t^* denotes the reward-optimal model for q_t.

Online learning is driven by partial feedback: only the reward of the chosen arm (model) is observed and used to update its statistics. Zero-calibration allows new models to be integrated rapidly as fresh arms.

3. Metrics: Accuracy, Energy, and Latency

GreenServ explicitly measures both inference accuracy and direct hardware energy consumption.

  • Accuracy (\mathrm{Acc}_m(q_t)): For each query, the normalized exact-match, BLEU, or ROUGE score, as appropriate for the task.
  • Energy Consumption (C_m(q_t), in Wh): GPU power integrated over the inference duration, measured directly via the NVIDIA Management Library (NVML) or the Zeus profiling tool at millisecond granularity.
  • Latency (L_m(q_t)): Time from model launch to first token; queuing effects are excluded.
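The energy figure is the time integral of sampled power. A minimal sketch of that integration step, assuming a trace of (timestamp, milliwatt) samples such as those pynvml's nvmlDeviceGetPowerUsage reports:

```python
def energy_wh(samples) -> float:
    """Integrate (timestamp_s, power_mw) samples into energy in Wh.

    Trapezoidal rule: joules = integral of watts over seconds, and 1 Wh = 3600 J.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) / 1000.0 * (t1 - t0)  # mW -> W, times seconds
    return joules / 3600.0
```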

The reward function scalarizes these objectives, allowing explicit control over the accuracy-energy trade-off via the parameter \lambda.
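A direct transcription of that scalarization, with an explicit energy-normalization constant added here as an assumption (raw Wh and [0,1] accuracy live on different scales):

```python
def reward(acc: float, energy_wh: float, lam: float = 0.4, energy_scale: float = 1.0) -> float:
    """r = (1 - lam) * Acc - lam * C.  lam = 0 optimizes accuracy only; lam = 1, energy only."""
    return (1.0 - lam) * acc - lam * (energy_wh / energy_scale)
```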

4. Experimental Evaluation and Results

GreenServ was benchmarked using five tasks (500 samples each): MMLU (QA), HellaSwag (commonsense), Winogrande (WSC), GSM8K (math reasoning), CNN/DailyMail (summarization). The model pool includes 16 open-access LLMs: Qwen2.5-{0.5,1.5,3,7,14}B, Mistral-7B, Gemma3-{1,4,12,27}B, Llama3-{1,3,8}B, Phi-4-mini-4B, Phi-4-14B, Yi-34B.

Baselines include random routing, static single-model selection, non-contextual and contextual ε-Greedy, and contextual Thompson Sampling.

Key findings (at \lambda = 0.4, 50 runs):

  • GreenServ (LinUCB) vs. Random: +22% accuracy (≈0.65 vs. 0.53), –31% energy consumption (≈165 Wh vs. 240 Wh).
  • Compared to the smallest model: +400% accuracy at +400% energy.
  • Compared to highest-accuracy static: –75% energy for only –10% accuracy.
  • Pareto front: GreenServ and contextual bandits are consistently at or above the static Pareto frontier.
  • Regret after 2500 queries: LinUCB ≈412, Contextual TS ≈400, non-contextual ε-Greedy ≈466.
  • RouterBench external validation (36K queries, 9 tasks): GreenServ (LinUCB) peak 75.7%, avg 71.7% accuracy, AIQ=0.607.

5. Policy Update and Online Adaptation

Online adaptation is achieved through Algorithm 1 (LinUCB-based router):

Initialize A_m ← I_d, b_m ← 0 (for all models m)
for t = 1 to T do
  x_t ← GenerateContext(q_t)
  M*_t ← {m | L_m(q_t) ≤ L_max}
  for each m in M*_t do
    θ_m ← A_m⁻¹ b_m
    bonus_m ← α · sqrt(x_tᵀ A_m⁻¹ x_t)
    score_m ← θ_mᵀ x_t + bonus_m
  end
  m_t ← argmax_{m ∈ M*_t} score_m
  response ← Inference(m_t, q_t)
  measure Acc, Energy from response
  r_t ← (1 − λ)·Acc − λ·Energy
  A_{m_t} ← A_{m_t} + x_t x_tᵀ
  b_{m_t} ← b_{m_t} + r_t x_t
end
The policy explores and exploits via the bonus term, with partial feedback and zero-calibration. New models are integrated seamlessly as new arms with appropriate uncertainty quantification.
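A compact Python rendering of Algorithm 1 follows. It is a sketch under the obvious assumptions: model names are placeholder strings, and accuracy/energy measurements arrive from outside; only the bandit arithmetic mirrors the pseudocode.

```python
import numpy as np

class LinUCBRouter:
    def __init__(self, models, d, alpha=1.0, lam=0.4):
        self.alpha, self.lam = alpha, lam
        self.A = {m: np.eye(d) for m in models}    # A_m <- I_d
        self.b = {m: np.zeros(d) for m in models}  # b_m <- 0

    def add_model(self, m, d):
        """Zero-calibration: a new arm starts at the prior, i.e. maximal uncertainty."""
        self.A[m], self.b[m] = np.eye(d), np.zeros(d)

    def select(self, x, feasible):
        """Pick the argmax of estimated reward plus the UCB exploration bonus."""
        def score(m):
            A_inv = np.linalg.inv(self.A[m])
            theta = A_inv @ self.b[m]              # theta_m = A_m^{-1} b_m
            return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
        return max(feasible, key=score)

    def update(self, m, x, acc, energy):
        """Partial feedback: only the chosen arm's statistics are updated."""
        r = (1.0 - self.lam) * acc - self.lam * energy
        self.A[m] += np.outer(x, x)
        self.b[m] += r * x
        return r
```

For production use one would maintain A_m⁻¹ incrementally (e.g. via Sherman–Morrison) rather than inverting per query; the inversion here keeps the sketch close to the pseudocode.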

6. Strengths, Limitations, and Areas for Future Study

Strengths:

  • GreenServ utilizes multi-feature context (task, semantic cluster, text complexity) for fine-grained model-query matching.
  • The LinUCB online learning framework is robust to nonstationarity and allows immediate integration of new models without costly calibration.
  • Direct, hardware-level measurement of energy consumption enhances metric fidelity.
  • Routing incurs low computational overhead (≈7 ms per query) relative to inference time (40–200 ms).

Limitations:

  • Assumes stationarity in reward; slow adaptation to concept drift may occur. Decay or recalibration methods could mitigate this.
  • Evaluation focuses on tasks with objective ground truth; extension to open-ended generation will require alternative feedback mechanisms.
  • Hardware profiling is necessary for generalizing to GPUs or TPUs beyond those tested.
  • Feature engineering sensitivity: performance depends on choice and granularity of clusters, bins, and embedding models.
  • Real-world deployment aspects such as concurrency, batching, queuing, and memory management require further examination.

This suggests that contextual, bandit-based routing mechanisms such as GreenServ enable near-Pareto-optimal trade-offs in multi-model LLM inference scenarios, achieving strong accuracy and energy efficiency without substantial overhead or calibration requirements (Ziller et al., 24 Jan 2026).
