GreenServ: Dynamic, Energy-Efficient LLM Routing
- GreenServ is a dynamic, context-aware routing framework that integrates query context generation, semantic clustering, and text complexity assessment for multi-model LLM inference.
- It employs a LinUCB-based multi-armed bandit algorithm to balance inference accuracy and energy consumption by adapting routing decisions based on observed metrics.
- Experimental evaluations demonstrate significant improvements, including a 22% accuracy boost and a 31% reduction in energy usage, validating its adaptive performance.
GreenServ is a dynamic, context-aware routing framework developed for energy-efficient and accurate inference in multi-model LLM systems. It addresses the inefficiencies of static, one-model-fits-all workflows by adapting inference routing to contextual query features and observed model-specific metrics, thereby optimizing the trade-off between accuracy and energy consumption (Ziller et al., 24 Jan 2026).
1. System Architecture and Workflow
GreenServ consists of three primary components: the Query Context Generator, Router Agent Trainer, and Online Deployment mechanism.
Query Context Generator decomposes each incoming query into three representations:
- Task Classifier extracts the instruction segment of the query, embeds it, and classifies the task type via logistic regression.
- Semantic Clusterer embeds the full query, assigns it to the nearest semantic cluster by cosine similarity with the cluster centroids, and updates the centroids online.
- Text Complexity Assessor computes the Flesch Reading Ease score and bins the text into complexity categories.
The context features are one-hot encoded into a fixed-length vector x_t of dimension d.
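As an illustration, the one-hot encoding step might look like the following sketch; the feature cardinalities (N_TASKS, N_CLUSTERS, N_BINS) are placeholder values, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical feature cardinalities; the paper's exact counts are not given here.
N_TASKS, N_CLUSTERS, N_BINS = 5, 8, 3

def one_hot(index: int, size: int) -> np.ndarray:
    """Return a one-hot vector with a 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def build_context(task_id: int, cluster_id: int, complexity_bin: int) -> np.ndarray:
    """Concatenate one-hot encodings of the three context features into x_t."""
    return np.concatenate([
        one_hot(task_id, N_TASKS),
        one_hot(cluster_id, N_CLUSTERS),
        one_hot(complexity_bin, N_BINS),
    ])

x_t = build_context(task_id=2, cluster_id=5, complexity_bin=1)
print(x_t.shape)  # d = N_TASKS + N_CLUSTERS + N_BINS = 16 under these assumptions
```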
Router Agent Trainer applies latency filtering to produce a feasible model set per query, M*_t = {m | L_m(q_t) ≤ L_max}. It then uses a contextual multi-armed bandit (LinUCB) to select the model maximizing the estimated reward.
Online Deployment executes the routing for each query via the following steps:
- Generate the context vector x_t
- Filter models by latency
- Select model using LinUCB
- Execute inference, record accuracy and energy usage
- Compute reward and update bandit parameters.
2. Contextual Multi-Armed Bandit Formulation
GreenServ leverages the LinUCB algorithm to route queries effectively:
- Arms: each candidate LLM in the model pool
- Reward Prediction: for each model m, the expected reward is estimated linearly as r̂_m = θ_mᵀ x_t, with θ_m = A_m⁻¹ b_m
- Action Selection: the model is chosen by maximizing estimated reward plus an exploration bonus: m_t = argmax_{m ∈ M*_t} [θ_mᵀ x_t + α · sqrt(x_tᵀ A_m⁻¹ x_t)]
- Reward Signal: after inference, r_t = (1 − λ)·Acc − λ·E
- Regret Minimization: the router minimizes cumulative regret, i.e., the summed gap between the reward of the best feasible model and the reward actually obtained
Here, Acc is the query accuracy in [0,1] and E is the cumulative energy consumption in Wh.
Online learning is driven by partial (bandit) feedback: only the reward of the chosen arm (model) is observed and used for updates. Zero-calibration allows rapid integration of new models, since a new arm simply starts from the uninformative prior (A_m = I_d, b_m = 0).
3. Metrics: Accuracy, Energy, and Latency
GreenServ explicitly measures both inference accuracy and direct hardware energy consumption.
- Accuracy (Acc): per-query normalized exact-match, BLEU, or ROUGE score, as appropriate for the task.
- Energy Consumption (E): GPU power integrated over the inference duration, measured directly via the NVIDIA Management Library (NVML) or the Zeus profiling tool at millisecond granularity.
- Latency (L_m): time from model launch to first token; queuing effects are excluded.
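A minimal sketch of the energy-integration step, assuming power samples polled at a fixed interval. In practice the samples would come from NVML (e.g. pynvml's `nvmlDeviceGetPowerUsage`, which reports milliwatts) or Zeus; only the trapezoidal integration is shown here.

```python
def energy_wh(power_samples_w, interval_s=0.001):
    """Integrate GPU power samples (watts), taken every interval_s seconds,
    into watt-hours via the trapezoidal rule."""
    joules = sum((a + b) / 2.0 * interval_s
                 for a, b in zip(power_samples_w, power_samples_w[1:]))
    return joules / 3600.0  # 1 Wh = 3600 J

# e.g. a constant 200 W draw over 9 one-second intervals is 1800 J = 0.5 Wh
print(energy_wh([200.0] * 10, interval_s=1.0))  # 0.5
```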
The reward function scalarizes these objectives, allowing explicit control over the accuracy-energy trade-off via the weighting parameter λ.
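The scalarization can be sketched as follows; the normalization constant e_max (dividing energy into [0,1] so it is commensurate with accuracy) is an assumption, as the paper's exact normalization is not reproduced here.

```python
def reward(acc: float, energy: float, lam: float = 0.5, e_max: float = 1.0) -> float:
    """Scalarized reward r = (1-λ)·Acc − λ·Ê.

    acc is in [0,1]; energy is normalized by the assumed constant e_max
    (e.g. the per-query energy of the largest model) and clipped to [0,1].
    λ = 0 optimizes accuracy only; λ = 1 optimizes energy only.
    """
    e_norm = min(energy / e_max, 1.0)
    return (1.0 - lam) * acc - lam * e_norm
```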
4. Experimental Evaluation and Results
GreenServ was benchmarked using five tasks (500 samples each): MMLU (QA), HellaSwag (commonsense), Winogrande (WSC), GSM8K (math reasoning), CNN/DailyMail (summarization). The model pool includes 16 open-access LLMs: Qwen2.5-{0.5,1.5,3,7,14}B, Mistral-7B, Gemma3-{1,4,12,27}B, Llama3-{1,3,8}B, Phi-4-mini-4B, Phi-4-14B, Yi-34B.
Baselines include random routing, static single-model selection, non-contextual and contextual ε-Greedy, and contextual Thompson Sampling.
Key findings (50 runs, fixed trade-off weight λ):
- GreenServ (LinUCB) vs. Random: +22% accuracy (≈0.65 vs. 0.53), –31% energy consumption (≈165 Wh vs. 240 Wh).
- Compared to the smallest model: +400% accuracy at the cost of +400% energy.
- Compared to the highest-accuracy static model: –75% energy for only –10% accuracy.
- Pareto front: GreenServ and contextual bandits are consistently at or above the static Pareto frontier.
- Regret after 2500 queries: LinUCB ≈412, Contextual TS ≈400, non-contextual ε-Greedy ≈466.
- RouterBench external validation (36K queries, 9 tasks): GreenServ (LinUCB) peak 75.7%, avg 71.7% accuracy, AIQ=0.607.
5. Policy Update and Online Adaptation
Online adaptation is achieved through Algorithm 1 (LinUCB-based router):
```text
Initialize A_m ← I_d, b_m ← 0   for all models m
for t = 1 … T do
    x_t ← GenerateContext(q_t)
    M*_t ← {m | L_m(q_t) ≤ L_max}
    for each m ∈ M*_t do
        θ_m ← A_m⁻¹ b_m
        bonus_m ← α · sqrt(x_tᵀ A_m⁻¹ x_t)
        score_m ← θ_mᵀ x_t + bonus_m
    end
    m_t ← argmax_{m ∈ M*_t} score_m
    response ← Inference(m_t, q_t)
    measure Acc, Energy from response
    r_t ← (1 − λ)·Acc − λ·Energy
    A_{m_t} ← A_{m_t} + x_t x_tᵀ
    b_{m_t} ← b_{m_t} + r_t · x_t
end
```
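A self-contained Python sketch of this router, for concreteness. Inference, metric measurement, and latency filtering are left to the caller (the `feasible` argument stands in for M*_t); this is an illustrative implementation of the standard LinUCB updates, not the paper's code.

```python
import numpy as np

class LinUCBRouter:
    """Minimal LinUCB router following Algorithm 1."""

    def __init__(self, n_models: int, d: int, alpha: float = 1.0):
        self.A = [np.eye(d) for _ in range(n_models)]     # A_m ← I_d
        self.b = [np.zeros(d) for _ in range(n_models)]   # b_m ← 0
        self.alpha = alpha

    def select(self, x: np.ndarray, feasible) -> int:
        """Pick the feasible arm maximizing θ_mᵀx plus the exploration bonus."""
        best, best_score = None, -np.inf
        for m in feasible:
            A_inv = np.linalg.inv(self.A[m])
            theta = A_inv @ self.b[m]
            score = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            if score > best_score:
                best, best_score = m, score
        return best

    def update(self, m: int, x: np.ndarray, r: float) -> None:
        """Rank-one update of the chosen arm's statistics only (partial feedback)."""
        self.A[m] += np.outer(x, x)
        self.b[m] += r * x
```

A per-query step is then `m = router.select(x_t, feasible); ...; router.update(m, x_t, r_t)`, where the reward r_t is computed from the measured accuracy and energy.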
6. Strengths, Limitations, and Areas for Future Study
Strengths:
- GreenServ utilizes multi-feature context (task, semantic cluster, text complexity) for fine-grained model-query matching.
- The LinUCB online learning framework adapts continually from observed rewards and allows immediate integration of new models without costly calibration.
- Direct, hardware-level measurement of energy consumption enhances metric fidelity.
- Routing incurs low computational overhead (≈7 ms per query) relative to inference time (40–200 ms).
Limitations:
- Assumes stationarity in reward; slow adaptation to concept drift may occur. Decay or recalibration methods could mitigate this.
- Evaluation focuses on tasks with objective ground truth; extension to open-ended generation will require alternative feedback mechanisms.
- Hardware profiling is necessary for generalizing to GPUs or TPUs beyond those tested.
- Feature engineering sensitivity: performance depends on choice and granularity of clusters, bins, and embedding models.
- Real-world deployment aspects such as concurrency, batching, queuing, and memory management require further examination.
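The decay mitigation mentioned above can be sketched as a small variation on the LinUCB update; the discount factor γ below is a hypothetical hyperparameter, not one specified by the paper.

```python
import numpy as np

def discounted_update(A, b, x, r, gamma=0.99):
    """Discounted LinUCB-style update that exponentially forgets old evidence.

    gamma < 1 down-weights past queries so estimates can track concept drift;
    gamma = 1 recovers the standard update A ← A + xxᵀ, b ← b + r·x.
    """
    A_new = gamma * A + np.outer(x, x)
    b_new = gamma * b + r * x
    return A_new, b_new
```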
This suggests that contextual, bandit-based routing mechanisms such as GreenServ enable near-Pareto-optimal trade-offs in multi-model LLM inference scenarios, achieving strong accuracy and energy efficiency without substantial overhead or calibration requirements (Ziller et al., 24 Jan 2026).