ThunderAgent: Program-Aware LLM Orchestration
- ThunderAgent is a program-aware inference system that unifies multi-turn LLM dialogues and external tool invocations through the Agentic Program abstraction, addressing inefficiencies like KV-cache thrashing and uneven HBM usage.
- It leverages a specialized scheduler that minimizes recompute, unused memory, and caching overhead by coordinating LLM and tool resource lifecycles via dynamic pause and restore operations.
- Benchmark results show significant improvements, with serving throughput gains of 1.48–3.58× and disk savings up to 4.2× over conventional orchestration systems.
ThunderAgent is a program-aware agentic inference system designed to orchestrate LLM-centric workflows involving both multi-turn dialogue and external tool invocations. Unlike conventional systems that schedule each request in isolation using disparate orchestration (e.g., vLLM for LLM inference and Kubernetes for tools), ThunderAgent elevates the abstraction unit to an "Agentic Program." This unified approach addresses inefficiencies in GPU memory (HBM) usage, key-value (KV) cache management, and tool environment lifecycles, resulting in measurable improvements in throughput and resource utilization (Kang et al., 14 Feb 2026).
1. Motivation and Core Challenges
Contemporary agentic inference pipelines—where LLMs alternate between reasoning steps and tool executions—suffer from architectural fragmentation. Key systemic inefficiencies arise from:
- KV-Cache Thrashing: When a workflow transitions from LLM inference to a tool call, the program’s KV-cache remains resident in HBM and is frequently evicted by incoming requests. On tool return, the necessary context must be recomputed, with latency inflation of up to 7.1×.
- Cross-Node Memory Imbalance: Pinning each agentic workflow to a single data-parallel (DP) node maximizes local reuse but skews HBM utilization: some nodes become hotspots while others sit underutilized, degrading cluster efficiency.
- Tool Lifecycle Obliviousness: Independent orchestration of tool environments causes resource leaks (unused sandboxes, Docker images, and ports) and blocking due to serial environment initialization.
ThunderAgent’s architectural innovation is to treat the multi-turn agentic workflow (not the single request) as the unit of scheduling, thus enabling coordinated management of both LLM and tool resources.
2. Program Abstraction: Agentic Programs
ThunderAgent formally models each agentic workflow as an "Agentic Program," defined as the tuple

$P = (\mathrm{id},\ \ell,\ \mathcal{E},\ n,\ \phi,\ s)$

where:
- $\mathrm{id}$: unique program identifier
- $\ell$: context length, determining the required KV-cache size
- $\mathcal{E}$: set of required tool environments (e.g., disk images, network ports)
- $n$: assigned backend node, or $\bot$ if paused
- $\phi$: phase, either Reasoning ($\phi = R$) or Acting ($\phi = A$, an in-flight tool call)
- $s$: status (e.g., running, paused, or terminated)
The program’s lifecycle is governed by state transitions:
- $\mathrm{Pause}(P)$: unbinds $P$ from its node, evicts its KV-cache, and sets $n \leftarrow \bot$, $s \leftarrow \mathrm{paused}$
- $\mathrm{Restore}(P, n')$: assigns $P$ to an available node $n'$, reloads its KV-cache, and sets $s \leftarrow \mathrm{running}$
This abstraction provides a unified surface for resource managers and schedulers across both LLM and tool components.
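As a concrete sketch, the program tuple and its pause/restore transitions might be modeled as below. Field and method names are illustrative, not ThunderAgent's actual API, and KV-cache eviction/reload are elided:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set

class Phase(Enum):
    REASONING = "R"   # LLM inference step
    ACTING = "A"      # in-flight tool call

@dataclass
class AgenticProgram:
    pid: str                       # unique program identifier
    ctx_len: int                   # context length -> required KV-cache size
    tool_envs: Set[str] = field(default_factory=set)  # required tool environments
    node: Optional[int] = None     # assigned backend node; None encodes "paused"
    phase: Phase = Phase.REASONING
    status: str = "running"

    def pause(self) -> None:
        """Unbind from the node and mark paused (KV-cache eviction elided)."""
        self.node = None
        self.status = "paused"

    def restore(self, node: int) -> None:
        """Bind to an available node and mark running (KV-cache reload elided)."""
        self.node = node
        self.status = "running"
```

Both LLM-side and tool-side managers can then operate on this one object, which is the unified surface the abstraction provides.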
3. Program-Aware Scheduling and Cost Model
Space–Time Product (STP) Decomposition
ThunderAgent introduces a scheduler to maximize throughput under HBM constraints via minimizing three GPU resource overheads:
- $O_{\mathrm{recompute}}$: arises from re-populating the KV-cache after eviction.
- $O_{\mathrm{idle}}$: due to idle memory from uneven load distribution.
- $O_{\mathrm{hold}}$: from holding the KV-cache of inactive programs awaiting tool return.

Goal: minimize the total overhead $O_{\mathrm{recompute}} + O_{\mathrm{idle}} + O_{\mathrm{hold}}$ subject to the capacity constraint $\sum_i \ell_i \le C_{\mathrm{total}}$.
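To make the space–time accounting concrete, here is a toy sketch. The per-term formulas and the numbers are illustrative assumptions (each overhead taken as memory × duration in GB·s), not the paper's exact cost model:

```python
def stp_overheads(recomputed_gb, recompute_s, idle_gb, idle_s, held_gb, held_s):
    """Each overhead is a space-time product: GB occupied (or wasted) x seconds."""
    o_recompute = recomputed_gb * recompute_s  # re-populating evicted KV-cache
    o_idle = idle_gb * idle_s                  # HBM left unused by load skew
    o_hold = held_gb * held_s                  # KV-cache pinned while a tool runs
    return o_recompute + o_idle + o_hold

# Holding a 12 GB cache through a 30 s tool call dominates the total here,
# which is why pausing acting-phase programs pays off.
total = stp_overheads(12, 4, 8, 10, 12, 30)  # 48 + 80 + 360 = 488 GB*s
```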
Periodic Thrashing Detection and Time-Decay
At every interval $\Delta t$, each DP backend evaluates its aggregate memory use $U = \sum_i \ell_i$. For acting-phase ($\phi = A$) programs, it decays the accounted HBM footprint (e.g., $\tilde{\ell}_i \leftarrow \gamma \cdot \tilde{\ell}_i$ with decay factor $\gamma \in (0,1)$), gradually freeing capacity for high-throughput operation.
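One simple realization of the time-decay, assuming a geometric schedule (the source specifies only that acting-phase footprints decay over time, not this exact form):

```python
def decayed_footprint(ctx_len, intervals_elapsed, gamma=0.9):
    """Accounted HBM footprint of an acting-phase program shrinks geometrically
    with each scheduler interval, freeing budget for actively decoding programs."""
    return ctx_len * (gamma ** intervals_elapsed)
```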
Optimal Eviction and Load Balancing
The eviction policy minimizes recompute cost by removing programs with the smallest context length $\ell$ first (shown optimal via a convexity argument). Load imbalance is mitigated by a global paused-program queue: when a node has spare capacity, it pulls the highest-priority paused program, bounding idle HBM by the smallest paused context.
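A minimal sketch of smallest-context-first eviction (the helper and field names are illustrative):

```python
from collections import namedtuple

Prog = namedtuple("Prog", "pid ctx_len")

def select_for_eviction(programs, reclaim_needed):
    """Greedy: evict smallest-context programs first until the deficit is covered.
    A smaller context means a cheaper prefill on restore, minimizing recompute."""
    victims, reclaimed = [], 0
    for prog in sorted(programs, key=lambda p: p.ctx_len):
        if reclaimed >= reclaim_needed:
            break
        victims.append(prog.pid)
        reclaimed += prog.ctx_len
    return victims, reclaimed
```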
Scheduler Algorithm Overview

```
while running:
    sleep(Δt)
    U ← aggregate memory use
    if U > C_total:
        ΔC ← U − C_total
        select smallest-ℓ programs for eviction until reclaim ≥ ΔC
        pause each selected program; push it to the global queue
    else if U < C_total and queue nonempty:
        pop program from global queue; restore it to this node
```
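The loop above can be sketched as runnable Python for a single node; queue priorities, KV-cache I/O, and the Δt sleep are elided, and the names are illustrative:

```python
import heapq

def schedule_tick(programs, paused_heap, c_total):
    """One scheduler iteration: evict on overflow, restore on slack.
    programs: dict pid -> ctx_len of resident programs.
    paused_heap: min-heap of (priority, pid, ctx_len) for paused programs."""
    u = sum(programs.values())
    if u > c_total:
        deficit = u - c_total
        reclaimed = 0
        for pid in sorted(programs, key=programs.get):  # smallest context first
            if reclaimed >= deficit:
                break
            reclaimed += programs[pid]
            # priority 0 is a placeholder; a real scheduler would score programs
            heapq.heappush(paused_heap, (0, pid, programs.pop(pid)))
    elif paused_heap:
        prio, pid, ctx = heapq.heappop(paused_heap)
        if u + ctx <= c_total:
            programs[pid] = ctx          # restore: rebind and reload
        else:
            heapq.heappush(paused_heap, (prio, pid, ctx))  # still no room
    return programs, paused_heap
```

One tick over programs `{"a": 8, "b": 2, "c": 5}` with capacity 10 pauses `b` and `c` (smallest contexts), and a subsequent tick restores one of them once there is slack.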
4. Tool Resource Management
Each program tracks $\mathcal{E}$, its required set of tool environments. The Tool Resource Manager coordinates:
- Environment Lifecycle: ensures $\mathcal{E}$ is initialized upon activation, or prefetched asynchronously for high-priority restores. On termination, all associated disk/port resources are reclaimed via lifecycle hooks.
- Asynchronous Preparation: for a paused program with a high restore score (i.e., near the head of the global queue), $\mathcal{E}$ is set up in the background, overlapping setup I/O with GPU computation.
This approach avoids linear disk and port accumulation and mitigates tool environment initialization as a throughput bottleneck.
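A hedged sketch of asynchronous environment preparation with a bounded worker pool (the `prepare_env` stub and environment names are invented for illustration; real setup would pull images and bind ports):

```python
from concurrent.futures import ThreadPoolExecutor

# A bounded pool caps concurrent environment setups to limit resource contention.
_prefetch_pool = ThreadPoolExecutor(max_workers=4)

def prepare_env(env_name: str) -> str:
    """Stand-in for I/O-bound setup (image pull, port binding, sandbox boot)."""
    return f"{env_name}:ready"

def prefetch_envs(tool_envs):
    """Kick off setup for every required environment so the I/O overlaps GPU
    computation; results are collected only when the program actually restores."""
    futures = {env: _prefetch_pool.submit(prepare_env, env) for env in tool_envs}
    return {env: fut.result() for env, fut in futures.items()}
```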
5. System Architecture and Implementation
ThunderAgent implements a lightweight middleware over standard LLM engines (vLLM, SGLang) and orchestrators (Docker, Kubernetes). The system comprises:
- Program Manager: Maintains states of active/paused programs, exposing program state and Pause/Restore/Terminate interfaces.
- Scheduler Workers: Deployed per DP node; interact with Program Manager and global paused queue.
- Global Queue Service: Concurrent in-memory priority queue for paused programs.
- Tool Resource Manager: Manages tool allocation, disposal, and prefetch.
Key data structure optimizations include:
- Per-token update of $\ell$ and $U$ via engine callbacks.
- Eviction heap sorted by $\ell$ for pause operations.
- Bounded thread pool for asynchronous tool prefetch to limit resource contention.
6. Performance Evaluation
ThunderAgent's efficacy was assessed across coding, routing, and scientific discovery agents:
| Metric | Improvement |
|---|---|
| Serving throughput vs vLLM | 1.48–3.58× |
| Serving throughput vs Continuum | 1.17–3.31× |
| Disk savings | up to 4.2× |
| RL rollout speedup | 1.79–3.92× |
Benchmarks included OpenHands and mini-SWEAgent on SWE-Bench-Lite, ToolOrchestra on HLE, and OpenHands on ScienceAgentBench. Serving experiments used up to 96 concurrent programs; distributed RL rollouts utilized two 8× H100 nodes.
- KV-cache hit rates: near 100% in deterministic scenarios; gracefully degraded under stochastic tool latencies.
- Throughput improvements were primarily due to reduction in KV thrashing, overlapping of environment setup, and HBM memory balancing.
7. Limitations and Prospective Directions
ThunderAgent’s scheduler hyperparameters (the interval $\Delta t$ and decay factor $\gamma$) require tuning, though performance was robust across a broad range in ablation studies. Heavy-tailed tool-execution-time distributions may cause the decay schedule to prematurely evict reusable caches; adaptive online tuning of $\gamma$ is a proposed future extension.
Integrating with advanced tiered or offloaded KV-cache systems (such as Pensieve, Strata) could further extend effective HBM but necessitates program-aware placement to avoid I/O bottlenecks. Open challenges include extending scheduling to heterogeneous GPU/TPU clusters and supporting robust multi-tenant isolation.
ThunderAgent’s program-aware management of LLM and tool resources offers a unified and high-throughput approach to agentic inference and RL rollout (Kang et al., 14 Feb 2026).