SGLang Runtime: Scalable LLM Inference
- SGLang Runtime is an execution environment and optimization framework that integrates a front-end DSL, graph-tracing compiler, and SGVM runtime for scalable LLM inference.
- It features innovative cache strategies like RadixAttention and disk-based LSM-tree storage, ensuring efficient key-value reuse and resource-managed batching.
- Adaptive control logic and parallel prompt primitives drive throughput gains, reduced first-token latency, and optimal resource allocation for multi-turn and retrieval-augmented tasks.
SGLang Runtime is an execution environment and optimization framework designed to support the efficient operation of structured LLM programs, emphasizing scalable key-value (KV) cache management, seamless parallelism, prompt-program abstraction, and hardware-efficient inference. The runtime integrates a front-end embedded domain-specific language (DSL), a graph-tracing compiler, and a runtime engine featuring novel cache reuse (RadixAttention), resource-managed batching, and production-grade disk-based LSM-tree KV storage for long-context and multi-process scenarios. It accelerates complex reasoning, agent, and retrieval-augmented systems by maximizing KV reuse, coordinating resource allocation, and dynamically adapting to evolving workload characteristics (Zheng et al., 2023, Yu et al., 20 Nov 2025).
1. Layered System Architecture and Workflow
SGLang runtime incorporates a tripartite architecture: front-end primitives, a compiler/tracer, and the SGVM runtime. The interface exposes Python-based prompt-programming primitives (gen, select, extend, fork, join, run), which are traced during execution to construct a program-specific dataflow graph. The graph executor launches a stream executor per prompt stream, dispatching operations to the underlying SGVM runtime.
The runtime coordinates three major subsystems:
- Inference Scheduler: Invokes put_batch, probe, get_batch on the KV store for prompt streams.
- Adaptive Controller: Monitors KV usage and tunes LSM-tree parameters (T: size ratio, K: run limit) using dynamic workload analysis.
- Prefix-Preserving Storage Engine: Implements an LSM-index for metadata, an append-only tensor-log for payloads, and exposes batch operations and resource-managed job scheduling.
This separation of concerns supports both main-memory RadixAttention (for hot prefixes) and disk-backed scalable caching (for persistent, high-capacity usage), integrating seamlessly with multi-turn, retrieval, and agent pipelines (Yu et al., 20 Nov 2025, Zheng et al., 2023).
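A minimal sketch of the scheduler-facing storage interface implied by these subsystems is shown below; the class, method, and type names are illustrative stand-ins rather than the actual SGLANG-LSM API:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

import torch


@dataclass
class ProbeResult:
    matched_len: int                 # tokens of the query prefix found in the store
    tensors: Optional[torch.Tensor]  # KV payload for the matched prefix, if any


class PrefixKVStore:
    """Illustrative interface for the prefix-preserving storage engine."""

    def put_batch(self, tokens: Sequence[Sequence[int]],
                  tensors: Sequence[torch.Tensor]) -> None:
        """Append KV tensors to the tensor-log, atomically update the LSM index."""
        raise NotImplementedError

    def probe(self, token_prefix: Sequence[int]) -> ProbeResult:
        """Search the LSM index for the longest stored prefix of token_prefix."""
        raise NotImplementedError

    def get_batch(self, token_ids: Sequence[int]) -> torch.Tensor:
        """Range-scan the index, cluster contiguous offsets, bulk-read tensors."""
        raise NotImplementedError
```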
2. Programming Primitives and Prompt Graphs
SGLang extends Python with prompt-centric DSL primitives facilitating asynchrony, parallelism, and control flow:
- gen(prompt): Asynchronous LLM call; scheduled with cache-aware batching to maximize KV prefix reuse.
- select(prompt, choices): Constrained logit selection.
- fork(prompt, k): Branches prompt stream k ways for parallel generation.
- join(prompts, merge_fn): Merges multiple streams into one via a user-supplied merge function.
- extend/+=, run, decorator: Support looped and conditional execution, embedding prompts as first-class citizens alongside structured I/O (JSON, tables).
- Graph IR: Trace-based compilation yields an execution graph (nodes: ConstantText, Gen, Select, Fork, Join; edges: data dependencies) guiding resource scheduling, code-movement, and prefetch optimization.
This abstraction enables compositional and parallel program structures (tree-of-thought, retrieval-processing, multi-agent plans) while hiding scheduler and memory details (Zheng et al., 2023).
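To make the abstraction concrete, the following sketch shows a fork/join program in the style of SGLang's published frontend examples; exact decorator and keyword spellings may differ across versions, and the local server endpoint is an assumption of this sketch:

```python
import sglang as sgl

# Assumes an SGLang server is running locally (hypothetical endpoint).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))


@sgl.function
def branching_answer(s, question):
    s += "Question: " + question + "\n"
    # select-style constrained decoding, expressed as gen with fixed choices
    s += "Difficulty: " + sgl.gen("difficulty", choices=["easy", "hard"]) + "\n"
    # fork the stream three ways; branches share the prompt prefix in the KV cache
    forks = s.fork(3)
    for i, f in enumerate(forks):
        f += f"Reasoning path {i + 1}: " + sgl.gen("thought", max_tokens=64, stop="\n")
    # join: fold the branch outputs back into the parent stream
    for i, f in enumerate(forks):
        s += f"Path {i + 1}: " + f["thought"] + "\n"
    s += "Final answer: " + sgl.gen("answer", max_tokens=32)


state = branching_answer.run(question="Is 1009 prime?")
print(state["answer"])
```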
3. RadixAttention and KV-Cache Reuse
RadixAttention is the central cache optimization within SGLang runtime (Zheng et al., 2023). It organizes GPU KV-cache pages within a radix tree indexed by token sequences:
- Radix Tree (CPU): Maps prefix substrings to GPU memory pool page pointers; maintains LRU timestamp and active reference count.
- GPU Memory Pool: Stores KV arrays per token, allowing cache reuse for shared prefixes.
- Process Algorithm: For each prompt request, matches the longest prefix in the radix tree; reuses pages for matched tokens; allocates new pages for unmatched suffix; evicts least-used leaves when space is needed.
Pseudocode:
```python
def process_request(x):
    # Find the longest cached prefix of token sequence x in radix tree T.
    node, l = T.match_prefix(x)           # O(|x|) token comparisons
    pages = node.kv_pages[:l]             # reuse KV pages of the matched prefix
    new_tokens = x[l:]                    # suffix with no cached entries
    allocated = P.alloc(len(new_tokens))  # reserve pages in the GPU pool
    if not allocated:
        # Evict least-recently-used, unreferenced leaves, then retry.
        T.evict_pages(len(new_tokens))
        allocated = P.alloc(len(new_tokens))
    # Compute KV for the new suffix into the reserved pages.
    kv_new = extend_kernel(pages, new_tokens, allocated)
    for i, tok in enumerate(new_tokens):
        node = T.insert_child(node, tok)  # grow the tree along the suffix
        node.kv_page = kv_new[i]
    return decode_output(kv_new)
```
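The pseudocode assumes a radix tree T supporting longest-prefix matching; a minimal sketch of the node structure and match routine follows, with hypothetical field names, storing one token per edge for simplicity (the production structure compresses token runs per node):

```python
import time
from typing import Optional


class RadixNode:
    def __init__(self) -> None:
        self.children: dict[int, "RadixNode"] = {}  # next token -> child node
        self.kv_page: Optional[int] = None          # pointer into the GPU page pool
        self.last_access: float = 0.0               # LRU timestamp used by eviction
        self.ref_count: int = 0                     # active requests pinning the node


def match_prefix(root: RadixNode, tokens: list[int]) -> tuple[RadixNode, int]:
    """Walk down from the root along `tokens`; return the deepest matching
    node and the number of matched tokens (the reusable prefix length)."""
    node, matched = root, 0
    while matched < len(tokens) and tokens[matched] in node.children:
        node = node.children[tokens[matched]]
        node.last_access = time.time()
        matched += 1
    return node, matched
```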
4. Scalable Disk-Based KV Cache: LSM-Tree Storage Engine
To support large-context and persistent caching, SGLang runtime incorporates SGLANG-LSM (Yu et al., 20 Nov 2025), a database-inspired prefix-preserving KV cache:
- LSM-Tree Index: Stores prefix-encoded token keys and metadata; sorted runs at each level merged via compaction; supports range scans for longest shared prefix detection.
- Tensor-Log: Large KV tensors stored in append-only disk files; no data rewrite during compaction.
- Batch Operations:
- put_batch(tokens[1…m], tensors[1…m]): Appends tensors, generates WriteBatch, atomically updates LSM.
- probe(token_prefix): Binary search in LSM for prefix; returns payload if found.
- get_batch(token_ids[1…k]): Range scan, cluster contiguous offsets, then bulk read tensors.
Under standard LSM-tree analysis with size ratio $T$, run limit $K$, $N$ entries, and block size $B$, writes cost amortized $O\!\left(\tfrac{T}{K}\cdot\tfrac{\log_T N}{B}\right)$ I/Os and a batch read of $k$ keys costs $O\!\left(K\log_T N + k/B\right)$ I/Os. The prefix-preserving key layout ensures tokens sharing prefixes are physically co-located, improving scan locality and probe hit-rate; the empirical hit-rate model is detailed in (Yu et al., 20 Nov 2025).
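One way to realize the prefix-preserving key layout is a fixed-width big-endian token encoding, so that byte-lexicographic order in the LSM index mirrors token-prefix order; this encoding is an illustration, not necessarily the paper's exact scheme:

```python
import bisect
import struct


def encode_key(tokens: list[int]) -> bytes:
    """Fixed-width big-endian token encoding: byte-lexicographic order on
    encoded keys coincides with prefix order on token sequences, so entries
    sharing a prefix land adjacently in the sorted runs."""
    return b"".join(struct.pack(">I", t) for t in tokens)


def longest_prefix_probe(sorted_keys: list[bytes], query: list[int]) -> int:
    """Return the longest l such that query[:l] is a stored key, probing
    progressively shorter encoded prefixes with binary search."""
    for l in range(len(query), 0, -1):
        key = encode_key(query[:l])
        i = bisect.bisect_left(sorted_keys, key)
        if i < len(sorted_keys) and sorted_keys[i] == key:
            return l
    return 0
```

Because the encoding of a prefix sorts immediately before all of its extensions, a range scan starting at an encoded prefix visits every stored continuation contiguously.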
5. Adaptive Control Logic and Resource Management
The adaptive controller within SGLang-LSM dynamically tunes the LSM-tree’s architectural parameters:
- Counters: Monitors writes ($W$), hits ($Q$), reads ($R$), and misses ($Z$) over a sliding window.
- Objective: Minimize the windowed I/O cost $W\,c_w(T,K) + Q\,c_q(T,K) + R\,c_r(T,K) + Z\,c_z(T,K)$, where each $c_{\ast}(T,K)$ is the per-operation cost under the current tree shape, subject to the structural constraint $1 \le K \le T-1$.
- Dynamic Reconfiguration: Periodically solves for the optimal $(T^{*}, K^{*})$ minimizing this cost (a sketch follows at the end of this section); applies changes lazily on flush or compaction boundaries to avoid full-tree rewrites.
- Runtime Services:
- Batch Codec/Compression: Compresses tensor batches for disk writes.
- Auto Tensor-File Merge: Merges small files in background when threshold exceeded.
- Job Scheduler: Dispatches compaction and merge jobs based on CPU/I/O availability.
Concurrency is safeguarded via atomic two-phase commit; crash recovery is handled via metadata-payload decoupling, ensuring unreachable payloads are garbage-collected (Yu et al., 20 Nov 2025).
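An illustrative form of the controller's periodic reconfiguration is a small grid search over $(T, K)$ driven by the windowed counters; the cost terms below are textbook leveled/tiered approximations, not the paper's exact model:

```python
import math


def reconfigure(W: int, Q: int, R: int, Z: int,
                N: int, B: int, T_range=range(2, 17),
                bloom_miss: float = 0.01) -> tuple[int, int]:
    """Grid-search (T, K) minimizing modeled I/O over the last window.

    W/Q/R/Z: windowed write, hit, read, and miss counters; N: entries in
    the tree; B: entries per disk block. Approximations: L = log_T(N/B)
    levels; amortized write cost scales with (T/K) * L / B; lookups probe
    up to K * L runs; misses are discounted by the Bloom-filter rate.
    """
    best, best_cost = (2, 1), math.inf
    for T in T_range:
        L = max(1.0, math.log(max(N / B, 2.0), T))  # number of levels
        for K in range(1, T):                       # run limit per level
            cost = (W * (T / K) * L / B             # merge (write) I/O
                    + (Q + R) * K * L               # runs probed on hits/reads
                    + Z * K * L * bloom_miss)       # filtered negative lookups
            if cost < best_cost:
                best, best_cost = (T, K), cost
    return best
```

As described above, the chosen $(T^{*}, K^{*})$ would be applied lazily at the next flush or compaction boundary rather than by rewriting the tree in place.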
6. Integration, Batching, and System-Level Optimizations
SGLang runtime is designed for integration with diverse LLM inference pipelines:
- Multi-turn/Chat: RadixAttention matches chat history prefixes for cache efficiency.
- Retrieval-Augmented Generation: Batch-prefill for document segments; join KV trees for parallel contexts.
- Few-shot/Reasoning: Fork/join primitives enable both prompt-level and agent-level parallelization.
- System-Level Optimizations:
- Graph Executor: Parallel stream execution with dependency-aware scheduling.
- Custom Kernels: Prefill, decode, and “extend” kernels efficiently operate on non-contiguous KV allocations using hardware-specific (Triton/CUDA) routines.
On hardware, SGLang’s memory pool supports hot prefix prefetch (CPU→GPU), with negligible (<1%) overhead for radix tree maintenance. Code-movement and prefetch insertion in the compiler can reduce first-token latency by up to 80% (Zheng et al., 2023).
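The cache-aware batching mentioned above can be sketched as ordering the waiting queue by matched prefix length, so that requests reusing the most cached KV pages are admitted first; this is a simplification of the runtime's actual policy:

```python
from dataclasses import dataclass


@dataclass
class Request:
    tokens: list[int]  # full prompt token ids (hypothetical request record)


def schedule_batch(waiting: list[Request], radix_tree, max_batch_tokens: int):
    """Greedy cache-aware batching: admit requests whose prompts have the
    longest cached prefix (fewest new KV pages), packing until the budget."""
    scored = []
    for req in waiting:
        _, matched = radix_tree.match_prefix(req.tokens)  # cached prefix length
        scored.append((len(req.tokens) - matched, req))   # new tokens needed
    scored.sort(key=lambda pair: pair[0])                 # most reuse first
    batch, budget = [], max_batch_tokens
    for new_tokens, req in scored:
        if new_tokens <= budget:
            batch.append(req)
            budget -= new_tokens
    return batch
```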
7. Performance and Empirical Evaluation
Empirical evaluation demonstrates substantial runtime improvements:
| Configuration | TTFT @4k (s) | Hit rate @4k (%) | TTFT @16k (s) | Hit rate @16k (%) |
|---|---|---|---|---|
| SGLang (memory) | 0.45 | 32.1 | 1.12 | 15.2 |
| SGLang (file) | 0.51 | 18.7 | 2.35 | 33.5 |
| SGLANG-LSM | 0.41 | 45.4 | 1.78 | 81.6 |
SGLANG-LSM achieves a 143% relative improvement in cache hit-rate (33.5% → 81.6%) and a 24% reduction in TTFT (2.35 s → 1.78 s) on 16k-token prompts versus state-of-the-art file-based systems. For agent and multi-chain reasoning tasks, SGLang runtime yields up to 6.4× throughput gains and an 80% first-token latency reduction through cache-aware scheduling and parallel graph execution (Yu et al., 20 Nov 2025, Zheng et al., 2023).
8. Context and Significance
SGLang runtime and its LSM-powered extension represent the first systematic application of database architectures (LSM trees, key-value separation, adaptive control) to large-scale LLM KV cache management (Yu et al., 20 Nov 2025). The co-design philosophy encompassing language abstractions, graph compilation, radix-based cache strategies, and disk-based, adaptively tuned storage positions SGLang as an enabling technology for high-throughput, multi-process, and long-context LLM applications across agent, retrieval, and reasoning domains (Zheng et al., 2023). A plausible implication is that hybrid in-memory/disk KV caching architectures will be increasingly critical for scaling LLM inference in low-latency and high-concurrency settings.