SGLang Runtime: Scalable LLM Inference

Updated 12 December 2025

SGLang Runtime is an execution environment and optimization framework that integrates a front-end DSL, graph-tracing compiler, and SGVM runtime for scalable LLM inference.
It features innovative cache strategies like RadixAttention and disk-based LSM-tree storage, ensuring efficient key-value reuse and resource-managed batching.
Adaptive control logic and parallel prompt primitives drive throughput gains, reduced first-token latency, and optimal resource allocation for multi-turn and retrieval-augmented tasks.

SGLang Runtime is an execution environment and optimization framework designed to support the efficient operation of structured LLM programs, emphasizing scalable key-value (KV) cache management, seamless parallelism, prompt program abstraction, and hardware-efficient inference. SGLang runtime integrates a front-end embedded domain-specific language (DSL), a graph-tracing compiler, and a runtime engine featuring novel cache reuse (RadixAttention), resource-managed batching, and production-grade disk-based LSM-tree KV storage for long-context and multi-process scenarios. The runtime accelerates complex reasoning, agent, and retrieval-augmented systems by maximizing KV reuse, coordinating resource allocation, and dynamically adapting to evolving workload characteristics (Zheng et al., 2023, &&&1&&&).

1. Layered System Architecture and Workflow

SGLang runtime incorporates a tripartite architecture: front-end primitives, compiler/tracer, and the SGVM runtime. The interface exposes Python-based prompt programming primitives (gen, select, extend, fork, join, run) to end-users, which are traced during execution to construct a program-specific dataflow graph. The graph executor launches stream executors for each prompt stream, dispatching operations to the underlying SGVM runtime.

The runtime coordinates three major subsystems:

Inference Scheduler: Invokes put_batch, probe, get_batch on the KV store for prompt streams.
Adaptive Controller: Monitors KV usage and tunes LSM-tree parameters (T: size ratio, K: run limit) using dynamic workload analysis.
Prefix-Preserving Storage Engine: Implements an LSM-index for metadata, an append-only tensor-log for payloads, and exposes batch operations and resource-managed job scheduling.

This separation of concerns supports both main-memory radix attention (for hot prefixes) and disk-backed scalable caching (for persistent and high-capacity usage), integrating seamlessly with multi-turn, retrieval, and agent pipelines (Yu et al., 20 Nov 2025, Zheng et al., 2023).

2. Programming Primitives and Prompt Graphs

SGLang extends Python with prompt-centric DSL primitives facilitating asynchrony, parallelism, and control flow:

gen(prompt): Asynchronous LLM call; scheduled with cache-aware batching to maximize KV prefix reuse.
select(prompt, choices): Constrained logit selection.
fork(prompt, k): Branches prompt stream k ways for parallel generation.
join(prompts, merge_fn): Merges multiple streams into one via a user-supplied merge function.
extend/+=, run, decorator: Support looped and conditional execution, embedding prompts as first-class citizens alongside structured I/O (JSON, tables).
Graph IR: Trace-based compilation yields an execution graph (nodes: ConstantText, Gen, Select, Fork, Join; edges: data dependencies) guiding resource scheduling, code-movement, and prefetch optimization.

This abstraction enables compositional and parallel program structures (tree-of-thought, retrieval-processing, multi-agent plans) while hiding scheduler and memory details (Zheng et al., 2023).

3. RadixAttention and KV-Cache Reuse

RadixAttention is the central cache optimization within SGLang runtime (Zheng et al., 2023). It organizes GPU KV-cache pages within a radix tree indexed by token sequences:

Radix Tree (CPU): Maps prefix substrings to GPU memory pool page pointers; maintains LRU timestamp and active reference count.
GPU Memory Pool: Stores KV arrays per token, allowing cache reuse for shared prefixes.
Process Algorithm: For each prompt request, matches the longest prefix in the radix tree; reuses pages for matched tokens; allocates new pages for unmatched suffix; evicts least-used leaves when space is needed.

Pseudocode:

function process_request(x):
    (node, ℓ) = T.match_prefix(x)  # O(m) substring comps
    pages = node.kv_pages[1:ℓ]
    new_tokens = x[ℓ+1:N]
    allocated = P.alloc(len(new_tokens))
    if not allocated:
        T.evict_pages(len(new_tokens))
        allocated = P.alloc(len(new_tokens))
    kv_new = extend_kernel(pages, new_tokens)
    for i, tok in enumerate(new_tokens):
        node = T.insert_child(node, tok)
        node.kv_page = kv_new[i]
    return decode_output(kv_new)

Computational efficiency is

O(N)

for prefix match,

O(1)

for page lookup, and memory cost is

O(M \times d)

for

M

tokens and dimension

d

. RadixAttention minimizes re-prefill compute: only unique tokens beyond the shared prefix are forwarded to GPU, yielding

4\times-5.6\times

speedups on reasoning and agent benchmarks (Zheng et al., 2023).

4. Scalable Disk-Based KV Cache: LSMTree Storage Engine

To support large-context and persistent caching, SGLang runtime incorporates SGLANG-LSM (Yu et al., 20 Nov 2025), a database-inspired prefix-preserving KV cache:

LSM-Tree Index: Stores prefix-encoded token keys and metadata; sorted runs at each level merged via compaction; supports range scans for longest shared prefix detection.
Tensor-Log: Large KV tensors stored in append-only disk files; no data rewrite during compaction.
Batch Operations:
- put_batch(tokens[1…m], tensors[1…m]): Appends tensors, generates WriteBatch, atomically updates LSM.
- probe(token_prefix): Binary search in LSM for prefix; returns payload if found.
- get_batch(token_ids[1…k]): Range scan, cluster contiguous offsets, then bulk read tensors.

Complexity is $O(m·\log N + \mathrm{size}_{\mathrm{bytes}}/B)$ for write, $O(\log N + k + \mathrm{size}_{\mathrm{read}}/B)$ for batch read. Prefix-preserving key layout ensures tokens sharing prefixes are physically co-located, improving scan and probe hit-rate. Empirical hit-rate model: $\mathrm{HitRate} \approx 1 - \exp(-\lambda S)$ (Yu et al., 20 Nov 2025).

5. Adaptive Control Logic and Resource Management

The adaptive controller within SGLang-LSM dynamically tunes the LSM-tree’s architectural parameters:

Counters: Monitors writes (W), hits (Q), reads (R), misses (Z) over sliding window.
Objective: $C(T,K) = w·W(T,K) + s·S(T,K) + r·R(T,K) + z·Z(T,K)$ , subject to $T ≥ 2, 1 ≤ K < T$ .
Dynamic Reconfiguration: Periodically solves for optimal $(T^*, K^*)$ minimizing cost; applies changes lazily on flush or compaction boundaries to avoid full-tree rewrite.
Runtime Services:
- Batch Codec/Compression: Compresses tensor batches for disk writes.
- Auto Tensor-File Merge: Merges small files in background when threshold $F_{\max}$ exceeded.
- Job Scheduler: Dispatches compaction and merge jobs based on CPU/I/O availability.

Concurrency is safeguarded via atomic two-phase commit; crash recovery is handled via metadata-payload decoupling, ensuring unreachable payloads are garbage-collected (Yu et al., 20 Nov 2025).

6. Integration, Batching, and System-Level Optimizations

SGLang runtime is designed for integration with diverse LLM inference pipelines:

Multi-turn/Chat: RadixAttention matches chat history prefixes for cache efficiency.
Retrieval-Augmented Generation: Batch-prefill for document segments; join KV trees for parallel contexts.
Few-shot/Reasoning: Fork/join primitives enable both prompt-level and agent-level parallelization.
System-Level Optimizations:
- Graph Executor: Parallel stream execution with dependency-aware scheduling.
- Custom Kernels: Prefill, decode, and “extend” kernels efficiently operate on non-contiguous KV allocations using hardware-specific (Triton/CUDA) routines.

On hardware, SGLang’s memory pool supports hot prefix prefetch (CPU→GPU), with negligible (<1%) overhead for radix tree maintenance. Code-movement and prefetch insertion in the compiler can reduce first-token latency by up to 80% (Zheng et al., 2023).

7. Performance and Empirical Evaluation

Empirical evaluation demonstrates substantial runtime improvements:

Configuration	TTFT (4k)	HitRate (%)	TTFT (16k)	HitRate (%)
SGLang(memory)	0.45 s	32.1 %	1.12 s	15.2 %
SGLang(file)	0.51 s	18.7 %	2.35 s	33.5 %
SGLANG-LSM	0.41 s	45.4 %	1.78 s	81.6 %

SGLANG-LSM achieves a 143% relative improvement in cache hit-rate (33.5→81.6%) and a 24% reduction in TTFT (2.35 s→1.78 s) on 16k token prompts versus state-of-the-art file-based systems. For agent and multi-chain reasoning tasks, SGLang runtime yields up to $5.6\times$ throughput gains and 80% first-token latency reduction through cache-aware scheduling and parallel graph execution (Yu et al., 20 Nov 2025, Zheng et al., 2023).

8. Context and Significance

SGLang runtime and its LSM-powered extension represent the first systematic application of database architectures (LSM trees, key-value separation, adaptive control) to large-scale LLM KV cache management (Yu et al., 20 Nov 2025). The co-design philosophy encompassing language abstractions, graph compilation, radix-based cache strategies, and disk-based, adaptively tuned storage positions SGLang as an enabling technology for high-throughput, multi-process, and long-context LLM applications across agent, retrieval, and reasoning domains (Zheng et al., 2023). A plausible implication is that hybrid in-memory/disk KV caching architectures will be increasingly critical for scaling LLM inference in low-latency and high-concurrency settings.

Markdown Report Issue Upgrade to Chat

References (2)

SGLang: Efficient Execution of Structured Language Model Programs (2023)

On 10x Better Scalability: KV Stores Scale Up KV Cache (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to SGLang Runtime.