
SGLang Runtime: Scalable LLM Inference

Updated 12 December 2025
  • SGLang Runtime is an execution environment and optimization framework that integrates a front-end DSL, graph-tracing compiler, and SGVM runtime for scalable LLM inference.
  • It features innovative cache strategies like RadixAttention and disk-based LSM-tree storage, ensuring efficient key-value reuse and resource-managed batching.
  • Adaptive control logic and parallel prompt primitives drive throughput gains, reduced first-token latency, and optimal resource allocation for multi-turn and retrieval-augmented tasks.

SGLang Runtime is an execution environment and optimization framework designed to support the efficient operation of structured LLM programs, emphasizing scalable key-value (KV) cache management, seamless parallelism, prompt program abstraction, and hardware-efficient inference. The SGLang runtime integrates a front-end embedded domain-specific language (DSL), a graph-tracing compiler, and a runtime engine featuring novel cache reuse (RadixAttention), resource-managed batching, and production-grade disk-based LSM-tree KV storage for long-context and multi-process scenarios. The runtime accelerates complex reasoning, agent, and retrieval-augmented systems by maximizing KV reuse, coordinating resource allocation, and dynamically adapting to evolving workload characteristics (Zheng et al., 2023; Yu et al., 20 Nov 2025).

1. Layered System Architecture and Workflow

SGLang runtime incorporates a tripartite architecture: front-end primitives, compiler/tracer, and the SGVM runtime. The interface exposes Python-based prompt programming primitives (gen, select, extend, fork, join, run) to end-users, which are traced during execution to construct a program-specific dataflow graph. The graph executor launches stream executors for each prompt stream, dispatching operations to the underlying SGVM runtime.

The runtime coordinates three major subsystems:

  • Inference Scheduler: Invokes put_batch, probe, get_batch on the KV store for prompt streams.
  • Adaptive Controller: Monitors KV usage and tunes LSM-tree parameters (T: size ratio, K: run limit) using dynamic workload analysis.
  • Prefix-Preserving Storage Engine: Implements an LSM-index for metadata, an append-only tensor-log for payloads, and exposes batch operations and resource-managed job scheduling.

This separation of concerns supports both main-memory radix attention (for hot prefixes) and disk-backed scalable caching (for persistent and high-capacity usage), integrating seamlessly with multi-turn, retrieval, and agent pipelines (Yu et al., 20 Nov 2025, Zheng et al., 2023).
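The scheduler's interaction with the KV store can be sketched in a few lines of Python. The class and method names below mirror the operations named above (put_batch, probe, get_batch) but are hypothetical stand-ins for illustration, not the actual SGLang interfaces:

```python
# Illustrative sketch of the scheduler / KV-store interaction described above.
# KVStore and InferenceScheduler are hypothetical stand-ins, not SGLang's API.

class KVStore:
    def __init__(self):
        self._store = {}                      # token-prefix -> KV payload

    def put_batch(self, prefixes, payloads):
        for p, v in zip(prefixes, payloads):
            self._store[p] = v

    def probe(self, prefix):
        return prefix in self._store          # cheap existence check before a read

    def get_batch(self, prefixes):
        return [self._store[p] for p in prefixes if p in self._store]


class InferenceScheduler:
    """Routes each prompt stream through probe -> get_batch -> put_batch."""

    def __init__(self, store):
        self.store = store

    def run_stream(self, prefix, compute_kv):
        if self.store.probe(prefix):          # cache hit: reuse stored KV
            return self.store.get_batch([prefix])[0]
        kv = compute_kv(prefix)               # cache miss: run the prefill
        self.store.put_batch([prefix], [kv])  # persist for later streams
        return kv
```

On a repeated prefix the second stream skips `compute_kv` entirely, which is the behavior the scheduler exploits for multi-turn and retrieval workloads.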

2. Programming Primitives and Prompt Graphs

SGLang extends Python with prompt-centric DSL primitives facilitating asynchrony, parallelism, and control flow:

  • gen(prompt): Asynchronous LLM call; scheduled with cache-aware batching to maximize KV prefix reuse.
  • select(prompt, choices): Constrained logit selection.
  • fork(prompt, k): Branches prompt stream k ways for parallel generation.
  • join(prompts, merge_fn): Merges multiple streams into one via a user-supplied merge function.
  • extend/+=, run, decorator: Support looped and conditional execution, embedding prompts as first-class citizens alongside structured I/O (JSON, tables).
  • Graph IR: Trace-based compilation yields an execution graph (nodes: ConstantText, Gen, Select, Fork, Join; edges: data dependencies) guiding resource scheduling, code-movement, and prefetch optimization.

This abstraction enables compositional and parallel program structures (tree-of-thought, retrieval-processing, multi-agent plans) while hiding scheduler and memory details (Zheng et al., 2023).
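The fork/join control flow can be illustrated with a minimal, self-contained toy (a string-concatenating stand-in for a real LLM backend; ToyStream and its methods mirror the primitives described above but are not SGLang's actual classes):

```python
# Toy model of the gen/fork/join prompt-stream semantics described above.
# ToyStream is a hypothetical stand-in: gen() appends a fake completion,
# fork() branches the stream k ways, join() merges with a user function.

class ToyStream:
    def __init__(self, text=""):
        self.text = text

    def gen(self, completion):
        # Stands in for an asynchronous LLM call; here it just appends text.
        return ToyStream(self.text + completion)

    def fork(self, k):
        # Branch the current prompt k ways for parallel exploration.
        return [ToyStream(self.text) for _ in range(k)]

def join(streams, merge_fn):
    # Merge parallel branches back into a single stream.
    return ToyStream(merge_fn([s.text for s in streams]))

# Tree-of-thought shape: fork three reasoning branches, then merge.
root = ToyStream("Q: 2+2? ").gen("Think step by step. ")
branches = [b.gen(f"path-{i} ") for i, b in enumerate(root.fork(3))]
merged = join(branches, lambda texts: min(texts, key=len))
```

Because the branches share the `root` prefix, a cache-aware runtime can serve all three forks from one KV prefill of the common prompt.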

3. RadixAttention and KV-Cache Reuse

RadixAttention is the central cache optimization within SGLang runtime (Zheng et al., 2023). It organizes GPU KV-cache pages within a radix tree indexed by token sequences:

  • Radix Tree (CPU): Maps prefix substrings to GPU memory pool page pointers; maintains LRU timestamp and active reference count.
  • GPU Memory Pool: Stores KV arrays per token, allowing cache reuse for shared prefixes.
  • Process Algorithm: For each prompt request, matches the longest prefix in the radix tree; reuses pages for matched tokens; allocates new pages for unmatched suffix; evicts least-used leaves when space is needed.

Pseudocode:

function process_request(x):
    (node, ℓ) = T.match_prefix(x)        # longest cached prefix, length ℓ
    pages = node.kv_pages[:ℓ]            # reuse pages for the matched tokens
    new_tokens = x[ℓ:]                   # only the unmatched suffix is prefilled
    allocated = P.alloc(len(new_tokens))
    if not allocated:
        T.evict_pages(len(new_tokens))   # evict least-recently-used leaves
        allocated = P.alloc(len(new_tokens))
    kv_new = extend_kernel(pages, new_tokens)  # write KV into allocated pages
    for i, tok in enumerate(new_tokens):
        node = T.insert_child(node, tok)
        node.kv_page = kv_new[i]
    return decode_output(kv_new)
Computational efficiency is O(N) for the prefix match in the prompt length, O(1) for page lookup, and memory cost is O(M × d) for M tokens and hidden dimension d. RadixAttention minimizes re-prefill compute: only tokens beyond the shared prefix are forwarded to the GPU, yielding 4×–5.6× speedups on reasoning and agent benchmarks (Zheng et al., 2023).
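The pseudocode above can be made concrete with a small runnable radix-style tree (one token per edge for simplicity; RadixCache and its fields are an illustrative sketch, not SGLang's implementation):

```python
import time

# Minimal radix-style cache sketch: one token per edge, LRU timestamps on
# nodes, kv_page standing in for a GPU memory-pool page pointer.

class Node:
    def __init__(self):
        self.children = {}        # token -> Node
        self.kv_page = None       # stand-in for a GPU page pointer
        self.last_used = 0.0      # LRU timestamp, as in the CPU-side tree

class RadixCache:
    """match_prefix / insert mirror the pseudocode above."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        node, pages = self.root, []
        for tok in tokens:
            if tok not in node.children:
                break                         # prefix diverges here
            node = node.children[tok]
            node.last_used = time.monotonic() # touch for LRU eviction
            pages.append(node.kv_page)
        return node, pages                    # deepest match + reusable pages

    def insert(self, node, tokens, new_pages):
        for tok, page in zip(tokens, new_pages):
            child = Node()
            child.kv_page = page
            node.children[tok] = child
            node = child
        return node

cache = RadixCache()
node, pages = cache.match_prefix([1, 2, 3])
assert pages == []                            # cold cache: nothing to reuse
cache.insert(node, [1, 2, 3], ["p1", "p2", "p3"])
_, pages = cache.match_prefix([1, 2, 9])      # shared prefix [1, 2] is reused
```

The second `match_prefix` returns the pages for tokens 1 and 2, so only token 9 would need a fresh prefill.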

4. Scalable Disk-Based KV Cache: LSMTree Storage Engine

To support large-context and persistent caching, SGLang runtime incorporates SGLANG-LSM (Yu et al., 20 Nov 2025), a database-inspired prefix-preserving KV cache:

  • LSM-Tree Index: Stores prefix-encoded token keys and metadata; sorted runs at each level merged via compaction; supports range scans for longest shared prefix detection.
  • Tensor-Log: Large KV tensors stored in append-only disk files; no data rewrite during compaction.
  • Batch Operations:
    • put_batch(tokens[1…m], tensors[1…m]): Appends tensors, generates WriteBatch, atomically updates LSM.
    • probe(token_prefix): Binary search in LSM for prefix; returns payload if found.
    • get_batch(token_ids[1…k]): Range scan, cluster contiguous offsets, then bulk read tensors.

Complexity is O(m·log N + size_bytes/B) for a batch write of m entries and O(log N + k + size_read/B) for a batch read of k keys. Prefix-preserving key layout ensures tokens sharing prefixes are physically co-located, improving scan locality and probe hit-rate. The empirical hit-rate model is HitRate ≈ 1 − exp(−λS) (Yu et al., 20 Nov 2025).
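A sorted-run probe over prefix-preserving keys can be sketched with Python's bisect module. This illustrates why co-locating shared prefixes turns longest-prefix detection into a local scan; the token-tuple key encoding here is an assumption for illustration, not the paper's exact on-disk format:

```python
import bisect

# Keys are token tuples; tuple comparison is lexicographic, so entries sharing
# a prefix are physically adjacent -- the property the LSM layout exploits.
run = sorted([
    ((1, 2), "kv_a"),
    ((1, 2, 3), "kv_b"),
    ((1, 2, 3, 4), "kv_c"),
    ((7, 8), "kv_d"),
])
keys = [k for k, _ in run]

def probe(prefix):
    """Binary search for the longest stored key that is a prefix of `prefix`."""
    i = bisect.bisect_right(keys, prefix)
    # Scan backwards over the (adjacent) candidates sharing leading tokens.
    for j in range(i - 1, -1, -1):
        k = keys[j]
        if prefix[:len(k)] == k:
            return run[j]
        if k[:1] != prefix[:1]:   # left the shared-prefix neighborhood
            break
    return None
```

For the query `(1, 2, 3, 9)` the binary search lands next to `(1, 2, 3, 4)`, and one step of backward scan finds the longest stored prefix `(1, 2, 3)`.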

5. Adaptive Control Logic and Resource Management

The adaptive controller within SGLang-LSM dynamically tunes the LSM-tree’s architectural parameters:

  • Counters: Monitors writes (W), hits (Q), reads (R), misses (Z) over sliding window.
  • Objective: C(T,K) = w·W(T,K) + s·S(T,K) + r·R(T,K) + z·Z(T,K), subject to T ≥ 2 and 1 ≤ K < T.
  • Dynamic Reconfiguration: Periodically solves for the optimal (T*, K*) minimizing this cost; applies changes lazily at flush or compaction boundaries to avoid a full-tree rewrite.
  • Runtime Services:
    • Batch Codec/Compression: Compresses tensor batches for disk writes.
    • Auto Tensor-File Merge: Merges small files in the background when the file-count threshold F_max is exceeded.
    • Job Scheduler: Dispatches compaction and merge jobs based on CPU/I/O availability.

Concurrency is safeguarded via atomic two-phase commit; crash recovery is handled via metadata-payload decoupling, ensuring unreachable payloads are garbage-collected (Yu et al., 20 Nov 2025).
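Since the constraint space T ≥ 2, 1 ≤ K < T is small, the periodic reconfiguration reduces to a simple constrained minimization. The grid-search sketch below illustrates the shape of the tuning step; the per-term cost functions are hypothetical placeholders, not the paper's workload-derived models:

```python
# Grid search for (T*, K*) minimizing C(T,K) = w*W + s*S + r*R + z*Z
# subject to T >= 2 and 1 <= K < T.  The W/S/R/Z models below are
# hypothetical placeholders standing in for workload-derived terms.

def write_cost(T, K):  return T / (K + 1)   # amortized merge work per write
def space_cost(T, K):  return K             # more runs -> more space amplification
def read_cost(T, K):   return K             # more runs -> more probes per read
def miss_cost(T, K):   return 1.0 / T       # larger size ratio -> fewer levels

def tune(w, s, r, z, T_max=16):
    best, best_cost = None, float("inf")
    for T in range(2, T_max + 1):
        for K in range(1, T):               # enforce 1 <= K < T
            c = (w * write_cost(T, K) + s * space_cost(T, K)
                 + r * read_cost(T, K) + z * miss_cost(T, K))
            if c < best_cost:
                best, best_cost = (T, K), c
    # In SGLang-LSM the chosen (T, K) is applied lazily at the next
    # flush/compaction boundary rather than via a full-tree rewrite.
    return best
```

With a read-heavy weighting (large r) the search collapses toward few runs per level (small K), matching the leveled-compaction intuition.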

6. Integration, Batching, and System-Level Optimizations

SGLang runtime is designed for integration with diverse LLM inference pipelines:

  • Multi-turn/Chat: RadixAttention matches chat history prefixes for cache efficiency.
  • Retrieval-Augmented Generation: Batch-prefill for document segments; join KV trees for parallel contexts.
  • Few-shot/Reasoning: Fork/join primitives enable both prompt-level and agent-level parallelization.
  • System-Level Optimizations:
    • Graph Executor: Parallel stream execution with dependency-aware scheduling.
    • Custom Kernels: Prefill, decode, and “extend” kernels efficiently operate on non-contiguous KV allocations using hardware-specific (Triton/CUDA) routines.

On hardware, SGLang’s memory pool supports hot prefix prefetch (CPU→GPU), with negligible (<1%) overhead for radix tree maintenance. Code-movement and prefetch insertion in the compiler can reduce first-token latency by up to 80% (Zheng et al., 2023).
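Dependency-aware parallel stream execution can be sketched as a topological wave schedule; the graph representation and executor below are simplified stand-ins for the described graph executor, not SGLang's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy dependency-aware executor: nodes whose prerequisites are all done run
# in the same parallel "wave", mirroring the graph executor's stream dispatch.
def execute_graph(nodes, deps, run_fn):
    """nodes: list of node ids; deps: id -> set of prerequisite ids."""
    done, results = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(nodes):
            ready = [n for n in nodes
                     if n not in done and deps.get(n, set()) <= done]
            if not ready:
                raise ValueError("cycle in prompt graph")
            for n, out in zip(ready, pool.map(run_fn, ready)):
                results[n] = out
            done.update(ready)
    return results

# Fork/join shape: gen -> (branch_a, branch_b) -> join
deps = {"gen": set(), "branch_a": {"gen"}, "branch_b": {"gen"},
        "join": {"branch_a", "branch_b"}}
order = execute_graph(list(deps), deps, lambda n: n.upper())
```

Here `branch_a` and `branch_b` run concurrently in the second wave, while `join` waits for both, which is the dependency discipline the prompt-graph scheduler enforces.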

7. Performance and Empirical Evaluation

Empirical evaluation demonstrates substantial runtime improvements:

Configuration      TTFT (4k)   Hit Rate (4k)   TTFT (16k)   Hit Rate (16k)
SGLang (memory)    0.45 s      32.1 %          1.12 s       15.2 %
SGLang (file)      0.51 s      18.7 %          2.35 s       33.5 %
SGLANG-LSM         0.41 s      45.4 %          1.78 s       81.6 %

SGLANG-LSM achieves a 143% relative improvement in cache hit-rate (33.5% → 81.6%) and a 24% reduction in TTFT (2.35 s → 1.78 s) on 16k-token prompts versus state-of-the-art file-based systems. For agent and multi-chain reasoning tasks, SGLang runtime yields up to 5.6× throughput gains and 80% first-token latency reduction through cache-aware scheduling and parallel graph execution (Yu et al., 20 Nov 2025; Zheng et al., 2023).

8. Context and Significance

SGLang runtime and its LSM-powered extension represent the first systematic application of database architectures (LSM trees, key-value separation, adaptive control) to large-scale LLM KV cache management (Yu et al., 20 Nov 2025). The co-design philosophy encompassing language abstractions, graph compilation, radix-based cache strategies, and disk-based, adaptively tuned storage positions SGLang as an enabling technology for high-throughput, multi-process, and long-context LLM applications across agent, retrieval, and reasoning domains (Zheng et al., 2023). A plausible implication is that hybrid in-memory/disk KV caching architectures will be increasingly critical for scaling LLM inference in low-latency and high-concurrency settings.
